in reply to Re: comparing numbers from previous lines in a file?
in thread comparing numbers from previous lines in a file?

++ GrandFather for a much improved approach! And likewise to all who tackled this one.

But -- I think -- even the averaging technique demonstrated is still subject to data variability (i.e., variances smaller than the extreme example in the OP, already critiqued in the responses above) and to ordering. Consider outcomes like
temp2 == 25.29 (cf. 27.53) in line 1
for data where one line's variation is outside the OP's 0.5C range but close to that of another line whose data departs from the OP's limit by an even smaller amount:

__DATA__
A16 26.68 767 25.29 4.3  # T is outside the 0.5 allowed
A16 26.68 767 28.01 4.3  # T is also outside the permitted range
A15 26.62 765 27.30 4.3
A11 26.63 763 27.28 4.2
A12 26.68 767 27.29 4.3
A16 26.64 768 27.30 4.2
A11 26.62 761 27.31 4.1
A15 26.62 765 27.30 4.3
A15 26.63 763 27.28 4.2
A17 26.64 768 27.30 4.2
A18 26.62 761 27.31 4.1

Further, assume data variance which is only marginally outside the spec:

__DATA__
A16 26.68 767 26.79 4.3
A16 26.68 767 27.81 4.3
A15 26.62 765 27.30 4.3
A11 26.63 763 27.28 4.2
A12 26.68 767 27.29 4.3
A16 26.64 768 27.30 4.2
A11 26.62 761 27.31 4.1
A15 26.62 765 27.30 4.3
A15 26.63 763 27.28 4.2
A17 26.64 768 27.30 4.2
A18 26.62 761 27.31 4.1

In the case above, the code gives no indication that both Ln 1 and Ln 2 of the data depart from the OP's 27.3C by more than 0.5.

-----------------

Not tested, but I suspect the OP's window = 3 is too narrow; for complete reliability (with data close-to-but-outside-the-spec, and for certain orderings of the non-spec items in the overall data), one might need to average over the overall data, or at least over a window which takes in a majority of the items *1, rather than over a narrower window; a sketch follows the footnote.

*1 Even this -- I speculate -- would rely on finding the anomalous data scattered thru the set, rather than grouped and having few enough out-of-spec items as to minimize the impact of the "bad data" on whatever range of items is used for the window.
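To make the whole-data averaging concrete, here's a quick sketch run against the second data set above (field positions as in the OP's samples; the 0.5 threshold is the OP's spec):

#!/usr/bin/perl
# Sketch: average field 4 over ALL lines, then flag any line whose
# field-4 value departs from that average by more than 0.5.
use strict;
use warnings;

my ( @lines, @temps );
while (<DATA>) {
    chomp;
    push @lines, $_;
    push @temps, (split)[3];    # field 4: the second temperature
}
my $sum = 0;
$sum += $_ for @temps;
my $avg = $sum / @temps;

for my $i ( 0 .. $#temps ) {
    printf "Ln %d departs the %.2f average by more than 0.5: %s\n",
        $i + 1, $avg, $lines[$i]
        if abs( $temps[$i] - $avg ) > 0.5;
}

__DATA__
A16 26.68 767 26.79 4.3
A16 26.68 767 27.81 4.3
A15 26.62 765 27.30 4.3
A11 26.63 763 27.28 4.2
A12 26.68 767 27.29 4.3
A16 26.64 768 27.30 4.2
A11 26.62 761 27.31 4.1
A15 26.62 765 27.30 4.3
A15 26.63 763 27.28 4.2
A17 26.64 768 27.30 4.2
A18 26.62 761 27.31 4.1

This flags both Ln 1 and Ln 2; whether a 3-line window catches both depends on where the bad lines fall in the data.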

--------

Yes, there's lottsa' handwaving above -- but defining such terms as 'minimize', 'narrower', 'close-to-but...' and so forth would entail (for /me, anyway) hours of thought (hard) and testing (consuming time I don't have today). OTOH, I do hope the above isn't utterly worthless because I hit a blind spot earlier.

Re^3: comparing numbers from previous lines in a file?
by coding1227 (Novice) on Nov 22, 2013 at 19:53 UTC

    Hi Kenosis, GrandFather, tangent, kcott, and ww!

    thank you so much for the warm welcome as well as your invaluable insights. Indeed, my apologies for not being more detailed about the problem in my previous message. Reading through your posts & insights made me realize that maybe I underestimated the "problem". :)

    So... what it boils down to is the following. I have some weather data (air temperature) transmitted wirelessly from point A to point B. The "problem" is that sometimes a character (or characters) is missing from a line (perhaps due to interference?), causing the numbers to read differently from what they should be in reality.

    Hence, focusing only on field #4 for this example, the real data collected could look like Ex1 below but might actually be received as Ex2 due to dropped characters:

    Ex1:

    A15 26.62 765 27.30 4.3
    A11 26.63 763 27.28 4.2
    A12 26.68 767 27.29 4.3
    A16 26.64 768 27.30 4.2
    A11 26.62 761 27.31 4.1
    A15 26.62 765 27.30 4.3
    A15 26.63 763 27.28 4.2
    A16 26.68 767 2.29 4.3
    A17 26.64 768 27.30 4.2
    A18 26.62 761 27.31 4.1

    Ex2:

    A15 26.62 765 2.30 4.3
    A11 26.63 763 27.8 4.2
    A12 26.68 767 27.29 4.3
    A16 26.64 768 27.30 4.2
    A11 26.62 761 7.31 4.1
    A15 26.62 765 27.30 4.3
    A15 26.63 763 27.28 4.2
    A16 26.68 767 2.29 4.3
    A17 26.64 768 27.30 4.2
    A18 26.62 761 27.31 4.1

    There are a variety of factors to tackle, including checking that each line has the correct number of fields, and that each field has a value within a certain range of valid data (for example, in the case of temperature, this would go from 0 through 45; note that I've already written a small piece of code that takes care of this).
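    For illustration, a minimal sketch of those two checks (the 5-field layout is taken from Ex1/Ex2 above; the truncated line in __DATA__ is a made-up example):

    use strict;
    use warnings;

    while ( my $line = <DATA> ) {
        chomp $line;
        my @cols = split ' ', $line;

        # check 1: correct number of fields
        if ( @cols != 5 ) {
            print "BAD field count - $line\n";
            next;
        }

        # check 2: temperature (field 4) within the valid 0-45 range
        # NB: a dropout like "2.29" still passes this check, hence the
        # averaging discussion elsewhere in the thread.
        my $temp = $cols[3];
        print "BAD temperature range - $line\n"
            if $temp < 0 or $temp > 45;
    }

    __DATA__
    A15 26.62 765 27.30 4.3
    A16 26.68 767 2.29 4.3
    A12 26.68 767 27.29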

    As GrandFather and others very insightfully mentioned, if I were to use the "solution" approach that I proposed initially, one very real potential problem would be having an anomalous data value at the start of the transmitted data, without enough prior lines to compare it against.

    One thing to consider is that in reality I have 30 lines of data, and typically fewer than 5 lines have character dropouts in them (if any at all); in my example I just included 10 for simplicity's sake. Therefore, I wonder whether taking an average, a frequency test, or something like that would do the trick.

    What do you guys think?

    Ahhh... as I was writing this I just thought that perhaps, one additional check would be to ensure that column #4 has a total of 2 decimal places! (since the data is always transmitted with 2 decimal places)

    PS: kcott... hehe you are right. My apologies for my initial XY approach :)

      Just fyi... I wrote a bit of code to take care of those values that do not have 2 decimal places:

      # my humble code snippet is:
      # for the 4th column, check the number of decimal places
      my @dec = split /\./, $cols[3];
      my $dec_length = defined $dec[1] ? length $dec[1] : 0;    # guard against a missing decimal point
      if ($dec_length != 2) {
          print "BAD decimal length - $x1\n";    # $x1: the current line, set elsewhere in the script
      }

      After putting some more thought into this, I think I found a way to do it. I just take the average of the temperature column, and then assign some kind of cutoff value (e.g. temp_average - 5). Then, if a temp value falls below the cutoff (which it always will if a character is missing from the number), that line gets flagged.

      So far it seems to work well.
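      A condensed sketch of that check (field layout as in the earlier samples; the cutoff is the temp_average - 5 suggested above):

      use strict;
      use warnings;

      my @lines = <DATA>;
      chomp @lines;
      my @temps = map { (split)[3] } @lines;    # field 4: temperature

      my $sum = 0;
      $sum += $_ for @temps;
      my $cutoff = $sum / @temps - 5;    # "temp_average - 5"

      # NB: this catches values a dropout made smaller; a lost decimal
      # point (e.g. "2730") would need a separate check.
      for my $i ( 0 .. $#lines ) {
          print "FLAGGED (below cutoff): $lines[$i]\n"
              if $temps[$i] < $cutoff;
      }

      __DATA__
      A15 26.62 765 2.30 4.3
      A11 26.63 763 27.28 4.2
      A12 26.68 767 27.29 4.3
      A16 26.64 768 27.30 4.2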

      HOWEVER... here's the real head-scratcher: how would you deal with two columns that hold dates and times?

      For instance, assume that you have dates & times like these which, when transmitted wirelessly, suffer potential dropouts:

      A3 11/20/2013 8:19:56 26.62 26.69
      A4 11/20/2013 8:19:57 26.62 26.69
      A5 11/20/2013 8:19:58 26.62 26.69
      A7 11/20/2013 8:20:1 26.62 26.69
      A9 11/20/2013 8:20:4 26.62 26.69
      A10 11/20/2013 8:20:5 26.62 26.69
      A12 11/20/2013 8:20:8 26.62 26.69
      A13 11/20/2013 8:20:9 26.69 26.69
      A14 11/20/2013 8:20:10 26.62 26.69
      A16 11/20/2013 8:20:13 26.62 26.69
      A18 11/20/2013 8:20:16 26.62 26.69
      A23 11/20/2013 8:20:22 26.62 26.69

      But, if you add "interference" that "loses" a character from the date & time fields, you would get:

      A3 11/20/2013 8:19:56 26.62 26.69
      A4 1/20/2013 8:19:57 26.62 26.69
      A5 11/0/2013 8:19:58 26.62 26.69
      A7 11/20/2013 8:20:1 26.62 26.69
      A9 11/2/2013 8:20:4 26.62 26.69
      A10 11/20/2013 8:20:5 26.62 26.69
      A12 11/20/2013 8:20:8 26.62 26.69
      A13 11/20/2013 8:0:9 26.69 26.69
      A14 11/20/2013 8:20:10 26.62 26.69
      A16 1/20/2013 8:2:13 26.62 26.69
      A18 11/20/2013 8:0:16 26.62 26.69
      A23 11/20/2013 8:20:22 26.62 26.69

      Based on this example, what would you suggest I could do to "flag" the bad dates & times?
        what would you suggest I could do to "flag" the bad dates & times?

        First, take note that the hours-minutes-seconds elements of lines 4-8 in the data sample are NOT formatted in standard HH:MM:SS notation. So fix that first; a zero-padding sketch follows.
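        For instance (a minimal sketch; the unpadded input value is hypothetical):

        # Normalize a possibly unpadded H:M:S time to HH:MM:SS.
        use strict;
        use warnings;

        my $time   = '8:20:1';    # hypothetical unpadded input
        my $hhmmss = sprintf '%02d:%02d:%02d', split /:/, $time;
        print "$hhmmss\n";        # prints 08:20:01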
        Then...

        1. The regex approach below was originally written as a way to attack your initial problem statement (the OP). Adapt it to check for valid dates and times (IN A STANDARD FORMAT!). Hint: the Perl Cookbook and various nodes here will show you a method.

        #!/usr/bin/perl
        use 5.016;
        use warnings;

        # 1063982 NB: checks only T2 for range +/- 0.5C.

        my ( $cols, $ID, $T1, $Press, $T2, $LastItem );

        my @cols = (
            'A16 26.64 68 27.30 4.2',
            'A15 26.62 765 2.30 4.3',
            'A11 26.62 761 7.31 4.1',
            'A11 26.63 763 27.8 4.2',
            'A12 26.68 767 27.29 4.3',
            'A15 26.62 765 27.30 4.3',
            'A15 26.63 763 27.28 4.2',
            'A16 26.68 767 2.29 4.3',
            'A17 26.64 768 27.30 4.2',
            'A18 26.62 761 27.31 41',
            'A211 26.73 764 27.39 4.4',
            'A22 26.59 760 27.3 4.0',
            'A23 26.54 765 27.84 4.1',
        );

        for $cols (@cols) {
            if ( $cols =~ /^(A\d\d)\s(2\d\.\d\d)\s(7\d\d)\s(2\d\.\d\d)\s(\d\.\d)$/ ) {
                $ID    = $1;
                $T1    = $2;
                $Press = $3;
                $T2    = $4;
                if ( $T2 < 26.80 ) {
                    $T2 .= " is out of range";
                }
                elsif ( $T2 > 27.80 ) {
                    $T2 .= " is out of range";
                }
                say "\$T2, $T2 in ID $ID\n";
                $LastItem = $5;
            }
            elsif ( $cols =~ /^(A\d{2})\s/ ) {
                $ID = $1;
                say "In $ID, BAD VALUE(s) within $cols\n";
                $ID = '';
            }
            else {
                say "BAD VALUES somewhere in $cols\n";
            }
        }

        =head1 output (errors marked with /\)

        In A16, BAD VALUE(s) within A16 26.64 68 27.30 4.2
                                              /\
        In A15, BAD VALUE(s) within A15 26.62 765 2.30 4.3
                                                  /\
        In A11, BAD VALUE(s) within A11 26.62 761 7.31 4.1
                                                  /\
        In A11, BAD VALUE(s) within A11 26.63 763 27.8 4.2
                                                  /\
        $T2, 27.29 in ID A12
        $T2, 27.30 in ID A15
        $T2, 27.28 in ID A15
        In A16, BAD VALUE(s) within A16 26.68 767 2.29 4.3
                                                  /\
        $T2, 27.30 in ID A17
        In A18, BAD VALUE(s) within A18 26.62 761 27.31 41
                                                        /\
        BAD VALUES somewhere in A211 26.73 764 27.39 4.4
                                /\
        In A22, BAD VALUE(s) within A22 26.59 760 27.3 4.0
        $T2, 27.84 is out of range in ID A23

        =cut
              or

        2. Compare each date and time to its neighbors, perhaps by converting to epoch seconds and requiring that all 3 fall within 60*60*24*n seconds of each other, where n is the number of days on which you'll acquire data between each data reduction. Note, though, that this scheme will suffer the same failings as were previously discussed with respect to multiple, consecutive errors.
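        Here's a rough sketch of that conversion (Time::Local is core; this simplification compares consecutive pairs rather than all 3 neighbors, and n = 1 is purely illustrative):

        use strict;
        use warnings;
        use Time::Local 'timelocal';

        my $n       = 1;                    # days between data reductions (illustrative)
        my $max_gap = 60 * 60 * 24 * $n;

        my ( @lines, @epochs );
        while ( my $line = <DATA> ) {
            chomp $line;
            push @lines, $line;
            my ( undef, $date, $time ) = split ' ', $line;
            my ( $mon, $mday, $year )  = split '/', $date;
            my ( $h,   $min,  $sec )   = split ':', $time;

            # timelocal() croaks on impossible dates (e.g. day 0), so trap that:
            my $epoch = eval { timelocal( $sec, $min, $h, $mday, $mon - 1, $year ) };
            print "BAD date/time - $line\n" unless defined $epoch;
            push @epochs, $epoch;
        }

        for my $i ( 1 .. $#epochs ) {
            next unless defined $epochs[$i] and defined $epochs[ $i - 1 ];
            print "SUSPECT stamp - $lines[$i]\n"
                if abs( $epochs[$i] - $epochs[ $i - 1 ] ) > $max_gap;
        }

        __DATA__
        A3 11/20/2013 8:19:56 26.62 26.69
        A4 1/20/2013 8:19:57 26.62 26.69
        A5 11/0/2013 8:19:58 26.62 26.69
        A7 11/20/2013 8:20:1 26.62 26.69

        Here the day-0 date dies inside timelocal() and is reported as BAD, while the 1/20/2013 line is merely SUSPECT because it lands some 300 days away from its neighbor -- matching the distinction drawn below.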

        Note, also, that while a date like '11/0/2013' in the corrupted sample is obviously an error, a date like '1/20/2013' (aside from not being in zero-padded MM/DD/YYYY, which is itself a regionalism -- YYYY/MM/DD would be preferable) is merely suspect without knowing the allowed range of dates. So too is the date in 'A16 1/20/2013 8:2:13 ...' (but there, the time is an obvious error).