in reply to comparing numbers from previous lines in a file?

The following code checks all fields by identifying the value least like the others in the most recent few valid lines (invalid lines are those with bad values). Bad values are reported along with their line number.

#!/usr/bin/perl
use warnings;
use strict;

my $window  = 3;
my @datCols = qw(temp1 press temp2 rain);
my @allCols = (@datCols, qw(line code));
my %limits  = (temp1 => 0.5, press => 10, temp2 => 0.5, rain => 0.2);
my @history;

while (<DATA>) {
    my %row;
    (my $code, @row{@datCols}) = split " ";
    $row{code} = $code;
    $row{line} = $.;

    if (@history > 2) {
        for my $col (@datCols) {
            # Find the value (in the window plus the current row) least like
            # the others, along with the average of the remaining values
            my ($outlier, $avg) = outlier($col, @history, \%row);
            next if abs($outlier->{$col} - $avg) < $limits{$col} * ($window - 1);
            print "$col == $outlier->{$col} (c.f. $avg) in line $outlier->{line}\n";
            $outlier->{$col} = undef;
        }
    }

    push @history, \%row;

    # Toss out any bad data rows
    @history = grep {
        my $row = $_;
        !grep {!defined $row->{$_}} @datCols
    } @history;
    shift @history if @history > 3;
}

sub outlier {
    my $field = shift @_;
    my @list = sort {$a->{$field} <=> $b->{$field}}
        grep {defined $_->{$field}} @_;
    my $sum;
    my $outlier;

    return if !@list;

    $sum += $_->{$field} for @list;
    $sum /= @list;

    # The outlier is whichever extreme lies furthest from the mean
    if (abs($sum - $list[0]->{$field}) > abs($sum - $list[-1]->{$field})) {
        $outlier = $list[0];
    }
    else {
        $outlier = $list[-1];
    }

    # Recompute the mean with the outlier excluded
    $sum = ($sum * @list - $outlier->{$field}) / (@list - 1);
    return $outlier, $sum;
}

__DATA__
A15 26.62 765 27.30 4.3
A11 26.63 763 27.28 4.2
A12 26.68 767 27.1 4.3
A16 26.64 768 27.0 4.2
A11 26.62 761 26.8 4.1
A15 26.62 765 26.6 4.3
A15 26.63 763 2.4 4.2
A16 26.68 767 26.1 4.3
A17 26.64 768 25.9 4.2
A18 26.62 761 25.7 4.1

Prints:

temp2 == 2.4 (c.f. 26.8) in line 7
True laziness is hard work

Re^2: comparing numbers from previous lines in a file?
by ww (Archbishop) on Nov 22, 2013 at 13:13 UTC
    ++ GrandFather for a much improved approach! And likewise to all who tackled this one.

    But -- I think -- even the averaging technique demonstrated is still sensitive to data variability (i.e., variances smaller than the extreme example in the OP, already critiqued in responses above) and to ordering. Consider outcomes like
    temp2 == 25.29 (c.f. 27.53) in line 1
    for data where one line's value is outside the OP's 0.5 C range but close to another line whose value departs from the OP's limit by an even smaller amount:

    __DATA__
    A16 26.68 767 25.29 4.3 # T is outside 0.5 allowed
    A16 26.68 767 28.01 4.3 # T is also outside the permitted range
    A15 26.62 765 27.30 4.3
    A11 26.63 763 27.28 4.2
    A12 26.68 767 27.29 4.3
    A16 26.64 768 27.30 4.2
    A11 26.62 761 27.31 4.1
    A15 26.62 765 27.30 4.3
    A15 26.63 763 27.28 4.2
    A17 26.64 768 27.30 4.2
    A18 26.62 761 27.31 4.1

    Further, assume data variance which is only marginally outside the spec:

    __DATA__
    A16 26.68 767 26.79 4.3
    A16 26.68 767 27.81 4.3
    A15 26.62 765 27.30 4.3
    A11 26.63 763 27.28 4.2
    A12 26.68 767 27.29 4.3
    A16 26.64 768 27.30 4.2
    A11 26.62 761 27.31 4.1
    A15 26.62 765 27.30 4.3
    A15 26.63 763 27.28 4.2
    A17 26.64 768 27.30 4.2
    A18 26.62 761 27.31 4.1

    In the case above, the code gives no indication that both line 1 and line 2 of the data depart from the OP's 27.3 C by more than 0.5.

    -----------------

    Not tested, but I suspect the OP's window = 3 is too narrow; for complete reliability (with data close to but outside the spec, and with certain orderings of the out-of-spec items) one might need to average over the whole data set, or at least over a window which takes in a majority of the items *1, rather than over a narrower window (a minimal, untested sketch of that whole-set averaging follows the footnote).

    *1 Even this -- I speculate -- would rely on the anomalous data being scattered through the set rather than grouped, and on there being few enough out-of-spec items to minimize the impact of the "bad data" on whatever range of items is used for the window.
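
    A minimal sketch of that whole-set averaging idea (untested, and only an illustration: the field layout and the 0.5 limit are taken from the OP's example, not from any code above):

    #!/usr/bin/perl
    use warnings;
    use strict;

    # Average temp2 (the 4th field) over the entire data set, then flag any
    # line whose temp2 departs from that mean by more than 0.5. A single bad
    # value still shifts the mean, but far less than within a 3-line window.
    my @rows;
    my $sum = 0;
    while (<DATA>) {
        my @fields = split " ";
        push @rows, [$., $fields[3]];
        $sum += $fields[3];
    }
    my $mean = $sum / @rows;
    for my $row (@rows) {
        my ($line, $temp2) = @$row;
        printf "temp2 == %s (c.f. %.2f) in line %d\n", $temp2, $mean, $line
            if abs($temp2 - $mean) > 0.5;
    }

    __DATA__
    A16 26.68 767 25.29 4.3
    A16 26.68 767 28.01 4.3
    A15 26.62 765 27.30 4.3
    A11 26.63 763 27.28 4.2
    A12 26.68 767 27.29 4.3

    With this data it flags both line 1 and line 2, which a 3-line window may miss depending on ordering.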

    --------

    Yes, there's lottsa' handwaving above -- but defining such terms as 'minimize', 'narrower', 'close-to-but...' and so forth would entail (for /me, anyway) hours of thought (hard) and testing (consuming time I don't have today). OTOH, I do hope the above isn't utterly worthless because of a blind spot I hit earlier.

      Hi Kenosis, GrandFather, tangent, kcott, and ww!

      Thank you so much for the warm welcome, as well as your invaluable insights. My apologies for not being more detailed about the problem in my previous message. Reading through your posts & insights made me realize that maybe I underestimated the "problem". :)

      So... what it boils down to is the following. I have some weather data (air temperature) transmitted wirelessly from point A to point B. The "problem" is that sometimes a character (or characters) is missing from a line (perhaps due to interference?), causing the numbers to differ from what they should be in reality.

      Hence, focusing only on field #4 for this example, the real data collected could look like Ex1 but might actually be received as Ex2 due to dropped characters:

      Ex1:

      A15 26.62 765 27.30 4.3
      A11 26.63 763 27.28 4.2
      A12 26.68 767 27.29 4.3
      A16 26.64 768 27.30 4.2
      A11 26.62 761 27.31 4.1
      A15 26.62 765 27.30 4.3
      A15 26.63 763 27.28 4.2
      A16 26.68 767 2.29 4.3
      A17 26.64 768 27.30 4.2
      A18 26.62 761 27.31 4.1

      Ex2:

      A15 26.62 765 2.30 4.3
      A11 26.63 763 27.8 4.2
      A12 26.68 767 27.29 4.3
      A16 26.64 768 27.30 4.2
      A11 26.62 761 7.31 4.1
      A15 26.62 765 27.30 4.3
      A15 26.63 763 27.28 4.2
      A16 26.68 767 2.29 4.3
      A17 26.64 768 27.30 4.2
      A18 26.62 761 27.31 4.1

      There are a variety of factors to tackle, including checking that each line has the correct number of fields and that each field has a value within a certain range of valid data (for example, in the case of temperature, this would go from 0 through 45; note that I've already written a small piece of code that takes care of this).
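
      A minimal sketch of those sanity checks, assuming whitespace-separated fields like the examples above; the valid ranges here are illustrative guesses (e.g. temperature 0 through 45), not the code already written:

      #!/usr/bin/perl
      use warnings;
      use strict;

      # Per-field [min, max] ranges; field 0 (the code, e.g. A15) has no range check
      my @ranges = (undef, [0, 45], [700, 800], [0, 45], [0, 20]);

      while (<DATA>) {
          my @fields = split " ";
          if (@fields != 5) {
              print "BAD field count in line $.: $_";
              next;
          }
          for my $i (1 .. $#fields) {
              my ($min, $max) = @{$ranges[$i]};
              print "BAD value '$fields[$i]' (field $i) in line $.\n"
                  if $fields[$i] < $min || $fields[$i] > $max;
          }
      }

      __DATA__
      A15 26.62 765 27.30 4.3
      A11 26.63 763 27.28
      A12 26.68 999 27.29 4.3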

      As GrandFather and others very insightfully mentioned, if I were to use the "solution" approach I proposed initially, one very real potential problem would be an anomalous data value at the start of the transmitted data, with not enough preceding lines to compare it against.

      One thing to consider is that in reality I have 30 lines of data, and typically fewer than 5 lines have character dropouts in them, if any at all (in my example I included only 10 for simplicity's sake). Therefore, I wonder whether taking an average, a frequency test, or something like that would do the trick.

      What do you guys think?

      Ahhh... as I was writing this, I just thought that perhaps one additional check would be to ensure that column #4 has exactly 2 decimal places! (since the data is always transmitted with 2 decimal places)

      PS: kcott... hehe, you are right. My apologies for my initial XY-problem approach :)

        Just fyi... I wrote a bit of code to take care of those values that do not have 2 decimal places:

        # my humble code snippet is:
        # for 4th column check number of decimals
        my @dec = split /\./, $cols[3];
        my $dec_length = defined $dec[1] ? length $dec[1] : 0;   # guard against values with no decimal part at all
        if ($dec_length != 2) {
            print "BAD decimal length - $x1\n";
        }

        After putting some more thought into this, I think I found a way to do it. I just take the average of the temperature column and assign some kind of cutoff value (e.g. temp_average - 5). Then, if a temp value falls below the cutoff (which it always will if a character is missing), that line is flagged.

        So far it seems to work well
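
        A rough sketch of that cutoff idea, assuming the temperature is the 4th whitespace-separated field and using the "average - 5" cutoff described above (both assumptions, not code from the thread):

        #!/usr/bin/perl
        use warnings;
        use strict;

        # First pass: collect lines and sum the temperature column
        my @lines;
        my $sum = 0;
        while (<DATA>) {
            chomp;
            my @cols = split " ";
            push @lines, [$_, $cols[3]];
            $sum += $cols[3];
        }

        # Flag any line whose temperature falls below (average - 5)
        my $cutoff = $sum / @lines - 5;
        for my $rec (@lines) {
            my ($line, $temp) = @$rec;
            printf "FLAGGED (temp %s below cutoff %.2f): %s\n", $temp, $cutoff, $line
                if $temp < $cutoff;
        }

        __DATA__
        A15 26.62 765 27.30 4.3
        A11 26.63 763 27.28 4.2
        A16 26.68 767 2.29 4.3
        A17 26.64 768 27.30 4.2

        Note that the bad values themselves drag the average down, so with many dropouts in one batch the cutoff would need to be chosen more carefully.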

        HOWEVER... here's the real head-scratcher: how would you deal with two columns that have dates and times?

        For instance, assume you have dates & times like these, which, when transmitted wirelessly, can suffer dropouts:

        A3 11/20/2013 8:19:56 26.62 26.69
        A4 11/20/2013 8:19:57 26.62 26.69
        A5 11/20/2013 8:19:58 26.62 26.69
        A7 11/20/2013 8:20:1 26.62 26.69
        A9 11/20/2013 8:20:4 26.62 26.69
        A10 11/20/2013 8:20:5 26.62 26.69
        A12 11/20/2013 8:20:8 26.62 26.69
        A13 11/20/2013 8:20:9 26.69 26.69
        A14 11/20/2013 8:20:10 26.62 26.69
        A16 11/20/2013 8:20:13 26.62 26.69
        A18 11/20/2013 8:20:16 26.62 26.69
        A23 11/20/2013 8:20:22 26.62 26.69

        But if you add "interference" that loses a character in the date & time fields, you would get:

        A3 11/20/2013 8:19:56 26.62 26.69
        A4 1/20/2013 8:19:57 26.62 26.69
        A5 11/0/2013 8:19:58 26.62 26.69
        A7 11/20/2013 8:20:1 26.62 26.69
        A9 11/2/2013 8:20:4 26.62 26.69
        A10 11/20/2013 8:20:5 26.62 26.69
        A12 11/20/2013 8:20:8 26.62 26.69
        A13 11/20/2013 8:0:9 26.69 26.69
        A14 11/20/2013 8:20:10 26.62 26.69
        A16 1/20/2013 8:2:13 26.62 26.69
        A18 11/20/2013 8:0:16 26.62 26.69
        A23 11/20/2013 8:20:22 26.62 26.69

        Based on this example, what would you suggest I could do to flag the bad dates & times?
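
        One possible check, sketched below, would be to validate the shape of the date and time fields, assuming the logger always emits a two-digit month, two-digit day, four-digit year, and two-digit minutes (as in the clean sample above), while seconds legitimately vary between one and two digits:

        #!/usr/bin/perl
        use warnings;
        use strict;

        # Flag date/time fields whose digit pattern no longer matches the
        # expected shape, which is what a dropped character produces.
        while (<DATA>) {
            my (undef, $date, $time) = split " ";
            print "BAD date in line $.: $_"
                unless $date =~ m{^\d{2}/\d{2}/\d{4}$};
            print "BAD time in line $.: $_"
                unless $time =~ m{^\d{1,2}:\d{2}:\d{1,2}$};
        }

        __DATA__
        A3 11/20/2013 8:19:56 26.62 26.69
        A4 1/20/2013 8:19:57 26.62 26.69
        A13 11/20/2013 8:0:9 26.69 26.69
        A14 11/20/2013 8:20:10 26.62 26.69

        If the assumption about zero-padded months, days and minutes does not hold, comparing each line's date and time against the previous line's (they should be identical or only a few seconds apart) would be a more robust alternative.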