in reply to Re^3: comparing numbers from previous lines in a file?
in thread comparing numbers from previous lines in a file?

Just fyi... I wrote a bit of code to take care of those values that do not have 2 decimal places:

# my humble code snippet is:
# for 4th column check number of decimals
my @dec = split("\\.", $cols[3]);
# a value with no decimal point leaves $dec[1] undefined, so default to ''
my $dec_length = length($dec[1] // '');
if ($dec_length != 2) {
    print "BAD decimal length - $x1\n";
}

After giving this some more thought, I think I found a way to do this. I take the average of the temperature column and then set a cutoff value (e.g.: temp_average - 5). If a temp value falls below the cutoff (which it always will if a character is missing), the line gets flagged.

So far it seems to work well.
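The cutoff idea could be sketched roughly as follows. This is only an illustration: the sub name `flag_below_cutoff`, the sample values, and the margin of 5 are assumptions, not the code actually in use.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: return the indices of the values that fall more
# than $margin below the mean of the whole column.
sub flag_below_cutoff {
    my ($values, $margin) = @_;
    return () unless @$values;
    my $cutoff = sum(@$values) / @$values - $margin;
    return grep { $values->[$_] < $cutoff } 0 .. $#{$values};
}

# Sample temperature column; 2.30 and 7.31 have each lost a leading digit.
my @temps  = (27.30, 2.30, 7.31, 27.29, 27.30, 27.28);
my $margin = 5;    # assumed tolerance, as in "temp_average - 5"

for my $i (flag_below_cutoff(\@temps, $margin)) {
    print "line ", $i + 1, ": suspect temperature $temps[$i]\n";
}
```

Note that the bad values themselves pull the average down, so a file with many dropouts would lower the cutoff; a median might be a safer baseline in that case.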

HOWEVER... here's the real head scratcher: how would you deal with two columns that hold dates and times?

For instance, assume that you have dates & times like these, which when transmitted wirelessly, you get potential dropouts:

A3 11/20/2013 8:19:56 26.62 26.69
A4 11/20/2013 8:19:57 26.62 26.69
A5 11/20/2013 8:19:58 26.62 26.69
A7 11/20/2013 8:20:1 26.62 26.69
A9 11/20/2013 8:20:4 26.62 26.69
A10 11/20/2013 8:20:5 26.62 26.69
A12 11/20/2013 8:20:8 26.62 26.69
A13 11/20/2013 8:20:9 26.69 26.69
A14 11/20/2013 8:20:10 26.62 26.69
A16 11/20/2013 8:20:13 26.62 26.69
A18 11/20/2013 8:20:16 26.62 26.69
A23 11/20/2013 8:20:22 26.62 26.69

But, if you add "interference" that "loses" a character for the date & time fields, you would get:

A3 11/20/2013 8:19:56 26.62 26.69
A4 1/20/2013 8:19:57 26.62 26.69
A5 11/0/2013 8:19:58 26.62 26.69
A7 11/20/2013 8:20:1 26.62 26.69
A9 11/2/2013 8:20:4 26.62 26.69
A10 11/20/2013 8:20:5 26.62 26.69
A12 11/20/2013 8:20:8 26.62 26.69
A13 11/20/2013 8:0:9 26.69 26.69
A14 11/20/2013 8:20:10 26.62 26.69
A16 1/20/2013 8:2:13 26.62 26.69
A18 11/20/2013 8:0:16 26.62 26.69
A23 11/20/2013 8:20:22 26.62 26.69

Based on this example, what would you suggest I could do to "flag" the bad dates & times?

Re^5: comparing numbers from previous lines in a file?
by ww (Archbishop) on Nov 23, 2013 at 03:45 UTC
    what would you suggest I could do to "flag" the bad dates & times?

    First, take note that the hours-minutes-seconds elements of lines 4-8 in the data sample are NOT formatted with standard HH:MM:SS notation. So, first, fix that.
    Then...

    1. The regex approach below was originally written as a way to attack your initial problem statement (the OP). Adapt it to check for valid dates and times (IN A STANDARD FORMAT!). Hint: Perl Cookbook and various nodes here will show you a method.

    #!/usr/bin/perl
    use 5.016;
    use warnings;
    use Data::Dumper;

    # 1063982 NB: Checks only T2 for range +\- 0.5C.

    my ( $cols, $ID, $T1, $Press, $T2, $LastItem);

    my @cols = ('A16 26.64 68 27.30 4.2',
                'A15 26.62 765 2.30 4.3',
                'A11 26.62 761 7.31 4.1',
                'A11 26.63 763 27.8 4.2',
                'A12 26.68 767 27.29 4.3',
                'A15 26.62 765 27.30 4.3',
                'A15 26.63 763 27.28 4.2',
                'A16 26.68 767 2.29 4.3',
                'A17 26.64 768 27.30 4.2',
                'A18 26.62 761 27.31 41',
                'A211 26.73 764 27.39 4.4',
                'A22 26.59 760 27.3 4.0',
                'A23 26.54 765 27.84 4.1',
               );

    for $cols(@cols) {
        if ( $cols =~ /^(A\d\d)\s(2\d\.\d\d)\s(7\d\d)\s(2\d\.\d\d)\s(\d\.\d)$/ ) {
            $ID = $1;
            $T1 = $2;
            $Press= $3;
            $T2 = $4;
            if ( $T2 < 26.80 ) {
                $T2 .= " out of range";
            } elsif ( $T2 >27.80 ) {
                $T2 .= " is out of range";
            }
            say "\$T2, $T2 in ID $ID\n";
            $LastItem = $5;
        } elsif ($cols =~ /^(A\d{2,2})\s.*/ ) {
            $ID = $1;
            say "In $ID, BAD VALUE(s) within $cols\n";
            $ID = '';
        } else {
            say "BAD VALUES somewhere in $cols\n";
        }
    }

    =head output (errors highlighted):

    In A16, BAD VALUE(s) within A16 26.64 68 27.30 4.2
                                          /\
    In A15, BAD VALUE(s) within A15 26.62 765 2.30 4.3
                                              /\
    In A11, BAD VALUE(s) within A11 26.62 761 7.31 4.1
                                              /\
    In A11, BAD VALUE(s) within A11 26.63 763 27.8 4.2
                                              /\
    $T2, 27.29 in ID A12

    $T2, 27.30 in ID A15

    $T2, 27.28 in ID A15

    In A16, BAD VALUE(s) within A16 26.68 767 2.29 4.3
                                              /\
    $T2, 27.30 in ID A17

    In A18, BAD VALUE(s) within A18 26.62 761 27.31 41
                                                    /\
    BAD VALUES somewhere in A211 26.73 764 27.39 4.4
                            /\
    In A22, BAD VALUE(s) within A22 26.59 760 27.3 4.0

    $T2, 27.84 is out of range in ID A23

    =cut
          or

    2. Compare each date and time to its neighbors perhaps by converting to epoch seconds and requiring that all 3 fall within 60*60*24*n of each other where n is the number of days on which you'll acquire data between each data reduction. Note, though, that this scheme will suffer the same failings as were previously discussed with respect to multiple, consecutive errors.
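The neighbor comparison in option 2 might look something like the sketch below. It is a minimal illustration under stated assumptions: the subs `to_epoch` and `flag_outliers` are hypothetical names, the rows are a subset of the sample with isolated errors only, timestamps are treated as UTC, and the 60-second window stands in for the 60*60*24*n figure because the sample readings are seconds apart.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper: convert "M/D/YYYY" and "H:M:S" strings to epoch
# seconds (treated as UTC). Returns undef if a field is not numeric or
# if timegm rejects a value such as day 0.
sub to_epoch {
    my ($date, $time) = @_;
    my ($mon, $day, $year) = $date =~ m{^(\d+)/(\d+)/(\d{4})$} or return undef;
    my ($h, $m, $s)        = $time =~ m{^(\d+):(\d+):(\d+)$}   or return undef;
    my $epoch = eval { timegm($s, $m, $h, $day, $mon - 1, $year) };
    return $epoch;    # undef when timegm croaked
}

# Flag rows whose timestamp is invalid, or lies further than $window
# seconds from every valid immediate neighbor.
sub flag_outliers {
    my ($epochs, $window) = @_;
    my @bad;
    for my $i (0 .. $#{$epochs}) {
        my $t = $epochs->[$i];
        if (!defined $t) { push @bad, $i; next; }
        my @near = grep { defined $_ }
                   map  { $epochs->[$_] }
                   grep { $_ >= 0 && $_ <= $#{$epochs} } ($i - 1, $i + 1);
        my $close = grep { abs($t - $_) <= $window } @near;
        push @bad, $i if @near && !$close;
    }
    return @bad;
}

# Rows adapted from the sample data, with one dropout each in A5, A10, A14.
my @rows = (
    [ 'A3',  '11/20/2013', '8:19:56' ],
    [ 'A4',  '11/20/2013', '8:19:57' ],
    [ 'A5',  '1/20/2013',  '8:19:58' ],
    [ 'A7',  '11/20/2013', '8:20:1'  ],
    [ 'A9',  '11/20/2013', '8:20:4'  ],
    [ 'A10', '11/20/2013', '8:0:5'   ],
    [ 'A12', '11/20/2013', '8:20:8'  ],
    [ 'A13', '11/20/2013', '8:20:9'  ],
    [ 'A14', '11/0/2013',  '8:20:10' ],
    [ 'A16', '11/20/2013', '8:20:13' ],
);

my $window = 60;    # assumed tolerance in seconds
my @epochs = map { to_epoch($_->[1], $_->[2]) } @rows;

for my $i (flag_outliers(\@epochs, $window)) {
    print "suspect timestamp in $rows[$i][0]: $rows[$i][1] $rows[$i][2]\n";
}
```

With these rows the sketch reports A5, A10 and A14. When two bad rows sit next to each other, a good row between or beside them can lose its only valid neighbor and be flagged too, which is the consecutive-error weakness mentioned above.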

    Note, also, that while Ln 19 is obviously an error, Ln 20 (aside from not being in MM/DD/YYYY which is itself a regionalism -- or, preferably YYYY/MM/DD) is merely suspect without knowing the allowed range of dates. So too is the date in Ln 26 (but there, the time is an obvious error).
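As a rough illustration of option 1 adapted to dates and times, a strict pattern plus range checks could look like this. It assumes the data has already been put into zero-padded MM/DD/YYYY and HH:MM:SS form; the sub name `check_stamp` and the sample rows are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical validator: returns a list of problems found in one
# date/time pair; an empty list means both fields look well formed.
sub check_stamp {
    my ($date, $time) = @_;
    my @problems;

    if (my ($mon, $day, $year) = $date =~ m{^(\d\d)/(\d\d)/(\d{4})$}) {
        push @problems, "month $mon out of range" if $mon < 1 || $mon > 12;
        push @problems, "day $day out of range"   if $day < 1 || $day > 31;
    }
    else {
        push @problems, "date '$date' is not MM/DD/YYYY";
    }

    if (my ($h, $m, $s) = $time =~ m{^(\d\d):(\d\d):(\d\d)$}) {
        push @problems, "hour $h out of range"   if $h > 23;
        push @problems, "minute $m out of range" if $m > 59;
        push @problems, "second $s out of range" if $s > 59;
    }
    else {
        push @problems, "time '$time' is not HH:MM:SS";
    }

    return @problems;
}

# Assumed zero-padded input; three rows have lost one character each.
my @rows = (
    [ '11/20/2013', '08:19:56' ],
    [ '1/20/2013',  '08:19:57' ],
    [ '11/0/2013',  '08:19:58' ],
    [ '11/20/2013', '08:0:09'  ],
);

for my $row (@rows) {
    my @problems = check_stamp(@$row);
    print "@$row: ", (@problems ? join('; ', @problems) : 'ok'), "\n";
}
```

Because every field has a fixed width, a single dropped character makes the pattern fail. The sketch does not check days per month (02/31 would pass), and it cannot tell whether a well-formed date falls inside the allowed range of dates.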

      Thanks ww for your suggestions.

      Your idea of turning the dates into timestamps was key, since this will allow me to compare the values much more easily than if they were formatted in calendar/time form.

      Also, I was thinking that if I can format the date and time values in the original data so that they explicitly follow the mm/dd/yyyy hh:mm:ss format, this would make things a bit easier on my end, because I could simply check that every value between the "/" and ":" separators is 2 digits long (except for the yyyy, of course). Anyway, just something that I'm thinking about :)
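A minimal sketch of that zero-padding step, assuming it can be applied to the original data before it is transmitted; the sub name `normalise_stamp` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical normaliser: zero-pad the fields of "M/D/YYYY" and "H:M:S".
# Returns undef unless both inputs have exactly three numeric parts.
sub normalise_stamp {
    my ($date, $time) = @_;
    my @d = $date =~ m{^(\d{1,2})/(\d{1,2})/(\d{4})$}    or return undef;
    my @t = $time =~ m{^(\d{1,2}):(\d{1,2}):(\d{1,2})$}  or return undef;
    return sprintf '%02d/%02d/%04d %02d:%02d:%02d', @d, @t;
}

print normalise_stamp('11/20/2013', '8:20:1'), "\n";   # 11/20/2013 08:20:01
```

The padding only helps if it happens at the source. Applied to data that has already been received, it would turn a damaged 1/20/2013 into a clean-looking 01/20/2013 and hide the dropout.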

      Thank you again for your incredibly valuable guidance on this!!