in reply to Re^2: comparing numbers from previous lines in a file?
in thread comparing numbers from previous lines in a file?

Hi Kenosis, GrandFather, tangent, kcott, and ww!

Thank you so much for the warm welcome, as well as for your invaluable insights. My apologies for not being more detailed about the problem in my previous message; reading through your posts & insights made me realize that I may have underestimated the "problem". :)

So... what it boils down to is the following. I have some weather data (air temperature) transmitted wirelessly from point A to point B. The "problem" is that sometimes a character (or characters) is missing from a line (perhaps due to interference?), causing the numbers to differ from what they should be in reality.

Hence, focusing only on field #4 for this example, the real data collected could look like Ex1 below, but might actually be received as Ex2 due to dropped characters:

Ex1:

A15 26.62 765 27.30 4.3
A11 26.63 763 27.28 4.2
A12 26.68 767 27.29 4.3
A16 26.64 768 27.30 4.2
A11 26.62 761 27.31 4.1
A15 26.62 765 27.30 4.3
A15 26.63 763 27.28 4.2
A16 26.68 767 2.29 4.3
A17 26.64 768 27.30 4.2
A18 26.62 761 27.31 4.1

Ex2:

A15 26.62 765 2.30 4.3
A11 26.63 763 27.8 4.2
A12 26.68 767 27.29 4.3
A16 26.64 768 27.30 4.2
A11 26.62 761 7.31 4.1
A15 26.62 765 27.30 4.3
A15 26.63 763 27.28 4.2
A16 26.68 767 2.29 4.3
A17 26.64 768 27.30 4.2
A18 26.62 761 27.31 4.1

There are a variety of factors to tackle, including checking that each line has the correct number of fields, and that each field has a value within a valid range (for example, in the case of temperature, from 0 through 45; note that I've already written a small piece of code that takes care of this).
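
For what it's worth, a minimal sketch of those two sanity checks (assuming five space-separated fields per line, with the temperature in field #4 and a 0-45 valid range; the sample lines and messages below are illustrative, not my actual code):

```perl
#!/usr/bin/perl
# Sketch of the per-line sanity checks described above: correct field
# count, and a temperature (field #4) within the 0-45 range.
use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;
    my @cols = split ' ', $line;
    if (@cols != 5) {
        print "BAD field count: $line\n";
        next;    # don't index into fields that may be missing
    }
    my $temp = $cols[3];
    if ($temp !~ /^\d+\.\d+$/ || $temp > 45) {
        print "BAD temperature: $line\n";
    }
}

__DATA__
A15 26.62 765 27.30 4.3
A16 26.68 767 2.29
A11 26.62 761 77.31 4.1
```

Of course, this cannot catch a dropout like 2.29 that still lands inside the valid range, which is exactly the remaining problem.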

As GrandFather and others very insightfully mentioned, if I were to use the "solution" approach I proposed initially, one very real potential problem would be an anomalous data value at the start of the transmitted data, with not enough preceding lines to compare it against.

One thing to consider is that in reality I have 30 lines of data, and typically fewer than 5 lines have character dropouts in them, if any at all (in my example I included only 10 lines for simplicity's sake). Therefore, I wonder whether taking an average, a frequency test, or something along those lines would do the trick.

What do you guys think?

Ahhh... as I was writing this, I just thought of one additional check: ensure that field #4 has exactly 2 decimal places (since the data is always transmitted with 2 decimal places)!

PS: kcott... hehe, you are right. My apologies for my initial XY approach :)


Re^4: comparing numbers from previous lines in a file?
by coding1227 (Novice) on Nov 22, 2013 at 20:26 UTC

    Just FYI... I wrote a bit of code to take care of those values that do not have 2 decimal places:

    # my humble code snippet is:
    # for the 4th column, check the number of decimal places
    my @dec = split(/\./, $cols[3]);
    my $dec_length = defined $dec[1] ? length($dec[1]) : 0;
    if ($dec_length != 2) {
        print "BAD decimal length - $x1\n";
    }

    After putting some more thought into this, I think I found a way to do it. I just take the average of the temperature column and assign a cutoff value (e.g.: temp_average - 5). Then, if a temperature value falls below the cutoff (which it always will when a leading character is missing), that line gets flagged.

    So far it seems to work well
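
    In case it helps anyone, a sketch of that average-plus-cutoff idea (the field position and the cutoff of 5 follow the description above; the sample lines are illustrative):

```perl
#!/usr/bin/perl
# Sketch of the average + cutoff idea: compute the mean of the temperature
# column (field #4), then flag any line whose temperature falls more than
# 5 below that mean. Sample data is illustrative.
use strict;
use warnings;

my @lines = (
    'A15 26.62 765 27.30 4.3',
    'A11 26.63 763 27.28 4.2',
    'A12 26.68 767 27.29 4.3',
    'A16 26.64 768 2.30 4.2',    # dropped character: should be 27.30
);

my @temps = map { (split ' ')[3] } @lines;
my $avg = 0;
$avg += $_ for @temps;
$avg /= @temps;

my $cutoff = $avg - 5;
for my $line (@lines) {
    my $temp = (split ' ', $line)[3];
    print "FLAGGED: $line\n" if $temp < $cutoff;
}
```

    Note that this catches a leading-digit dropout like 2.30, but a trailing dropout like 27.8 (from 27.28) stays near the average and slips through, so it complements rather than replaces the decimal-places check.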

    HOWEVER... here's the real head-scratcher: how would you deal with two columns that contain dates and times?

    For instance, assume that you have dates & times like these, which, when transmitted wirelessly, may suffer dropouts:

    A3 11/20/2013 8:19:56 26.62 26.69
    A4 11/20/2013 8:19:57 26.62 26.69
    A5 11/20/2013 8:19:58 26.62 26.69
    A7 11/20/2013 8:20:1 26.62 26.69
    A9 11/20/2013 8:20:4 26.62 26.69
    A10 11/20/2013 8:20:5 26.62 26.69
    A12 11/20/2013 8:20:8 26.62 26.69
    A13 11/20/2013 8:20:9 26.69 26.69
    A14 11/20/2013 8:20:10 26.62 26.69
    A16 11/20/2013 8:20:13 26.62 26.69
    A18 11/20/2013 8:20:16 26.62 26.69
    A23 11/20/2013 8:20:22 26.62 26.69

    But if you add "interference" that "loses" a character in the date & time fields, you would get:

    A3 11/20/2013 8:19:56 26.62 26.69
    A4 1/20/2013 8:19:57 26.62 26.69
    A5 11/0/2013 8:19:58 26.62 26.69
    A7 11/20/2013 8:20:1 26.62 26.69
    A9 11/2/2013 8:20:4 26.62 26.69
    A10 11/20/2013 8:20:5 26.62 26.69
    A12 11/20/2013 8:20:8 26.62 26.69
    A13 11/20/2013 8:0:9 26.69 26.69
    A14 11/20/2013 8:20:10 26.62 26.69
    A16 1/20/2013 8:2:13 26.62 26.69
    A18 11/20/2013 8:0:16 26.62 26.69
    A23 11/20/2013 8:20:22 26.62 26.69

    Based on this example, what would you suggest I could do to "flag" the bad dates & times?
      what would you suggest I could do to "flag" the bad dates & times?

      First, take note that the hours-minutes-seconds elements of lines 4-8 in the data sample are NOT formatted in standard HH:MM:SS notation, so fix that first.
      Then...

      1. The regex approach below was originally written as a way to attack your initial problem statement (the OP). Adapt it to check for valid dates and times (IN A STANDARD FORMAT!). Hint: the Perl Cookbook and various nodes here will show you a method.

#!/usr/bin/perl
use 5.016;
use warnings;
use Data::Dumper;

# 1063982 NB: Checks only T2 for range +\- 0.5C.

my ( $cols, $ID, $T1, $Press, $T2, $LastItem );
my @cols = (
    'A16 26.64 68 27.30 4.2',
    'A15 26.62 765 2.30 4.3',
    'A11 26.62 761 7.31 4.1',
    'A11 26.63 763 27.8 4.2',
    'A12 26.68 767 27.29 4.3',
    'A15 26.62 765 27.30 4.3',
    'A15 26.63 763 27.28 4.2',
    'A16 26.68 767 2.29 4.3',
    'A17 26.64 768 27.30 4.2',
    'A18 26.62 761 27.31 41',
    'A211 26.73 764 27.39 4.4',
    'A22 26.59 760 27.3 4.0',
    'A23 26.54 765 27.84 4.1',
);

for $cols (@cols) {
    if ( $cols =~ /^(A\d\d)\s(2\d\.\d\d)\s(7\d\d)\s(2\d\.\d\d)\s(\d\.\d)$/ ) {
        $ID    = $1;
        $T1    = $2;
        $Press = $3;
        $T2    = $4;
        if ( $T2 < 26.80 ) {
            $T2 .= " out of range";
        }
        elsif ( $T2 > 27.80 ) {
            $T2 .= " is out of range";
        }
        say "\$T2, $T2 in ID $ID\n";
        $LastItem = $5;
    }
    elsif ( $cols =~ /^(A\d{2,2})\s.*/ ) {
        $ID = $1;
        say "In $ID, BAD VALUE(s) within $cols\n";
        $ID = '';
    }
    else {
        say "BAD VALUES somewhere in $cols\n";
    }
}

=head output (errors highlighted):
In A16, BAD VALUE(s) within A16 26.64 68 27.30 4.2
                                      /\
In A15, BAD VALUE(s) within A15 26.62 765 2.30 4.3
                                          /\
In A11, BAD VALUE(s) within A11 26.62 761 7.31 4.1
                                          /\
In A11, BAD VALUE(s) within A11 26.63 763 27.8 4.2
                                          /\
$T2, 27.29 in ID A12
$T2, 27.30 in ID A15
$T2, 27.28 in ID A15
In A16, BAD VALUE(s) within A16 26.68 767 2.29 4.3
                                          /\
$T2, 27.30 in ID A17
In A18, BAD VALUE(s) within A18 26.62 761 27.31 41
                                                /\
BAD VALUES somewhere in A211 26.73 764 27.39 4.4
                        /\
In A22, BAD VALUE(s) within A22 26.59 760 27.3 4.0
$T2, 27.84 is out of range in ID A23
=cut
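
      One hedged sketch of adapting that regex to also vet the date and time fields, assuming they have first been normalized to zero-padded MM/DD/YYYY and HH:MM:SS as suggested above (the range patterns and sample lines are illustrative):

```perl
#!/usr/bin/perl
# Sketch: validate zero-padded MM/DD/YYYY dates and HH:MM:SS times
# with anchored range patterns. A line with a dropped character no
# longer matches and is reported as BAD.
use strict;
use warnings;

my $date_re = qr{(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/(20\d\d)};
my $time_re = qr{([01]\d|2[0-3]):([0-5]\d):([0-5]\d)};

for my $line ( 'A3 11/20/2013 08:19:56 26.62 26.69',
               'A4 1/20/2013 08:19:57 26.62 26.69' ) {   # dropped character
    if ( $line =~ /^A\d+\s$date_re\s$time_re\s/ ) {
        print "OK:  $line\n";
    }
    else {
        print "BAD: $line\n";
    }
}
```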
            or

      2. Compare each date and time to its neighbors, perhaps by converting to epoch seconds and requiring that all 3 fall within 60*60*24*n seconds of each other, where n is the number of days over which you'll acquire data between each data reduction. Note, though, that this scheme will suffer the same failings as were previously discussed with respect to multiple consecutive errors.
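
      A rough sketch of that neighbor comparison, using the core Time::Local module (the field layout follows the sample data; n = 1 day and the subroutine name are illustrative):

```perl
#!/usr/bin/perl
# Sketch of suggestion 2: convert each line's date/time to epoch seconds
# and flag a line whose timestamp is not within 60*60*24*n seconds of both
# neighbors (n = 1 here). Malformed fields yield undef, which also flags.
use strict;
use warnings;
use Time::Local qw(timelocal);

sub to_epoch {
    my ($date, $time) = @_;
    my ($mon, $mday, $year) = split m{/}, $date;   # MM/DD/YYYY
    my ($hour, $min, $sec)  = split /:/,  $time;   # H:MM:SS
    return undef unless defined $sec;
    # eval traps out-of-range components (e.g. month 0) instead of dying
    return eval { timelocal($sec, $min, $hour, $mday, $mon - 1, $year) };
}

my @lines = (
    'A3 11/20/2013 8:19:56 26.62 26.69',
    'A4 1/20/2013 8:19:57 26.62 26.69',    # dropped character in the date
    'A5 11/20/2013 8:19:58 26.62 26.69',
);

my @epochs = map { my ($id, $d, $t) = split ' '; to_epoch($d, $t) } @lines;

my $window = 60 * 60 * 24;    # n = 1 day between data reductions
for my $i (1 .. $#epochs - 1) {
    my ($prev, $cur, $next) = @epochs[ $i - 1, $i, $i + 1 ];
    if (   !defined $cur
        || abs($cur - $prev) > $window
        || abs($next - $cur) > $window )
    {
        print "FLAGGED: $lines[$i]\n";
    }
}
```

      As noted, the lines at either end of the file have only one neighbor, so this sketch leaves them unchecked; that is the same first/last-line weakness discussed earlier in the thread.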

      Note, also, that while Ln 19 is obviously an error, Ln 20 (aside from not being in MM/DD/YYYY format, which is itself a regionalism; YYYY/MM/DD would be preferable) is merely suspect without knowing the allowed range of dates. So too is the date in Ln 26 (but there, the time is an obvious error).

        Thanks ww for your suggestions.

        Your idea of turning the dates into timestamps was key, since this will allow me to compare the values much more easily than if they were formatted in calendar/time form.

        Also, I was thinking that if there is a way to format the date and time values from the original data so that they follow the dd/mm/yyyy and hh:mm:ss formats explicitly, this could also make things a bit easier on my end, because it would let me easily check that all the values between the separators are 2 digits long (except the yyyy, of course). Anyway, just something I'm thinking about :)

        Thank you again for your incredibly valuable guidance on this!!