Comparing Dates and Reoccurance

tuakilan has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

This is a follow-up to http://www.perlmonks.org/?node_id=673673, where I posted questions on how to deal with dates and reoccurrences.

As a newbie I am trying to finish an assignment, and I am pulling my hair out :(

The task

Read a raw ASCII log file that was collected by a toll-collecting machine.

From the log file, using "tids" and "channel" as the key, locate records that are more than 3600 seconds apart.

Record how many times such an incident happened and identify the count as 'occurrences'.

Output the result in the layout shown in 'report-2007-01-01.txt'.

As an SQL statement, it would look similar to this:

select * from
where channel = seven
and time > 3600 seconds
commit;

Exact raw ASCII logfile from toll collecting machine, tollog-2007-jan-01.txt

2008-Jan-01 00:00:00 UTC (GMT +0000) - Toll: channel = seven, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 123456789
2008-Jan-01 00:10:00 UTC (GMT +0000) - Toll: channel = six, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 987654321
2008-Jan-01 00:20:00 UTC (GMT +0000) - Toll: channel = three, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 223344221
2008-Jan-01 00:30:00 UTC (GMT +0000) - Toll: channel = four, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 998829992
2008-Jan-01 00:40:00 UTC (GMT +0000) - Toll: channel = three, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 938874724
2008-Jan-01 00:50:00 UTC (GMT +0000) - Toll: channel = two, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 229928828
2008-Jan-01 01:00:00 UTC (GMT +0000) - Toll: channel = five, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 998822992
2008-Jan-01 01:10:00 UTC (GMT +0000) - Toll: channel = seven, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 123456789

As you can see from the above, records 1 and 8 are the desired output, because these two records have the same channel name and tids number.

Desired report file : report-2007-01-01.txt

TIDS               time                    Occurance
====================================================
123456789          2008-Jan-01 01:10:00     2

So far I have written the following, but it produces a zero-byte file :(

#!/usr/local/bin/perl -w
use strict;
use warnings;
use Time::Local;
my $infile = 'input.2008-01-01.log';
my $outfile = 'output.2008-01-01.log';
my($fh_out, $fh);
open($fh_out, '>', $outfile) or die "Could not open outfile: $!";
open($fh, '<', $infile) or die "Could not open logfile: $!";
my %track;
while (<$fh>){
  my ($date,$ignoreIDLiteral,$id) = split / - | = /;
  chomp $id;
  my   $time = dateconv($date);
  my $prevtime = $track{$id}{TIME};
  $track{$id}{TIME}=$time;
  $track{$id}{DATE}=$date;
  $track{$id}{COUNT}++;
  print "$id\t$date\t$track{$id}{COUNT}\n"
      if $prevtime and $time - $prevtime > 3600;

}
sub dateconv{
  my $d = shift;
  my %month = qw[jan 1 feb 2 mar 3 apr 4 may 5 jun 6 jul 7
                 aug 8 sep 9 oct 10 nov 11 dec 12];
  my @p = $d=~/(\d+)-(\w+)-(\d+)\s(\d+):(\d+):(\d+)/;
  $p[1]=$month{ lc $p[1]  } - 1;
  return  timelocal(@p[5,4,3,2,1,0]);
#timelocal($sec,$min,$hour,$mday,$mon,$year);
}
close $fh_out;
close $fh;

I think I messed up the regex for the incoming log file. Can anyone point out where I went wrong?


Replies are listed 'Best First'.
Re: Comparing Dates and Reoccurance - Part II by mr_mischief (Monsignor) on Mar 13, 2008 at 02:55 UTC
NetWallah did so in Re^3: Comparing Dates and Reoccurance. You are not writing to the filehandle `$fh_out`. Change this line:

`print "$id\t$date\t$track{$id}{COUNT}\n" if $prevtime and $time - $prevtime > 3600;`

to this:

`print $fh_out "$id\t$date\t$track{$id}{COUNT}\n" if $prevtime and $time - $prevtime > 3600;`

Take a look at print for more information. The proper place for this was probably as a follow-up in your existing thread.
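A minimal sketch of the point above, assuming nothing beyond core Perl: the helper name `write_record` and the in-memory filehandle are illustrative only and are not part of the original script. It shows that output reaches the file only when the handle is named in the print call.

```perl
use strict;
use warnings;

# Hypothetical helper: write one report line to the handle passed in.
sub write_record {
    my ($out, $id, $date, $count) = @_;
    # The block form print {$out} LIST makes the destination explicit.
    # A bare print LIST would go to STDOUT instead of the file.
    print {$out} "$id\t$date\t$count\n";
}

# An in-memory filehandle stands in for the real output file in this demo.
my $buffer = '';
open(my $fh_out, '>', \$buffer) or die "Could not open in-memory file: $!";
write_record($fh_out, '123456789', '2008-Jan-01 01:10:00', 2);
close $fh_out;

print "captured: $buffer";
```

In the real script the same call would simply receive the handle opened on the output file.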
Re: Comparing Dates and Reoccurance - Part II by NetWallah (Canon) on Mar 13, 2008 at 03:18 UTC
My previous response indicated why you were not getting output. Anyway, at this point, since I'm whoring for XP to make parson, here is a working solution (change the input file name):

#!/usr/local/bin/perl -w
use strict;
use warnings;
use Time::Local;
my $infile = 'test-logdata.txt';
my $outfile = 'output.2008-01-01.log';
my($fh_out, $fh);
open($fh_out, '>', $outfile) or die "Could not open outfile: $!";
open($fh, '<', $infile) or die "Could not open logfile: $!";
my %track;
my $header=<<HEADER;
TIDS               time                    Occurance
====================================================
HEADER
print $fh_out $header;
while (<$fh>){
  my ($date,$channel,$id) = /^(\S+\s\S+).+channel = (\w+).+tids = (\w+)/;
  my $time = dateconv($date);
  my $prevtime = $track{$id}{TIME};
  $track{$id}{TIME}=$time;
  $track{$id}{DATE}=$date;
  if ($prevtime and $time - $prevtime > 3600){
    $track{$id}{COUNT}++;
    print $fh_out "$id\t$date\t$track{$id}{COUNT}\n" ;
  }
}
sub dateconv{
  my $d = shift;
  my %month = qw[jan 1 feb 2 mar 3 apr 4 may 5 jun 6 jul 7
                 aug 8 sep 9 oct 10 nov 11 dec 12];
  my @p = $d=~/(\d+)-(\w+)-(\d+)\s(\d+):(\d+):(\d+)/;
  $p[1]=$month{ lc $p[1] } - 1;
  return timelocal(@p[5,4,3,2,1,0]);
  #timelocal($sec,$min,$hour,$mday,$mon,$year);
}
close $fh_out;
close $fh;

"As you get older three things happen. The first is your memory goes, and I can't remember the other two... " - Sir Norman Wisdom
Re^2: Comparing Dates and Reoccurance - Part II by tuakilan (Acolyte) on Mar 13, 2008 at 03:46 UTC
Good day NetWallah, I much appreciate your help here, but it produces no result other than the header, which reads as follows:

IDS TIME OCCURANCES
===================================================

Am I missing something here?

#!/usr/local/bin/perl -w
use strict;
use warnings;
use Time::Local;
my $infile = 'input.2008-01-01.log';
my $outfile = 'output.2008-01-01.log';
my $channel = 'seven';
my($fh_out, $fh);
open($fh, '<', $infile) or die "Could not open logfile: $!";
open($fh_out, '>', $outfile) or die "Could not open outfile: $!";
my %track;
while (<$fh>){
  next unless /$channel/;
  my ($date,$channel,$id) = /^(\S+\s\S+).+channel = (\w+).+id = (\w+)/;
  my $time = dateconv($id);
  my $prevtime = $track{$id}{TIME};
  $track{$id}{TIME}=$time;
  $track{$id}{DATE}=$date;
  if ($prevtime and $time - $prevtime > 3600){
    $track{$id}{COUNT}++;
    print $fh_out "$id\t$date\t$track{$id}{COUNT}\n" ;
  }
}
sub dateconv{
  my $d = shift;
  my %month = qw[jan 1 feb 2 mar 3 apr 4 may 5 jun 6 jul 7
                 aug 8 sep 9 oct 10 nov 11 dec 12];
  my @p = $d=~/(\d+)-(\w+)-(\d+)\s(\d+):(\d+):(\d+)/;
  $p[1]=$month{ lc $p[1] } - 1;
  return timelocal(@p[5,4,3,2,1,0]);
  #timelocal($sec,$min,$hour,$mday,$mon,$year);
}
my $header=<<HEADER;
ID TIME OCCURANCES
=====================================================
HEADER
print $fh_out $header;
close $fh_out;
close $fh;
Re^3: Comparing Dates and Reoccurance - Part II by NetWallah (Canon) on Mar 13, 2008 at 15:40 UTC
Yes, you are missing something ;-) You added a line to my code:

`next unless /$channel/;`

The problem is that this line is in the wrong place, and probably does not do what you think it is doing. You are checking the value of $channel before that value is set as intended by my original code. You have also declared $channel twice.

The other line you mangled now reads:

`my $time = dateconv($id);`

Please try to understand the code before you choose to modify it. Now that I have made parson, I'm less motivated to spoon-feed beyond this point.

Update: Use mr_mischief's(++) excellent update to my code.
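One way to read the advice above, as a rough sketch rather than the code from the thread: capture the fields first, then compare the captured channel against the wanted one. The sub name `matching_records` and the shortened sample lines are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return [date, channel, id] triples for one channel.
sub matching_records {
    my ($wanted, @lines) = @_;
    my @hits;
    for my $line (@lines) {
        # Capture first; skip any line that does not parse at all.
        my ($date, $channel, $id) =
            $line =~ /^(\S+\s\S+).+channel = (\w+).+tids = (\w+)/
            or next;
        # Filter on the captured channel, not on the raw text of the line.
        next unless $channel eq $wanted;
        push @hits, [ $date, $channel, $id ];
    }
    return @hits;
}

# Shortened sample lines in the same layout as the toll log.
my @sample = (
    "2008-Jan-01 00:00:00 UTC (GMT +0000) - Toll: channel = seven, ref = xxx.x, tids = 123456789\n",
    "2008-Jan-01 00:10:00 UTC (GMT +0000) - Toll: channel = six, ref = xxx.x, tids = 987654321\n",
);

my @hits = matching_records('seven', @sample);
print join("\t", @$_), "\n" for @hits;
```

Comparing with `eq` also avoids accidental matches when the channel name happens to appear elsewhere in the line, for example inside the ref field.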
Re: Comparing Dates and Reoccurance - Part II by dwm042 (Priest) on Mar 13, 2008 at 05:47 UTC
The fundamental issue with your code is the split command, which leaves $id with the wrong stuff. You're parsing the date okay. A slight modification of the split, to something like:

`split / - | tids = /;`

And your code should work fine. This modification of your original produces the kind of output you desire:

#!/usr/bin/perl
use strict;
use warnings;
use Time::Local;

my $debug = shift || 0;
my $header = 0;
my $outfile = 'output.2008-01-01.log';
my($fh_out, $fh);
open($fh_out, '>', $outfile) or die "Could not open outfile: $!";
my %track;
while (<DATA>){
  my ($date,$ignoreIDLiteral,$id) = split / - | tids = /;
  if ( $debug ) {
    print "Date = $date\n";
    print "Literal = $ignoreIDLiteral\n";
    print "Id = $id\n";
  }
  chomp $id;
  my $time = dateconv($date);
  my $prevtime = $track{$id}{TIME};
  $track{$id}{TIME}=$time;
  $track{$id}{DATE}=$date;
  $track{$id}{COUNT}++;
  if ( $prevtime and ( $time - $prevtime > 3600 ) ) {
    unless ( $header ) {
      print "TIDS               time                    Occurance\n";
      print "====================================================\n";
      $header = 1;
    }
    print "$id\t$date\t$track{$id}{COUNT}\n"
  }
}
sub dateconv{
  my $d = shift;
  my %month = qw[jan 1 feb 2 mar 3 apr 4 may 5 jun 6 jul 7
                 aug 8 sep 9 oct 10 nov 11 dec 12];
  my @p = $d=~/(\d+)-(\w+)-(\d+)\s(\d+):(\d+):(\d+)/;
  $p[1]=$month{ lc $p[1] } - 1;
  return timelocal(@p[5,4,3,2,1,0]);
}
close $fh_out;
__DATA__
2008-Jan-01 00:00:00 UTC (GMT +0000) - Toll: channel = seven, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 123456789
2008-Jan-01 00:10:00 UTC (GMT +0000) - Toll: channel = six, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 987654321
2008-Jan-01 00:20:00 UTC (GMT +0000) - Toll: channel = three, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 223344221
2008-Jan-01 00:30:00 UTC (GMT +0000) - Toll: channel = four, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 998829992
2008-Jan-01 00:40:00 UTC (GMT +0000) - Toll: channel = three, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 938874724
2008-Jan-01 00:50:00 UTC (GMT +0000) - Toll: channel = two, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 229928828
2008-Jan-01 01:00:00 UTC (GMT +0000) - Toll: channel = five, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 998822992
2008-Jan-01 01:10:00 UTC (GMT +0000) - Toll: channel = seven, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 123456789

Output is:

./read_tolls.pl
TIDS               time                    Occurance
====================================================
123456789	2008-Jan-01 01:10:00 UTC (GMT +0000)	2
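To make the split issue concrete, here is a small sketch that runs both patterns against one sample line. The shortened ref value and the variable names `@loose` and `@tight` are assumptions for illustration; the two patterns are the ones discussed in this thread.

```perl
use strict;
use warnings;

# One log line in the same layout as the toll log (ref shortened for the demo).
my $line = "2008-Jan-01 00:00:00 UTC (GMT +0000) - Toll: channel = seven, ref = xxx.xxxxxx, tids = 123456789\n";

# The original pattern breaks on every " = ", so the line falls into 5 pieces.
my @loose = split / - | = /, $line;

# The modified pattern breaks only on " - " and " tids = ", giving 3 pieces.
my @tight = split / - | tids = /, $line;

chomp @loose;
chomp @tight;

print "loose: ", scalar(@loose), " fields, third is [$loose[2]]\n";
print "tight: ", scalar(@tight), " fields, third is [$tight[2]]\n";
```

With the original pattern the third field is the text between the first and second " = ", which is why $id never held the tids value.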
Re^2: Comparing Dates and Reoccurance - Part II by tuakilan (Acolyte) on Mar 13, 2008 at 06:18 UTC
Hi dwm, I edited your code as shown below, and when I ran it as root it still printed only the header, without any records :(

#!/usr/local/bin/perl -w
use strict;
use warnings;
use Time::Local;

my $debug = shift || 0;
my $header = 0;
my $infile = 'tollog-2007-jan-01.txt';
my $outfile = 'report-2007-01-01.txt';
my($fh_out, $fh);
open($fh, '<', $infile) or die "Could not open logfile: $!";
open($fh_out, '>', $outfile) or die "Could not open outfile: $!";
my %track;
while (<$fh>){
  my ($date,$ignoreIDLiteral,$id) = split / - | id = /;
  if ( $debug ) {
    print "Date = $date\n";
    print "Literal = $ignoreIDLiteral\n";
    print "ID = $id\n";
  }
  chomp $id;
  my $time = dateconv($date);
  my $prevtime = $track{$id}{TIME};
  $track{$id}{TIME}=$time;
  $track{$id}{DATE}=$date;
  $track{$id}{COUNT}++;
  if ( $prevtime and ( $time - $prevtime > 3600 ) ) {
    unless ( $header ) {
      print "TIDS               TIME                    OCCURANCE\n";
      print "====================================================\n";
      $header = 1;
    }
    print "$id\t$date\t$track{$id}{COUNT}\n"
  }
}
sub dateconv{
  my $d = shift;
  my %month = qw[jan 1 feb 2 mar 3 apr 4 may 5 jun 6 jul 7
                 aug 8 sep 9 oct 10 nov 11 dec 12];
  my @p = $d=~/(\d+)-(\w+)-(\d+)\s(\d+):(\d+):(\d+)/;
  $p[1]=$month{ lc $p[1] } - 1;
  return timelocal(@p[5,4,3,2,1,0]);
}
close $fh_out;

By the way, what does $ignoreIDLiteral mean here? Does the ID refer to the ID field in the source data?
Re: Comparing Dates and Reoccurance - Part II by ack (Deacon) on Mar 13, 2008 at 05:33 UTC
I don't see anywhere that you print to $fh_out...you open it for writing (which, since you're using '>' in your open, also creates the file if it doesn't already exist), but all of your print statements are to STDOUT (by default, since you don't specify a filehandle to print to). So you never print anything to the output file.

If I understand what you're doing, I think your print statement in your while() loop should read:

`print $fh_out "$id\t$date\t$track{$id}{COUNT}\n" if $prevtime and $time - $prevtime > 3600;`

I just entered your code, created your input file and ran the code against it...with the print to the proper filehandle...and it correctly wrote the output to the file. However, the information that it wrote was not what you show you're expecting. So there are probably other problems with your code.

Hence, getting the print statement right solves the "empty file" challenge. Getting the information written to be what you're looking for is another matter.

ack Albuquerque, NM
Re: Comparing Dates and Reoccurance - Part II by mr_mischief (Monsignor) on Mar 13, 2008 at 16:52 UTC
Why would you print the number of occurrences in the loop, in which case every additional occurrence gets listed with the current count of past occurrences? I think you need to build your data structure completely, then iterate over it. Try this on for size:

#!/usr/bin/perl --
use strict;
use warnings;

use Time::Local;
use POSIX qw( strftime );

my %conf = (
    'input'    => 'input.2008-01-01.log',
    'output'   => 'output.2008-01-01.log',
    'duration' => 3600,
);

my %track;

sub dateconv {
    my ( $date, $time ) = @_;
    my %months = qw( Jan 01 Feb 02 Mar 03 Apr 04 May 05 Jun 06
                     Jul 07 Aug 08 Sep 09 Oct 10 Nov 11 Dec 12 );
    my @parts = reverse split /:/, $time;
    push @parts, reverse split /-/, $date;
    $parts[4] = $months{ $parts[4] } - 1;
    return timelocal( @parts );
}

open ( my $in, '<', $conf{ 'input' } ) or die 'Cannot open input file ' . $conf{ 'input' } . ": $!\n";

while ( <$in> ) {
    chomp;
    # Example input:
    # 2008-Jan-01 00:00:00 UTC (GMT +0000) - Toll: channel = seven, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 123456789
    if ( /(\d{4}-\w{3}-\d{2})\s
          (\d{2}:\d{2}:\d{2})\s
          \w+\s
          \(GMT\s([\+\-]\d{4})\)\s
          -\sToll:\schannel\s=\s(\w+),\s
          ref\s=\s\S+,\s
          tids\s=\s(\d+)
         /x ) {
        my ( $date, $time, $offset ) = ( $1, $2, $3 );
        my ( $channel, $id ) = ( $4, $5 );
        my $e_time = dateconv( $date, $time );

        if ( defined $track{ $channel }{ $id } ) {
            if ( $e_time - $track{ $channel }{ $id }{ 'time' } > $conf{ 'duration' } ) {
                $track{ $channel }{ $id }{ 'occurrences' }++;
            }
        }
        else {
            $track{ $channel }{ $id }{ 'time' } = $e_time;
            $track{ $channel }{ $id }{ 'occurrences' } = 1;
        }
    }
    else {
        print "line does not match!\n";
    }
}
close $in;

open ( my $out, '>', $conf{ 'output' } ) or die 'Cannot open output file ' . $conf{ 'output' } . ": $!\n";

print $out <<_HEADER;
TIDS               time                    Occurrence
====================================================
_HEADER

foreach my $channel ( sort keys %track ) {
    foreach my $id ( sort keys %{ $track{$channel} } ) {
        if ( $track{$channel}{$id}{ 'occurrences' } > 1 ) {
            my $date_time = POSIX::strftime( '%Y-%b-%d %H:%M:%S', ( localtime( $track{$channel}{$id}{ 'time' } ) ) );
            print $out "$id\t\t$date_time\t" . $track{$channel}{$id}{ 'occurrences' } . "\n";
        }
    }
}
close $out;
__END__

I've made a few slight modifications which don't necessarily reflect your errors, but which reflect how I'd attack the problem:

- I'm using a regex for the log line, which should be a bit more flexible if the log format should ever happen to change.
- I'm extracting the date and time separately and passing both to `dateconv`.
- I'm using `reverse()` on the portions of the date and time rather than passing a slice to `timelocal`.

According to your code, it seems that you are interested only in two records next to one another. I draw this conclusion because if you have three records on the same channel and ID, you'll not be able to tell if, for example, the third one and the first one are more than an hour apart. Is that really what you want? The only scenario that immediately explains to me the session code you're using is a periodic task completion, like travelling a circular route and crossing a start/finish line or passing a token back and forth on a network.

I might just need more info to understand this, but there seem to be issues with the logging method. There's no indication in the information you present as to what's a start record and what's a stop record, yet you consider any pair of matching IDs with no likewise matching IDs between them a "record". Yet if you have more than two, you'll be considering the first and second as a session record, the second and the third as a session record, and the third and the fourth... Unless you're absolutely sure you'll never have more than two lines with the same ID (like if it's a unique session ID), then you're counting more sessions than you have. OTOH, if you're guaranteed to never have more than two lines with the same ID, then why do you need a count of the occurrences for that ID?

Are you timing network connections, lap times around a track, stops at physical toll booths on a highway, or what?

Given the troublesome issues I can't reconcile with your logging input and your code, I coded the above to match the first occurrence of a particular ID's time against any and all lines for that ID later. This gives a count of how many times an ID was logged more than an hour from the initial log line. It should be trivial to change that behavior back to the behavior your code represents.
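For comparison, the neighbouring-records behaviour described above could be sketched roughly as follows. This is an illustration under stated assumptions, not code from the thread: the sub name `count_gaps_between_neighbours`, the "channel/id" key format, and the use of plain epoch seconds in place of parsed dates are all choices made for the demo.

```perl
use strict;
use warnings;

# Count, per channel/ID pair, how many consecutive sightings are more than
# $min_gap seconds apart. Each event is [ $channel, $id, $epoch_seconds ].
sub count_gaps_between_neighbours {
    my ($min_gap, @events) = @_;
    my (%last_seen, %gaps);
    for my $event (@events) {
        my ($channel, $id, $epoch) = @$event;
        my $key = "$channel/$id";
        if (defined $last_seen{$key} && $epoch - $last_seen{$key} > $min_gap) {
            $gaps{$key}++;
        }
        # Always move the reference point to the line just read, so each
        # line is compared with its immediate predecessor for that key.
        $last_seen{$key} = $epoch;
    }
    return %gaps;
}

my %gaps = count_gaps_between_neighbours(
    3600,
    [ 'seven', '123456789', 0 ],
    [ 'seven', '123456789', 4200 ],   # 70 minutes after the first line
    [ 'seven', '123456789', 6000 ],   # only 30 minutes after the second
);
print "$_: $gaps{$_}\n" for sort keys %gaps;   # prints "seven/123456789: 1"
```

Measured against the first line instead, both later lines in this sample would exceed the hour, which is exactly the difference between the two interpretations.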
Re^2: Comparing Dates and Reoccurance - Part II by tuakilan (Acolyte) on Mar 22, 2008 at 17:24 UTC
Hi mr_mischief, I would like to thank you sincerely for helping me and guiding me through this task, which is a mid-term assignment from our local C/S lecturer, who I think should go for a refresher course in presentation. The log file is provided as it is; it is direct raw data from the tollbooth machines.

Quote:
According to your code, it seems that you are interested only in two records next to one another. I draw this conclusion because if you have three records on the same channel and ID, you'll not be able to tell if, for example, the third one and the first one are more than an hour apart. Is that really what you want? The only scenario that immediately explains to me the session code you're using is a periodic task completion, like traveling a circular route and crossing a start/finish line or passing a token back and forth on a network.

The complete daily log files that were handed over run to more than 20,000 lines, and if there are 3 records next to one another, I won't be able to identify the real time gap.

Quote:
I might just need more info to understand this, but there seems to be issues with the logging method. There's no indication in the information you present as to what's a start record and what's a stop record, yet you consider any pair of matching IDs with no likewise matching IDs between them a "record". Yet if you have more than two, you'll be considering the first and second as a session record, the second and the third as a session record, and the third and the fourth... Unless you're absolutely sure you'll never have more than two lines with the same ID (like if it's a unique session ID), then you're counting more sessions than you have. OTOH, if you're guaranteed to never have more than two lines with the same ID, then why do you need a count of the occurrences for that ID? Are you timing network connections, lap times around a track, stops at physical toll booths on a highway, or what?

The task for the script is to analyse raw data pulled from physical toll booths on a highway, so there may be records with an identical TIDS/class passing through a particular channel/lane. As for what exactly TIDS/class is, I reckon it is the class of vehicle or even the car registration number.

Now, to complicate the matter, the lecturer wants the script to read a text file containing a list of TIDS values, so that the script scans only for these TIDS and ignores the rest.

As an SQL statement, it may look like this:

select * from records
where tids = ( xxxxxxxx,xxxxxx,xxxxxx ... ) # <--- read from TIDS file
and channel = seven
and time > 3600
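The TIDS list requirement could be handled with a lookup hash, sketched below. The thread does not specify the layout of the TIDS file, so one value per line is an assumption here, and the sub name `load_tids` and the in-memory sample list are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: read one TIDS value per line (an assumed file layout)
# from an open handle and return a hash reference for fast membership tests.
sub load_tids {
    my ($fh) = @_;
    my %wanted;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;   # trim surrounding whitespace and newline
        next if $line eq '';       # ignore blank lines
        $wanted{$line} = 1;
    }
    return \%wanted;
}

# An in-memory handle stands in for the real TIDS file in this demo.
my $list = "123456789\n998822992\n";
open(my $tids_fh, '<', \$list) or die "Could not open TIDS list: $!";
my $wanted = load_tids($tids_fh);
close $tids_fh;

# Inside the log loop this test would become: next unless $wanted->{$id};
for my $id (qw(123456789 987654321)) {
    print "$id: ", ($wanted->{$id} ? 'keep' : 'skip'), "\n";
}
```

A hash lookup keeps the per-line cost constant, which matters little for a dozen TIDS values but helps once the list and the 20,000-line logs grow.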