in reply to Comparing Dates and Reoccurance - Part II

Why would you print the number of occurrences in the loop, in which case every additional occurrence gets listed with the current count of past occurrences? I think you need to build your data structure completely then iterate over it.

Try this on for size:

#!/usr/bin/perl --
use strict;
use warnings;
use Time::Local;
use POSIX qw( strftime );

my %conf = (
    'input'    => 'input.2008-01-01.log',
    'output'   => 'output.2008-01-01.log',
    'duration' => 3600,
);
my %track;

# Convert '2008-Jan-01' and '00:00:00' into epoch seconds.
sub dateconv {
    my ( $date, $time ) = @_;
    my %months = qw( Jan 01 Feb 02 Mar 03 Apr 04 May 05 Jun 06 Jul 07
                     Aug 08 Sep 09 Oct 10 Nov 11 Dec 12 );
    # Build ( sec, min, hour, mday, mon, year ) in the order timelocal() expects.
    my @parts = reverse split /:/, $time;
    push @parts, reverse split /-/, $date;
    $parts[4] = $months{ $parts[4] } - 1;    # timelocal() wants months 0..11
    return timelocal( @parts );
}

open ( my $in, '<', $conf{ 'input' } )
    or die 'Cannot open input file ' . $conf{ 'input' } . ": $!\n";

while ( <$in> ) {
    chomp;
    # Example input:
    # 2008-Jan-01 00:00:00 UTC (GMT +0000) - Toll: channel = seven,
    #   ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 123456789
    if ( /(\d{4}-\w{3}-\d{2})\s
          (\d{2}:\d{2}:\d{2})\s
          \w+\s
          \(GMT\s([\+\-]\d{4})\)\s
          -\sToll:\schannel\s=\s(\w+),\s
          ref\s=\s\S+,\s
          tids\s=\s(\d+)
         /x ) {
        my ( $date, $time, $offset ) = ( $1, $2, $3 );
        my ( $channel, $id )         = ( $4, $5 );
        my $e_time = dateconv( $date, $time );

        if ( defined $track{ $channel }{ $id } ) {
            # Seen before: count this line only if it is more than
            # 'duration' seconds after the FIRST line for this ID.
            if ( $e_time - $track{ $channel }{ $id }{ 'time' } > $conf{ 'duration' } ) {
                $track{ $channel }{ $id }{ 'occurrences' }++;
            }
        }
        else {
            # First line for this channel and ID.
            $track{ $channel }{ $id }{ 'time' }        = $e_time;
            $track{ $channel }{ $id }{ 'occurrences' } = 1;
        }
    }
    else {
        print "line does not match!\n";
    }
}
close $in;

open ( my $out, '>', $conf{ 'output' } )
    or die 'Cannot open output file ' . $conf{ 'output' } . ": $!\n";

print $out <<_HEADER;
TIDS            time                    Occurrence
====================================================
_HEADER

# Report only after the whole data structure has been built.
foreach my $channel ( sort keys %track ) {
    foreach my $id ( sort keys %{ $track{$channel} } ) {
        if ( $track{$channel}{$id}{ 'occurrences' } > 1 ) {
            my $date_time = POSIX::strftime( '%Y-%b-%d %H:%M:%S',
                ( localtime( $track{$channel}{$id}{ 'time' } ) ) );
            print $out "$id\t\t$date_time\t"
                . $track{$channel}{$id}{ 'occurrences' } . "\n";
        }
    }
}
close $out;
__END__
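One thing the script above leaves alone: the regex captures the `(GMT +0000)` offset into `$offset`, but nothing uses it, and `timelocal()` interprets the timestamp in whatever time zone the machine running the script happens to be in. That does not matter for time differences as long as every line carries the same offset. If it ever does matter, something along these lines might work. This is only a sketch: `epoch_from_log` is a name I made up, and I'm assuming the offset means "this wall-clock time is that far ahead of UTC".

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw( timegm );

# Month abbreviations mapped to the 0..11 index that timegm() expects.
my %month_index = (
    Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
    Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11,
);

# Hypothetical helper (not part of the script above).
# Takes '2008-Jan-01', '00:00:00' and an offset such as '+0000',
# and returns epoch seconds independent of the local time zone.
sub epoch_from_log {
    my ( $date, $time, $offset ) = @_;
    my ( $year, $mon, $mday ) = split /-/, $date;
    my ( $hour, $min, $sec )  = split /:/, $time;

    my ( $sign, $oh, $om ) = $offset =~ /^([+-])(\d{2})(\d{2})$/
        or die "Bad offset '$offset'\n";
    my $offset_seconds = ( $oh * 3600 + $om * 60 ) * ( $sign eq '-' ? -1 : 1 );

    # Assumption: the logged time is $offset_seconds ahead of UTC,
    # so subtract the offset to get back to UTC.
    return timegm( $sec, $min, $hour, $mday, $month_index{$mon}, $year )
        - $offset_seconds;
}

print epoch_from_log( '2008-Jan-01', '00:00:00', '+0000' ), "\n";
```

If you went this way, you would also want `gmtime` rather than `localtime` when formatting the time for the report, so that the output matches the log.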

I've made a few slight modifications which don't necessarily reflect your errors, but which reflect how I'd attack the problem:

According to your code, it seems that you are interested only in two records next to one another. I draw this conclusion because if you have three records on the same channel and ID, you'll not be able to tell if, for example, the third one and the first one are more than an hour apart. Is that really what you want? The only scenario that immediately explains to me the session code you're using is a periodic task completion, like travelling a circular route and crossing a start/finish line or passing a token back and forth on a network.

I might just need more info to understand this, but there seem to be issues with the logging method. There's no indication in the information you present as to what's a start record and what's a stop record, yet you consider any pair of matching IDs with no likewise matching IDs between them a "record". Yet if you have more than two, you'll be considering the first and second as a session record, the second and the third as a session record, and the third and the fourth... Unless you're absolutely sure you'll never have more than two lines with the same ID (like if it's a unique session ID), you're counting more sessions than you have. OTOH, if you're guaranteed to never have more than two lines with the same ID, then why do you need a count of the occurrences for that ID? Are you timing network connections, lap times around a track, stops at physical toll booths on a highway, or what?

Given the troublesome issues I can't reconcile with your logging input and your code, I coded the above to match the first occurrence of a particular ID's time against any and all lines for that ID later. This gives a count of how many times an ID was logged more than an hour from the initial log line. It should be trivial to change that behavior back to the behavior your code represents.
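To make the difference between the two behaviors concrete, here is a small sketch that runs both rules over the same list of timestamps for a single ID. The sub names and the sample times are made up for illustration; neither sub appears in the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count sightings more than $gap seconds after the FIRST sighting.
# This is the rule the script above applies.
sub count_from_first {
    my ( $gap, @times ) = @_;
    return 0 unless @times;
    my $first = shift @times;
    return scalar grep { $_ - $first > $gap } @times;
}

# Count sightings more than $gap seconds after the PREVIOUS sighting.
# This is the "two records next to one another" rule.
sub count_from_previous {
    my ( $gap, @times ) = @_;
    my $count = 0;
    for my $i ( 1 .. $#times ) {
        $count++ if $times[$i] - $times[ $i - 1 ] > $gap;
    }
    return $count;
}

# Hypothetical sightings of one ID, in seconds, already in time order.
my @sightings = ( 0, 1800, 4000, 9000 );

print 'from first:    ', count_from_first( 3600, @sightings ),    "\n";
print 'from previous: ', count_from_previous( 3600, @sightings ), "\n";
```

With these sample times the first rule counts 2 (the sightings at 4000 and 9000) and the second counts 1 (only the jump from 4000 to 9000). In the script, switching to the second rule would mean updating the stored `'time'` for the ID on every matching line rather than only on the first one. Note also that the script starts `'occurrences'` at 1, so its figure is one higher than what `count_from_first` returns.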

Re^2: Comparing Dates and Reoccurance - Part II
by tuakilan (Acolyte) on Mar 22, 2008 at 17:24 UTC
    Hi mr_mischief,

    I would like to thank you sincerely for helping and guiding me through this task. It is a mid-term assignment from our local C/S lecturer, who I think should go for a refresher course on presentation.

    The log file is provided as is: it is the raw data produced directly by the tollbooth machines.


    Quote: According to your code, it seems that you are interested only in two records next to one another. I draw this conclusion because if you have three records on the same channel and ID, you'll not be able to tell if, for example, the third one and the first one are more than an hour apart. Is that really what you want? The only scenario that immediately explains to me the session code you're using is a periodic task completion, like traveling a circular route and crossing a start/finish line or passing a token back and forth on a network.

    The complete daily logfiles that were handed over run to more than 20,000 lines, and if there are 3 records next to one another, I won't be able to identify the real time gap.

    Quote: I might just need more info to understand this, but there seem to be issues with the logging method. There's no indication in the information you present as to what's a start record and what's a stop record, yet you consider any pair of matching IDs with no likewise matching IDs between them a "record". Yet if you have more than two, you'll be considering the first and second as a session record, the second and the third as a session record, and the third and the fourth... Unless you're absolutely sure you'll never have more than two lines with the same ID (like if it's a unique session ID), you're counting more sessions than you have. OTOH, if you're guaranteed to never have more than two lines with the same ID, then why do you need a count of the occurrences for that ID? Are you timing network connections, lap times around a track, stops at physical toll booths on a highway, or what?

    The task for the script is to analyse raw data pulled from physical toll booths on a highway, so there may be several records with an identical TIDS/class passing through a particular channel/lane. As for what exactly TIDS/class is, I reckon it is the class of vehicle, or perhaps the car registration number.

    Now, to complicate the matter, the lecturer wants a text file containing a list of TIDS values, so that the script scans only for those TIDS and ignores the rest.

    As an SQL statement, it might look like this:

    select * from records
    where tids in ( xxxxxxxx, xxxxxx, xxxxxx ... )   -- <--- read from TIDS file
      and channel = 'seven'
      and time > 3600
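In Perl, the usual way to get that kind of filter is to load the TIDS list into a hash and test each record against it. The following is only a sketch under some assumptions: the TIDS file holds one value per line, `load_tids` is a made-up name, and the sample records and their `gap` field (seconds since the earlier sighting) are invented for illustration. It reads the list from an in-memory string so that it runs on its own; a real script would open the TIDS file instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read one TIDS per line from a filehandle and
# return a hash reference for fast lookups. Blank lines and anything
# after a '#' are ignored.
sub load_tids {
    my ($fh) = @_;
    my %wanted;
    while ( my $line = <$fh> ) {
        $line =~ s/#.*//;           # strip comments
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next unless length $line;
        $wanted{$line} = 1;
    }
    return \%wanted;
}

# Stand-in for the TIDS file (assumed format: one value per line).
my $tids_text = "123456789\n\n555000111   # second vehicle\n";
open my $fh, '<', \$tids_text
    or die "Cannot open in-memory TIDS list: $!\n";
my $wanted = load_tids($fh);
close $fh;

# Made-up records standing in for parsed log lines.
my @records = (
    { tids => '123456789', channel => 'seven', gap => 4000 },
    { tids => '123456789', channel => 'three', gap => 5000 },
    { tids => '999999999', channel => 'seven', gap => 7200 },
    { tids => '555000111', channel => 'seven', gap => 1200 },
);

# Equivalent of: tids in (...) and channel = 'seven' and time > 3600
my @hits = grep {
           $wanted->{ $_->{tids} }
        && $_->{channel} eq 'seven'
        && $_->{gap} > 3600
} @records;

print "$_->{tids}\n" for @hits;
```

Only the first sample record passes all three conditions. In the tracking script, the same hash test could go right after the regex match, skipping any line whose ID is not in the list.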