tuakilan has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

This is a follow-up to http://www.perlmonks.com/?node_id=673882, where I posted questions on how to deal with dates and recurrences.

As a newbie, I am trying to finish an assignment and I am pulling my hair out :( and now the complexity has gone up.

The tasks :

1. Read a raw ASCII log file which was collected by a toll collecting machine.
2. From the log file, using "tids" and "channel" as the key, locate records with the same key that are more than 3600 seconds apart.
NEW
3. A separate ASCII file, called tids-list.txt, contains the list of 'tids' values to be used in task no. 2.
4. Record how many times such an incident happened and identify it as 'occurrences'.
5. Output the result in the order shown in 'report-2007-01-01.txt'.

In SQL, it would look similar to this:

select * from
where channel = seven
and tids = ( 123456789, 987654321, ... )
and time > 3600 seconds
commit;

Exact raw ASCII logfile from toll collecting machine, tollog-2007-jan-01.txt

2008-Jan-01 00:00:00 UTC (GMT +0000) - Toll: channel = seven, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 123456789
2008-Jan-01 00:10:00 UTC (GMT +0000) - Toll: channel = six, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 987654321
2008-Jan-01 00:20:00 UTC (GMT +0000) - Toll: channel = three, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 223344221
2008-Jan-01 00:30:00 UTC (GMT +0000) - Toll: channel = four, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 998829992
2008-Jan-01 00:40:00 UTC (GMT +0000) - Toll: channel = three, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 938874724
2008-Jan-01 00:50:00 UTC (GMT +0000) - Toll: channel = two, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 229928828
2008-Jan-01 01:00:00 UTC (GMT +0000) - Toll: channel = five, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 998822992
2008-Jan-01 01:10:00 UTC (GMT +0000) - Toll: channel = seven, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 123456789

As you can see from the above, records 1 and 8 are the desired output, as these 2 records have the same channel name and tids number.

Desired report file : report-2007-01-01.txt

TIDS        time                    Occurance
====================================================
123456789   2008-Jan-01 01:10:00    2

Sample 'tids-list.txt'

123456789
987654321
112233445
888899889

So far I have done the following, but it wrote a zero-byte file :(

#!/usr/local/bin/perl -w
use strict;
use warnings;
use Time::Local;

my $infile = 'input.2008-01-01.log';
my $outfile = 'output.2008-01-01.log';
my($fh_out, $fh);
open($fh_out, '>', $outfile) or die "Could not open outfile: $!";
open($fh, '<', $infile) or die "Could not open logfile: $!";
my %track;
while (<$fh>){
    my ($date,$ignoreIDLiteral,$id) = split / - | = /;
    chomp $id;
    my $time = dateconv($date);
    my $prevtime = $track{$id}{TIME};
    $track{$id}{TIME}=$time;
    $track{$id}{DATE}=$date;
    $track{$id}{COUNT}++;
    print "$id\t$date\t$track{$id}{COUNT}\n"
        if $prevtime and $time - $prevtime > 3600;
}

sub dateconv{
    my $d = shift;
    my %month = qw[jan 1 feb 2 mar 3 apr 4 may 5 jun 6 jul 7 aug 8 sep 9 oct 10 nov 11 dec 12];
    my @p = $d=~/(\d+)-(\w+)-(\d+)\s(\d+):(\d+):(\d+)/;
    $p[1]=$month{ lc $p[1] } - 1;
    return timelocal(@p[5,4,3,2,1,0]);
    #timelocal($sec,$min,$hour,$mday,$mon,$year);
}
close $fh_out;
close $fh;

I think I messed up the regex for the incoming logfile. Can anyone point out where I went wrong?

How do I add 'tids-list.txt' to the search routine?

Thank you very much !!!

Replies are listed 'Best First'.
Re: Comparing Dates and Reoccurance - Part III
by ww (Archbishop) on Mar 22, 2008 at 22:53 UTC
    I second NetWallah's comment re paid consulting (despite his contrarian act of providing a solution).

    Compare the split at line 19 of the previous with what you've done (again!).

    And, to gild the lily, here's a variant on what the previous answer provided, done in a very cumbersome manner, in the hope it offers some additional insight into how the regex in split (and regexen generally) actually work:

    my $data ="2008-Jan-01 00:00:00 UTC (GMT +0000) - Toll: channel = seven, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx, tids = 123456789";

    print "\n---->Variation 1 (Cumbersome coding to be explicit)\n\n";
    # split just once, on " - " (space hyphen space) for the data given
    my ($datetime, $therest) = split /\s-\s/, $data;
    my ($toss1away,$channel,$therest2) = split /=\s(.*?),(.*)/, $therest;
    # capture anything between "= " and the next comma
    my ($toss2away,$ref,$tids_uncleaned) = split /=\s(.*?),/, $therest2;
    my ($toss3away,$tids) = split /=\s(.*)/, $tids_uncleaned;
    print "DT: " . $datetime . "\nChannel: " . $channel . "\nref: " . $ref . "\ntids: " . $tids . "\n";

    print "\n---->Variation 2 (Merely splits the data withOUT removing unneeded descriptors)\n\n";
    my ($datetime,$channel,$ref,$tids) = split /\s-\s|,\s/, $data;   # split on ,
    print "DT: " . $datetime . "\nChannel: " . $channel . "\nref: " . $ref . "\ntids: " . $tids . "\n";

    print "\n---->Variation 3 (Provides a header, removes descriptors from data.\n\tCould easily be revised to push each set of multi-line data to an AoA.)\n\n";
    my $header=<<HEADER;
    DATETIME\t\t\t\tChannel\tRef\t\t\t\ttids
    ==============================================================================================
    HEADER
    print $header;
    # now, get rid of "Toll: channel = " (so output is just "seven"), etc
    my ($throwaway,$cleanchannel) = split /\s=\s(.*)/, $channel;
    ($throwaway,my $cleanref) = split /\s=\s(.*)/, $ref;
    ($throwaway,my $cleantids) = split /\s=\s(.*)/, $tids;
    print "$datetime\t$cleanchannel\t$cleanref\t$cleantids\n";

    OUTPUT

    perl 675658-2.pl

    ---->Variation 1 (Cumbersome coding to be explicit)

    DT: 2008-Jan-01 00:00:00 UTC (GMT +0000)
    Channel: seven
    ref: xxx.xxxxxx.xxx.xxxxx.xxxxxxx
    tids: 123456789

    ---->Variation 2 (Merely splits the data withOUT removing unneeded descriptors)

    DT: 2008-Jan-01 00:00:00 UTC (GMT +0000)
    Channel: Toll: channel = seven
    ref: ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx
    tids: tids = 123456789

    ---->Variation 3 (Provides a header, removes descriptors from data.
            Could easily be revised to push each item to an AoA for further processing, already explained elsewhere.)

    DATETIME                                Channel Ref                             tids
    ==============================================================================================
    2008-Jan-01 00:00:00 UTC (GMT +0000)    seven   xxx.xxxxxx.xxx.xxxxx.xxxxxxx    123456789

    You'll probably get better help on future questions if you keep in mind that we're not collecting part of the payment you receive from (whatever) toll road agency.

Re: Comparing Dates and Reoccurance - Part III
by stiller (Friar) on Mar 22, 2008 at 18:13 UTC
    I think the best help I can offer right now is to ask you to figure out a strategy you can use to investigate the result of applying
    split / - | = /;
    to
    2008-Jan-01 00:00:00 UTC (GMT +0000) - Toll: channel = seven, ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 123456789
    You are going to need that, over and over again.
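One possible strategy, as a minimal sketch: run the split on a single sample line and print each returned field next to its index. The variable names are illustrative only; the pattern and the sample line are the ones quoted above.

```perl
use strict;
use warnings;

# One sample line from the toll log, as quoted in the thread.
my $line = '2008-Jan-01 00:00:00 UTC (GMT +0000) - Toll: channel = seven, '
         . 'ref = xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx, tids = 123456789';

# The same pattern as in the posted script.
my @fields = split / - | = /, $line;

# Print every field with its index so it is obvious what landed where.
for my $i (0 .. $#fields) {
    print "[$i] '$fields[$i]'\n";
}
```

The split yields five fields, and the third one (index 2) is 'seven, ref' rather than the tids value, which sits at index 4. Assigning only the first three fields therefore never puts the tids number into $id.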
Re: Comparing Dates and Reoccurance - Part III
by NetWallah (Canon) on Mar 22, 2008 at 22:26 UTC
    The zero-byte file size was explained to you twice in earlier posts.
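The earlier explanations are not quoted in this thread. Judging only from the script as posted, one likely cause is that the output file is opened but the print statement names no filehandle, so the report goes to STDOUT. A minimal sketch of that effect, using a hypothetical temporary file in place of the real report file:

```perl
use strict;
use warnings;
use IO::Handle;
use File::Temp qw(tempfile);

# Hypothetical scratch file standing in for the report file.
my ($fh_out, $outfile) = tempfile(UNLINK => 1);

# No filehandle given: this line goes to STDOUT, not to the file.
print "123456789\t2008-Jan-01 01:10:00\t2\n";
$fh_out->flush;
my $size_stdout_only = (stat $outfile)[7];

# Filehandle given: this line is written to the file.
print {$fh_out} "123456789\t2008-Jan-01 01:10:00\t2\n";
close $fh_out or die "Could not close outfile: $!";
my $size_after = (stat $outfile)[7];

print "size after plain print: $size_stdout_only\n";
print "size after print to filehandle: $size_after\n";
```

The first size is zero because nothing was ever sent to the file; only the second print, which names $fh_out, makes the file grow.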

    mr_mischief posted excellent functioning code, in response to your previous requests.

    This site does not encourage asking for or providing complete solutions to user requests - that function should be directed toward paid consultants.

    That said, here is some code that integrates mr_mischief's code, your requirements, and my "cleanup".
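NetWallah's code does not appear in this thread. What follows is not that code but a separate minimal sketch of how the tids list could be folded into the search, under several assumptions: the list and the log lines are inlined rather than read from tids-list.txt and the log file, timestamps are treated as UTC via timegm, and the occurrence count is taken to mean every sighting of a tids/channel pair. The names parse_line, %wanted, %last_seen and %count are illustrative.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %MONTH = (jan => 0, feb => 1, mar => 2, apr => 3, may => 4,  jun => 5,
             jul => 6, aug => 7, sep => 8, oct => 9, nov => 10, dec => 11);

# Parse one log line; returns (epoch, timestamp text, channel, tids),
# or an empty list if the line does not match.
sub parse_line {
    my ($line) = @_;
    my ($stamp, $y, $mon, $d, $h, $m, $s, $channel, $tids) = $line =~ m{
        ^ ( (\d{4}) - (\w{3}) - (\d\d) \s (\d\d) : (\d\d) : (\d\d) ) \s UTC
        .*? channel \s = \s (\w+) ,
        .*  tids \s = \s (\d+)
    }x or return;
    my $mon_idx = $MONTH{ lc $mon };
    return unless defined $mon_idx;
    return (timegm($s, $m, $h, $d, $mon_idx, $y), $stamp, $channel, $tids);
}

# Assumption: stands in for the contents of tids-list.txt, one value per line.
my %wanted = map { $_ => 1 } qw(123456789 987654321 112233445 888899889);

# Assumption: three lines standing in for the log file.
my $ref = 'xxx.xxxxxx.xxx.xxxxx.xxxxxxx.xxxxxxxxxxxxxxxxxxxxx';
my @log = (
    "2008-Jan-01 00:00:00 UTC (GMT +0000) - Toll: channel = seven, ref = $ref, tids = 123456789",
    "2008-Jan-01 00:10:00 UTC (GMT +0000) - Toll: channel = six, ref = $ref, tids = 987654321",
    "2008-Jan-01 01:10:00 UTC (GMT +0000) - Toll: channel = seven, ref = $ref, tids = 123456789",
);

my (%last_seen, %count, @report);
for my $line (@log) {
    my ($epoch, $stamp, $channel, $tids) = parse_line($line) or next;
    next unless $wanted{$tids};            # skip tids not in the list

    my $key  = "$tids|$channel";           # tids and channel together form the key
    my $prev = $last_seen{$key};
    $last_seen{$key} = $epoch;
    $count{$key}++;

    push @report, "$tids\t$stamp\t$count{$key}"
        if defined $prev && $epoch - $prev > 3600;
}

print "$_\n" for @report;
```

In a real script the hash would be filled by reading tids-list.txt line by line with chomp, and the report lines would be printed to the output filehandle rather than to STDOUT.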

         "As you get older three things happen. The first is your memory goes, and I can't remember the other two... " - Sir Norman Wisdom