neilwatson has asked for the wisdom of the Perl Monks concerning the following question:

I have a log file. It could be hundreds of megabytes and thousands of lines. The contents would look something like this:
0000HA-00001 begin
Any number of lines here possibly including another begin.
0000HA-00001 doing-work
Any number of lines here possibly including another begin.
0000HA-00002 complete
File continues.

For each line containing a serial number and begin, I have to find the matching doing-work and complete entries. If the matching entries do not exist, report it.

My code (untested):

#!/usr/bin/perl
use strict;
use warnings;

my (@entries, @failed, $x, $doing, $complete);

# first pass: collect the serial numbers that have a 'begin' line
open FH, "logfile" or die "cannot open logfile: $!";
while (<FH>) {
    if (m/(\w{6}-\w{5})\s+begin/) {
        push @entries, $1;
    }
}
close FH;

# one more pass per serial number to look for its other entries
foreach $x (@entries) {
    ($doing, $complete) = (0, 0);
    open FH, "logfile" or die "cannot open logfile: $!";
    while (<FH>) {
        if (m/$x\s+doing-work/) {
            $doing = 1;
        }
        if (m/$x\s+complete/) {
            $complete = 1;
        }
    }
    close FH;
    if ($doing == 1 && $complete == 1) {
        next;
    } else {
        push @failed, $x;    # failed!
    }
}

print "These entries have failed\n";
foreach $x (@failed) {
    print $x . "\n";
}

This involves going through the log file many times. Is there a more efficient way?

Neil Watson
watson-wilson.ca

Re: Efficiently parsing a large file
by matija (Priest) on Apr 08, 2004 at 20:24 UTC
    I think you should create a hash that will hold state:
    if (m/(\w{6}-\w{5})\s+begin/) {
        $state{$1} = 'begin';
    } elsif (m/(\w{6}-\w{5})\s+doing-work/) {
        if ($state{$1} eq 'begin') {
            $state{$1} = 'doing';
        } else {
            warn "$1: doing-work without begin\n";
        }
    } elsif (m/(\w{6}-\w{5})\s+complete/) {
        if ($state{$1} eq 'doing') {
            $state{$1} = 'complete';
        } else {
            warn "$1: got complete, but state=\"$state{$1}\"\n";
        }
    }
    Then at the end, you just go through the hash, and print all entries that aren't in the state complete.
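    A minimal sketch of that final report, assuming the same %state hash as the snippet above:

    # after the read loop: anything not marked 'complete' never finished
    for my $serial (sort keys %state) {
        print "$serial did not complete (last state: $state{$serial})\n"
            unless $state{$serial} eq 'complete';
    }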

    That way, you only need to go through the file once.

      If the file is really huge (several hundred MB) it is important
      • to read the file only once
      • to keep memory allocation small
      I would therefore suggest deleting the hash item on "complete":
      delete $state{$1};
      Looping through the hash after reading the whole file would then show just the non-completed cases.
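      A sketch of that final loop, assuming the delete above has been done while reading:

      # everything that reached 'complete' was deleted, so whatever is
      # left in %state is an incomplete case
      print "$_ incomplete (stuck at $state{$_})\n" for sort keys %state;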

      pelagic

        Hi Pelagic,

        Unless I'm mistaken, I believe he wants to keep the complete items: "For each line containing a serial number and begin, I have to find the matching doing-work and complete entries. If the matching entries do not exist, report it." Then again, I might be wrong ;-)

        Jason L. Froebe

        No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

      I agree with Neil: Very clever

      I am a bit nervous about putting it into a hash that isn't tied to a file because of the amount of memory involved (hundreds of megs for the file). I recommend tying the hash to a file (gdbm or similar) so you can still access it without eating up all the memory on the box.
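      For example, something along these lines (a sketch using DB_File; GDBM_File has essentially the same tie interface, and the file name here is just an example):

      use Fcntl;
      use DB_File;

      # keep the state hash on disk instead of in RAM
      tie my %state, 'DB_File', 'state.db', O_RDWR|O_CREAT, 0644, $DB_HASH
          or die "cannot tie state.db: $!";

      # ... fill and check %state exactly as in the in-memory version ...

      untie %state;
      unlink 'state.db';    # throw away the scratch file when finished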

      hope this helps

      Jason L. Froebe

      No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

        I think that the decision to tie the hash or not depends not upon the size of the file that is being read but more upon what percentage of the file being read has meaningful entries.

        If the majority of the entries in the file being read are just junk that can be ignored, then the state-hash can probably be maintained in memory w/o tying.

        If the data source is very rich, though, then it would be wise to tie the state-hash to another file and manage the entries.

        Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
Re: Efficiently parsing a large file
by DamnDirtyApe (Curate) on Apr 08, 2004 at 20:12 UTC

    Have you considered using Tie::File? I don't honestly know if it will be more efficient or not, but it might make for a more sensible implementation.
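    For reference, a minimal sketch of what that might look like (Tie::File presents the file as an array of lines fetched on demand, so it still walks the whole file once; the regex is only a guess at the log format):

    use Fcntl 'O_RDONLY';
    use Tie::File;

    # each element of @log is one line of the file, read lazily
    tie my @log, 'Tie::File', 'logfile', mode => O_RDONLY
        or die "cannot tie logfile: $!";

    my %state;
    for my $line (@log) {
        $state{$1} = $2 if $line =~ /(\w{6}-\w{5})\s+(begin|doing-work|complete)/;
    }
    untie @log;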


    _______________
    DamnDirtyApe
    Those who know that they are profound strive for clarity. Those who
    would like to seem profound to the crowd strive for obscurity.
                --Friedrich Nietzsche
Re: Efficiently parsing a large file
by jeffa (Bishop) on Apr 08, 2004 at 20:14 UTC

    Breezing through after being frustrated from work, but having to return to the grind ... how about forking? Every time you encounter a 'begin', fork off a process and have it continue to search for the next two tokens.

    Just a thought. :)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: Efficiently parsing a large file
by tachyon (Chancellor) on Apr 09, 2004 at 03:13 UTC

    How about using a bitmap state hash, deleting as you go to limit memory footprint. Single pass.....

    my %states = (
        'begin'      => 1,
        'doing-work' => 2,
        'complete'   => 4,
    );
    my $re = join '|', keys %states;
    $re = qr/^(\S+)\s*($re)/;

    my %fail_decode = (
        6 => 'no begin',
        5 => 'no doing work',
        4 => 'no begin or doing work',
        3 => 'no complete',
        2 => 'no begin or complete',
        1 => 'no doing work or complete',
    );

    my %status;
    while (<DATA>) {
        next unless /$re/;
        $status{$1} += $states{$2};
        delete $status{$1} if $status{$1} == 7;
    }
    printf "$_\t%s\n", $fail_decode{$status{$_}} for keys %status;

    __DATA__
    foo begin
    bar begin
    baz complete
    foo doing-work
    foo complete
    bar doing-work

    cheers

    tachyon

Re: Efficiently parsing a large file
by pizza_milkshake (Monk) on Apr 09, 2004 at 02:34 UTC
    yeah, how about not looping through the file more than once for starters.
    #!perl -wl
    use strict;

    # bit fields
    my %status = qw( begin 1 doing-work 2 complete 4 );

    # figure out what a "perfect" score is
    my $ok = (2 ** (scalar keys %status)) - 1;

    # where we keep stuff
    my (%entries);

    #open F, "<logfile" or die $!;
    while (<DATA>) {
        if (/^(\d{4}HA-\d{5}) (\S+)$/) {
            $entries{$1} |= $status{$2};
        }
    }
    #close F;

    # filter out entries that passed
    for (grep { $entries{$_} == $ok } keys %entries) {
        delete $entries{$_};
    }

    printf("%d entries failed\n", scalar keys %entries);
    for (sort keys %entries) {
        print "$_: $entries{$_}";
    }

    __DATA__
    1234HA-00001 begin
    1234HA-00002 begin
    1234HA-00003 begin
    1234HA-00001 doing-work
    1234HA-00002 doing-work
    1234HA-00001 complete

    perl -e'$_="nwdd\x7F^n\x7Flm{{llql0}qs\x14";s/./chr(ord$&^30)/ge;print'

Re: Efficiently parsing a large file
by bluto (Curate) on Apr 08, 2004 at 21:18 UTC
    Using a hash as mentioned earlier is probably easiest, though it can get slow with huge numbers of elements -- even if you resort to tying the hash to a file. One low-tech way, if you are using some unix variant and have extra disk, may be to run the unix sort utility over the file before you pass through it with perl. Then all of your entries should appear in order, or at least close together (though you may have to play some games with sort's options to get the exact order you want).

    I've used sort on files in the GB range with millions of lines of text. Note: one good sanity check, in addition to checking for errors from sort, is to make sure the sorted file is the same size as the input file. Update: Took out references to "GNU" since I'm not sure I've specifically used it on files that large (though it might work fine).
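    For what it's worth, here is a rough sketch of the single pass over a pre-sorted file (assuming something like `sort logfile > logfile.sorted` has been run first, so all lines for one serial number are adjacent and only one serial's worth of state is held at a time):

    open my $fh, '<', 'logfile.sorted' or die "cannot open sorted log: $!";

    my ($current, %seen);
    while (<$fh>) {
        next unless /^(\w{6}-\w{5})\s+(begin|doing-work|complete)/;
        my ($serial, $stage) = ($1, $2);
        if (defined $current and $serial ne $current) {
            report($current, \%seen);    # previous serial is finished
            %seen = ();
        }
        $current = $serial;
        $seen{$stage}++;
    }
    report($current, \%seen) if defined $current;
    close $fh;

    sub report {
        my ($serial, $seen) = @_;
        my @missing = grep { !$seen->{$_} } qw(begin doing-work complete);
        print "$serial missing: @missing\n" if @missing;
    }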

      Don't forget you could just read the output of sort directly; that way you don't actually need to create another file to read in.
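      Something like this, assuming the system sort is on the PATH:

      # read straight from a pipe; no intermediate sorted file on disk
      open my $sorted, '-|', 'sort', 'logfile'
          or die "cannot run sort: $!";
      while (<$sorted>) {
          # ... same single-pass logic as with the pre-sorted file ...
      }
      close $sorted or die "sort failed: $!";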

      Jason L. Froebe

      No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

        sort will create separate temp files anyway which it then merges together and pipes to you. You really aren't gaining much since you need almost as much disk space. Also, pipes can slow down IO for huge files. They are usually limited to a few KB of buffer so you get a lot of swapping (i.e. sort feeds some bytes to you then goes to sleep, you wake up and process them then go to sleep, then sort feeds some more, etc.).
      Sorting with good tools can be very efficient, but in this case you have to scan through the whole (sorted or unsorted) file once anyway to find the incomplete sets. As the file is nearly sorted already, it's probably most efficient not to sort it explicitly.

      pelagic
        My impression from the original post was that lines were sorted within a serial number, but there could be many lines of other serial numbers interspersed in between. If this is wrong, then you are correct.
Re: Efficiently parsing a large file
by Anonymous Monk on Apr 09, 2004 at 06:12 UTC
    This should do the trick.
    my %idx;
    open FH, "logfile" or die "can't open logfile";
    while (<FH>) {
        my ($ser, $msg) = $_ =~ /^(.+?-.+?)\s+(\S+)/
            or next;    # skip lines that don't carry a serial number
        $idx{$ser}->{$msg}++;
        if (   $idx{$ser}->{'begin'}
            && $idx{$ser}->{'doing-work'}
            && $idx{$ser}->{'complete'} )
        {
            # found all the stuff, remove the entry
            delete $idx{$ser};
        }
    }
    print "didn't complete:\n";
    foreach my $ser (keys %idx) {
        print "\t$ser\n";
    }
Re: Efficiently parsing a large file
by ttg (Initiate) on Apr 13, 2004 at 01:27 UTC
    Neil, Assuming some flavor of Unix...if not disregard. You could pre-process the file with the uniq command which should compress any concurrent entries in the log file into one line. That might significantly reduce the number of lines you need to process. Then building a hash of your opens with a value of 2, and testing your working settings and subtracting 1, (new value in hash is 1), then process your closes setting the value of the corresponding open hash to 0 (subtract one again) might be workable. That way you would know what was opened, worked on, and closed, and it could be a one pass thru the file thing. Just a thought. tim