neilwatson has asked for the wisdom of the Perl Monks concerning the following question:

I have a log file. It could be hundreds of megabytes and thousands of lines. The contents would look something like this:
0000HA-00001 begin
Any number of lines here possibly including another begin.
0000HA-00001 doing-work
Any number of lines here possibly including another begin.
0000HA-00002 complete
File continues.

For each line containing a serial number and begin, I have to find the matching doing-work and complete entries. If the matching entries do not exist, report it.

My code (untested):

#!/usr/bin/perl
use strict;
use warnings;

my (@entries, @failed, $x, $doing, $complete);

# first pass: collect the serial numbers that have a 'begin' line
open FH, "logfile" or die "cannot open logfile: $!";
while (<FH>) {
    if (m/(\w{6}-\w{5})\s+begin/) {
        push @entries, $1;
    }
}
close FH;

# one more pass per serial number to look for its other entries
foreach $x (@entries) {
    ($doing, $complete) = (0, 0);
    open FH, "logfile" or die "cannot open logfile: $!";
    while (<FH>) {
        if (m/$x\s+doing-work/) {
            $doing = 1;
        }
        if (m/$x\s+complete/) {
            $complete = 1;
        }
    }
    close FH;
    if ($doing == 1 && $complete == 1) {
        next;
    } else {
        push @failed, $x;    # failed!
    }
}

print "These entries have failed\n";
foreach $x (@failed) {
    print $x . "\n";
}

This involves going through the log file many times. Is there a more efficient way?

Neil Watson
watson-wilson.ca

Re: Efficiently parsing a large file
by matija (Priest) on Apr 08, 2004 at 20:24 UTC
    I think you should create a hash that will hold state:
    if (m/(\w{6}-\w{5})\s+begin/) {
        $state{$1} = 'begin';
    } elsif (m/(\w{6}-\w{5})\s+doing-work/) {
        if ($state{$1} eq 'begin') {
            $state{$1} = 'doing';
        } else {
            warn "$1: doing-work without begin\n";
        }
    } elsif (m/(\w{6}-\w{5})\s+complete/) {
        if ($state{$1} eq 'doing') {
            $state{$1} = 'complete';
        } else {
            warn "$1: got complete, but state=\"$state{$1}\"\n";
        }
    }
    Then at the end, you just go through the hash, and print all entries that aren't in the state complete.
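    A minimal sketch of that final report, assuming the same %state hash as the snippet above:

    # after the read loop: anything not marked 'complete' never finished
    for my $serial (sort keys %state) {
        print "$serial did not complete (last state: $state{$serial})\n"
            unless $state{$serial} eq 'complete';
    }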

    That way, you only need to go through the file once.

      If the file is really huge (several hundred MB) it is important
      • to read the file only once
      • to keep memory allocation small
      I would therefore suggest deleting the hash item on "complete":
      delete $state{$1};
      Looping through the hash after reading the whole file would then show just the non-completed cases.
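      A sketch of that final loop, assuming the delete above has been done while reading:

      # everything that reached 'complete' was deleted, so whatever is
      # left in %state is an incomplete case
      print "$_ incomplete (stuck at $state{$_})\n" for sort keys %state;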

      pelagic

        Hi Pelagic,

        Unless I'm mistaken, I believe he wants to keep the complete items: "For each line containing a serial number and begin, I have to find the matching doing-work and complete entries. If the matching entries do not exist, report it." Then again, I might be wrong ;-)

        Jason L. Froebe

        No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

      I agree with Neil: Very clever

      I am a bit nervous about putting it into a hash that isn't tied to a file because of the amount of memory involved (hundreds of megs for the file). I recommend tying the hash to a file (gdbm or similar) so you can still access it without eating up all the memory on the box.
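      For example, something along these lines (a sketch using DB_File; GDBM_File has essentially the same tie interface, and the file name here is just an example):

      use Fcntl;
      use DB_File;

      # keep the state hash on disk instead of in RAM
      tie my %state, 'DB_File', 'state.db', O_RDWR|O_CREAT, 0644, $DB_HASH
          or die "cannot tie state.db: $!";

      # ... fill and check %state exactly as in the in-memory version ...

      untie %state;
      unlink 'state.db';    # throw away the scratch file when finished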

      hope this helps

      Jason L. Froebe

      No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

        I think that the decision to tie the hash or not depends not upon the size of the file that is being read but more upon what percentage of the file being read has meaningful entries.

        If the majority of the entries in the file being read are just junk that can be ignored, then the state-hash can probably be maintained in memory w/o tying.

        If the data source is very rich, though, then it would be wise to tie the state-hash to another file and manage the entries.

        Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
Re: Efficiently parsing a large file
by DamnDirtyApe (Curate) on Apr 08, 2004 at 20:12 UTC

    Have you considered using Tie::File? I don't honestly know if it will be more efficient or not, but it might make for a more sensible implementation.
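    For reference, a minimal sketch of what that might look like (Tie::File presents the file as an array of lines fetched on demand, so it still walks the whole file once; the regex is only a guess at the log format):

    use Fcntl 'O_RDONLY';
    use Tie::File;

    # each element of @log is one line of the file, read lazily
    tie my @log, 'Tie::File', 'logfile', mode => O_RDONLY
        or die "cannot tie logfile: $!";

    my %state;
    for my $line (@log) {
        $state{$1} = $2 if $line =~ /(\w{6}-\w{5})\s+(begin|doing-work|complete)/;
    }
    untie @log;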


    _______________
    DamnDirtyApe
    Those who know that they are profound strive for clarity. Those who
    would like to seem profound to the crowd strive for obscurity.
                --Friedrich Nietzsche
Re: Efficiently parsing a large file
by jeffa (Bishop) on Apr 08, 2004 at 20:14 UTC

    Breezing through after being frustrated from work, but having to return to the grind ... how about forking? Every time you encounter a 'begin', fork off a process and have it continue to search for the next two tokens.

    Just a thought. :)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
Re: Efficiently parsing a large file
by tachyon (Chancellor) on Apr 09, 2004 at 03:13 UTC

    How about using a bitmap state hash, deleting as you go to limit memory footprint. Single pass.....

    my %states = (
        'begin'      => 1,
        'doing-work' => 2,
        'complete'   => 4,
    );
    my $re = join '|', keys %states;
    $re = qr/^(\S+)\s*($re)/;

    my %fail_decode = (
        6 => 'no begin',
        5 => 'no doing work',
        4 => 'no begin or doing work',
        3 => 'no complete',
        2 => 'no begin or complete',
        1 => 'no doing work or complete',
    );

    my %status;
    while (<DATA>) {
        next unless /$re/;
        $status{$1} += $states{$2};
        delete $status{$1} if $status{$1} == 7;
    }
    printf "$_\t%s\n", $fail_decode{$status{$_}} for keys %status;

    __DATA__
    foo begin
    bar begin
    baz complete
    foo doing-work
    foo complete
    bar doing-work

    cheers

    tachyon

Re: Efficiently parsing a large file
by pizza_milkshake (Monk) on Apr 09, 2004 at 02:34 UTC
    yeah, how about not looping through the file more than once for starters.
    #!perl -wl
    use strict;

    # bit fields
    my %status = qw( begin 1 doing-work 2 complete 4 );

    # figure out what a "perfect" score is
    my $ok = (2 ** (scalar keys %status)) - 1;

    # where we keep stuff
    my (%entries);

    #open F, "<logfile" or die $!;
    while (<DATA>) {
        if (/^(\d{4}HA-\d{5}) (\S+)$/) {
            $entries{$1} |= $status{$2};
        }
    }
    #close F;

    # filter out entries that passed
    for (grep { $entries{$_} == $ok } keys %entries) {
        delete $entries{$_};
    }

    printf("%d entries failed\n", scalar keys %entries);
    for (sort keys %entries) {
        print "$_: $entries{$_}";
    }

    __DATA__
    1234HA-00001 begin
    1234HA-00002 begin
    1234HA-00003 begin
    1234HA-00001 doing-work
    1234HA-00002 doing-work
    1234HA-00001 complete

    perl -e'$_="nwdd\x7F^n\x7Flm{{llql0}qs\x14";s/./chr(ord$&^30)/ge;print'

Re: Efficiently parsing a large file
by bluto (Curate) on Apr 08, 2004 at 21:18 UTC
    Using a hash as mentioned earlier is probably easiest, though it can get slow with huge numbers of elements -- even if you resort to tying the hash to a file. One low-tech way, if you are using some unix variant and have extra disk, may be to run the unix sort utility over the file before you pass through it with perl. Then all of your entries should appear in order, or at least close together (though you may have to play some games with sort's options to get the exact order you want).

    I've used sort on files in the GB range with millions of lines of text. Note: one good sanity check, in addition to checking for errors from sort, is to make sure the sorted file is the same size as the input file. Update: Took out references to "GNU" since I'm not sure I've specifically used it on files that large (though it might work fine).
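    For what it's worth, here is a rough sketch of the single pass over a pre-sorted file (assuming something like `sort logfile > logfile.sorted` has been run first, so all lines for one serial number are adjacent and only one serial's worth of state is held at a time):

    open my $fh, '<', 'logfile.sorted' or die "cannot open sorted log: $!";

    my ($current, %seen);
    while (<$fh>) {
        next unless /^(\w{6}-\w{5})\s+(begin|doing-work|complete)/;
        my ($serial, $stage) = ($1, $2);
        if (defined $current and $serial ne $current) {
            report($current, \%seen);    # previous serial is finished
            %seen = ();
        }
        $current = $serial;
        $seen{$stage}++;
    }
    report($current, \%seen) if defined $current;
    close $fh;

    sub report {
        my ($serial, $seen) = @_;
        my @missing = grep { !$seen->{$_} } qw(begin doing-work complete);
        print "$serial missing: @missing\n" if @missing;
    }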

      Don't forget you could just read the output of sort directly; that way you don't actually need to create another file to read in.
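      Something like this, assuming the system sort is on the PATH:

      # read straight from a pipe; no intermediate sorted file on disk
      open my $sorted, '-|', 'sort', 'logfile'
          or die "cannot run sort: $!";
      while (<$sorted>) {
          # ... same single-pass logic as with the pre-sorted file ...
      }
      close $sorted or die "sort failed: $!";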

      Jason L. Froebe

      No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

        sort will create separate temp files anyway which it then merges together and pipes to you. You really aren't gaining much since you need almost as much disk space. Also, pipes can slow down IO for huge files. They are usually limited to a few KB of buffer so you get a lot of swapping (i.e. sort feeds some bytes to you then goes to sleep, you wake up and process them then go to sleep, then sort feeds some more, etc.).
      Sorting with good tools can be very efficient, but in this case you have to scan through the whole (sorted or unsorted) file once anyway to find the incomplete sets. As the file is nearly sorted already, it's probably most efficient not to sort it explicitly.

      pelagic
        My impression from the original post was that lines were sorted within a serial number, but there could be many lines of other serial numbers interspersed in between. If this is wrong, then you are correct.
Re: Efficiently parsing a large file
by Anonymous Monk on Apr 09, 2004 at 06:12 UTC
    This should do the trick.
    my %idx;
    open FH, "logfile" or die "can't open logfile";
    while (<FH>) {
        my ($ser, $msg) = $_ =~ /^(.+?-.+?)\s+(\S+)/
            or next;    # skip lines that don't carry a serial number
        $idx{$ser}->{$msg}++;
        if (   $idx{$ser}->{'begin'}
            && $idx{$ser}->{'doing-work'}
            && $idx{$ser}->{'complete'} )
        {
            # found all the stuff, remove the entry
            delete $idx{$ser};
        }
    }
    print "didn't complete:\n";
    foreach my $ser (keys %idx) {
        print "\t$ser\n";
    }
Re: Efficiently parsing a large file
by ttg (Initiate) on Apr 13, 2004 at 01:27 UTC
    Neil, Assuming some flavor of Unix...if not disregard. You could pre-process the file with the uniq command which should compress any concurrent entries in the log file into one line. That might significantly reduce the number of lines you need to process. Then building a hash of your opens with a value of 2, and testing your working settings and subtracting 1, (new value in hash is 1), then process your closes setting the value of the corresponding open hash to 0 (subtract one again) might be workable. That way you would know what was opened, worked on, and closed, and it could be a one pass thru the file thing. Just a thought. tim