in reply to Efficiently parsing a large file

I think you should create a hash that will hold state:
    if (m/(\w{6}-\w{5})\s+begin/) {
        $state{$1} = 'begin';
    }
    elsif (m/(\w{6}-\w{5})\s+doing-work/) {
        if ($state{$1} eq 'begin') {
            $state{$1} = 'doing';
        }
        else {
            warn "$1: doing work without begin\n";
        }
    }
    elsif (m/(\w{6}-\w{5})\s+complete/) {
        if ($state{$1} eq 'doing') {
            $state{$1} = 'complete';
        }
        else {
            warn "$1: got complete, but state=\"$state{$1}\"\n";
        }
    }
Then at the end, you just go through the hash and print all entries that aren't in the state 'complete'.

That way, you only need to go through the file once.
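The end-of-file sweep could look something like this (a sketch; %state is the hash built while reading the file with the chain above):

```perl
# After the single pass, report every serial that never reached 'complete'.
for my $serial (sort keys %state) {
    print "$serial stuck in state '$state{$serial}'\n"
        if $state{$serial} ne 'complete';
}
```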

Re: Re: Efficiently parsing a large file
by pelagic (Priest) on Apr 08, 2004 at 21:02 UTC
    If the file is really huge (several hundred MB) it is important
    • to read the file only once
    • to keep memory allocation small
    I would therefore suggest deleting the hash item in the "complete" case:
    delete ($state{$1});
    Looping through the hash having read the whole file would then show just the non completed cases.
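    With that variation, the "complete" branch removes the entry instead of updating it, so the hash only ever holds in-flight serials (a sketch, reusing the branch from the parent node):

    ```perl
    elsif (m/(\w{6}-\w{5})\s+complete/) {
        if ($state{$1} eq 'doing') {
            delete $state{$1};      # done: free the memory right away
        }
        else {
            warn "$1: got complete, but state=\"$state{$1}\"\n";
        }
    }
    # At EOF, whatever remains in %state is a non-completed case:
    print "$_ never completed (last state: $state{$_})\n" for keys %state;
    ```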

    pelagic

      Hi Pelagic,

      Unless I'm mistaken, I believe he wants to keep the complete items: "For each line containing serial number and begin I have to find the matching doing-work and complete entries. If the matching entries do not exist, report it." Then again, I might be wrong ;-)

      Jason L. Froebe

      No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

        Neil will be telling us ...

        pelagic
Re: Re: Efficiently parsing a large file
by jfroebe (Parson) on Apr 08, 2004 at 21:01 UTC

    I agree with Neil: Very clever

    I am a bit nervous about putting it into a hash that isn't tied to a file because of the amount of memory involved (hundreds of megs for the file). I recommend tying the hash to a file (gdbm or similar) so you can still access it without eating up all the memory on the box.
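    Tying the hash to an on-disk DBM is a one-line change to the setup; the parsing code stays the same (a sketch, assuming DB_File is available and `state.db` is a filename of your choosing):

    ```perl
    use Fcntl;    # for O_RDWR, O_CREAT
    use DB_File;  # GDBM_File works similarly

    # Entries now live in 'state.db' on disk instead of in memory.
    tie my %state, 'DB_File', 'state.db', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "cannot tie state.db: $!";

    $state{'ABCDEF-12345'} = 'begin';   # reads/writes go to disk transparently

    untie %state;                       # flush and close when done
    ```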

    hope this helps

    Jason L. Froebe


      I think that the decision to tie the hash or not depends not upon the size of the file that is being read but more upon what percentage of the file being read has meaningful entries.

      If the majority of the entries in the file being read are just junk that can be ignored, then the state-hash can probably be maintained in memory w/o tying.

      If the data source is very rich, though, then it would be wise to tie the state-hash to another file and manage the entries.

      Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
Re: Re: Efficiently parsing a large file
by neilwatson (Priest) on Apr 08, 2004 at 20:32 UTC