in reply to Efficiently parsing a large file

I think you should create a hash that will hold state:
    if (m/(\w{6}-\w{5})\s+begin/) {
        $state{$1} = 'begin';
    }
    elsif (m/(\w{6}-\w{5})\s+doing-work/) {
        if ($state{$1} eq 'begin') {
            $state{$1} = 'doing';
        }
        else {
            warn "$1: doing work without begin\n";
        }
    }
    elsif (m/(\w{6}-\w{5})\s+complete/) {
        if ($state{$1} eq 'doing') {
            $state{$1} = 'complete';
        }
        else {
            warn "$1: got complete, but state=\"$state{$1}\"\n";
        }
    }
Then at the end, you just go through the hash and print all entries that aren't in the state 'complete'.

That way, you only need to go through the file once.
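The end-of-file sweep could look something like this (a sketch; %state is the hash built while reading the file with the chain above):

```perl
# After the single pass, report every serial that never reached 'complete'.
for my $serial (sort keys %state) {
    print "$serial stuck in state '$state{$serial}'\n"
        if $state{$serial} ne 'complete';
}
```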

Re: Re: Efficiently parsing a large file
by pelagic (Priest) on Apr 08, 2004 at 21:02 UTC
    If the file is really huge (several hundred MB) it is important
    • to read the file only once
    • to keep memory allocation small
    I would therefore suggest deleting the hash item in the "complete" case:
    delete ($state{$1});
    Looping through the hash having read the whole file would then show just the non completed cases.
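    With that variation, the "complete" branch removes the entry instead of updating it, so the hash only ever holds in-flight serials (a sketch, reusing the branch from the parent node):

    ```perl
    elsif (m/(\w{6}-\w{5})\s+complete/) {
        if ($state{$1} eq 'doing') {
            delete $state{$1};      # done: free the memory right away
        }
        else {
            warn "$1: got complete, but state=\"$state{$1}\"\n";
        }
    }
    # At EOF, whatever remains in %state is a non-completed case:
    print "$_ never completed (last state: $state{$_})\n" for keys %state;
    ```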

    pelagic

      Hi Pelagic,

      Unless I'm mistaken, I believe he wants to keep the complete items: "For each line containing serial number and begin I have to find the matching doing-work and complete entries. If the matching entries do not exist, report it." Then again, I might be wrong ;-)

      Jason L. Froebe

      No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

        Neil will be telling us ...

        pelagic
Re: Re: Efficiently parsing a large file
by jfroebe (Parson) on Apr 08, 2004 at 21:01 UTC

    I agree with Neil: Very clever

    I am a bit nervous about putting it into a hash that isn't tied to a file because of the amount of memory involved (hundreds of megs for the file). I recommend tying the hash to a file (gdbm or similar) so you can still access it without eating up all the memory on the box.
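    Tying the hash to an on-disk DBM is a one-line change to the setup; the parsing code stays the same (a sketch, assuming DB_File is available and `state.db` is a filename of your choosing):

    ```perl
    use Fcntl;    # for O_RDWR, O_CREAT
    use DB_File;  # GDBM_File works similarly

    # Entries now live in 'state.db' on disk instead of in memory.
    tie my %state, 'DB_File', 'state.db', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "cannot tie state.db: $!";

    $state{'ABCDEF-12345'} = 'begin';   # reads/writes go to disk transparently

    untie %state;                       # flush and close when done
    ```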

    hope this helps

    Jason L. Froebe


      I think that the decision to tie the hash or not depends not upon the size of the file that is being read but more upon what percentage of the file being read has meaningful entries.

      If the majority of the entries in the file being read are just junk that can be ignored, then the state-hash can probably be maintained in memory w/o tying.

      If the data source is very rich, though, then it would be wise to tie the state-hash to another file and manage the entries.

      Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
Re: Re: Efficiently parsing a large file
by neilwatson (Priest) on Apr 08, 2004 at 20:32 UTC