jasonl has asked for the wisdom of the Perl Monks concerning the following question:

I'm parsing a large logfile (this is sort-of-but-not-entirely related to another question I just posted about sorting by timestamp) that has data from a bunch of different clients in it, and once a particular client hits a trigger, I need to get all data in the logfile from that client, whether it's before or after the trigger. What I'm doing currently is looping through the file twice, getting the list of clients that hit the trigger on the first pass, and then on the second pass looking for *any* lines logged by one of those clients, like this:

%clientHash = getTrigger(\*INFILE);
seek(INFILE, 0, 0);
%leaseHash  = readLog(\*INFILE, \%clientHash);

(readLog() splits each input line and looks to see if the client field from that line matches an existing key in %clientHash.) This works, but I'm wondering if there's a cleaner way to do it.
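
For reference, here is a minimal sketch of that two-pass structure. The filename, the position of the client field, and the TRIGGER pattern are assumptions for illustration, not the actual getTrigger()/readLog() internals:

use strict;
use warnings;

open my $in, '<', 'clients.log' or die "open clients.log: $!";

# Pass 1: remember every client that ever hits the trigger.
my %clientHash;
while ( my $line = <$in> ) {
    my ($client) = split ' ', $line;                  # assumed: client is the first field
    next unless defined $client;
    $clientHash{$client} = 1 if $line =~ /TRIGGER/;   # assumed trigger pattern
}

# Pass 2: rewind and keep every line belonging to a triggered client.
seek $in, 0, 0;
my %leaseHash;
while ( my $line = <$in> ) {
    my ($client) = split ' ', $line;
    next unless defined $client;
    push @{ $leaseHash{$client} }, $line if exists $clientHash{$client};
}
close $in;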

Re: is this the most efficient way to double-parse a large file?
by hdb (Monsignor) on Jan 19, 2014 at 19:07 UTC

    If your file is huge, then it is probably more of a database job. Otherwise, you could build an index for ALL clients in the first round using tell and then use seek to retrieve those lines in the second round. Whether this extra complexity is justified would depend on the size of your log file and the number of clients you have.

    Update: sample code

use strict;
use warnings;

my %lines;        # byte offsets of every line, keyed by client
my %triggered;    # clients that hit the trigger

# First pass: record where each client's lines start, and note triggers.
my $tell = tell DATA;
while (<DATA>) {
    if ( /^(.*?):/ ) {
        my $client = $1;
        push @{ $lines{$client} }, $tell;
        $triggered{$client}++ if /trigger/;
    }
    $tell = tell DATA;
}

print "Found triggers for ", join( ", ", keys %triggered ), ".\n";

# Second pass: seek back to the stored offsets for the triggered clients only.
for ( keys %triggered ) {
    print "Log for client $_:\n";
    for ( @{ $lines{$_} } ) {
        seek DATA, $_, 0;
        my $line = <DATA>;
        print "\t$line";
    }
}

__DATA__
arthur: line 1
arthur: line 2 trigger
ford: line 3
# some comment
zaphod: line 4
zaphod: line 5 trigger
arthur: line 6
ford: line 7
Re: is this the most efficient way to double-parse a large file?
by Laurent_R (Canon) on Jan 19, 2014 at 20:04 UTC
    One possibility might be to split your log file into client-specific files as you parse your log file, so that when you hit the limit for a client, you can just use the file related to that client.
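
    A rough sketch of that split-into-per-client-files idea (the file naming scheme, the position of the client field, and the TRIGGER pattern are assumptions); note GrandFather's caveat below about opening too many file handles at once:

use strict;
use warnings;

open my $in, '<', 'clients.log' or die "open clients.log: $!";

my %out;          # per-client output handles
my %triggered;    # clients that hit the trigger

while ( my $line = <$in> ) {
    my ($client) = split ' ', $line;    # assumed: client is the first field
    next unless defined $client;
    unless ( $out{$client} ) {
        open $out{$client}, '>', "client_$client.log"
            or die "open client_$client.log: $!";
    }
    print { $out{$client} } $line;
    $triggered{$client} = 1 if $line =~ /TRIGGER/;    # assumed trigger pattern
}

close $_ for values %out;
close $in;

# The per-client files for the triggered clients now hold everything needed.
print "client_$_.log holds the full history for $_\n" for keys %triggered;
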
Re: is this the most efficient way to double-parse a large file?
by GrandFather (Saint) on Jan 21, 2014 at 00:49 UTC

    How big is your "large logfile"? If it's less than 1/2 the memory in your computer, then simply keep all the client data in a hash, parse the file only once, and print the report from the triggered clients' data at the end.
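
    A minimal single-pass sketch along those lines (log layout, client field position, and TRIGGER pattern are assumptions): hold every client's lines in memory, then report only the triggered clients at the end.

use strict;
use warnings;

open my $in, '<', 'clients.log' or die "open clients.log: $!";

my %lines;        # every line seen, keyed by client
my %triggered;    # clients that hit the trigger

while ( my $line = <$in> ) {
    my ($client) = split ' ', $line;    # assumed: client is the first field
    next unless defined $client;
    push @{ $lines{$client} }, $line;
    $triggered{$client} = 1 if $line =~ /TRIGGER/;    # assumed trigger pattern
}
close $in;

# Report each triggered client's complete history.
for my $client ( sort keys %triggered ) {
    print "Log for client $client:\n";
    print "\t$_" for @{ $lines{$client} };
}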

    If the file is too big for the memory based approach and memory size > (20 * clients * client entries), then hdb's solution should be fine. Otherwise use either a database as suggested by hdb, or parse the logfile once but write each client's data out to its own file in the report format you need, then process the triggered clients' data after parsing your logfile. Note that you need to take care not to open too many file handles with this last approach!

    True laziness is hard work

      By today's standards it's probably not excessively large, +/- 100MB each (although there could be cases where multiple files will be concatenated before processing). I was worried that a single hash with everything in it would be too large, but if 1/2 of available memory is the rule of thumb, I should be good. A DB is definitely overkill, as each dataset will likely only be processed once or twice and then discarded.

      Thanks.

        The 1/2 could be almost any number. The reply was meant more to shake up your thinking a little and nudge you toward "let's try the simple way first". Remember: premature optimisation is the root of all evil.

        The important rule of thumb is: "If the code changes take longer than the run time saved, it's fast enough already".
