in reply to Re^4: Splitting Apache Log Files
in thread Splitting Apache Log Files

execution time went down to 22 seconds.

You must have a pretty damn fast disk. SSD?

I split a 500MB file into 10 on the basis of a single digit at a fixed position in each line:

while( <> ) { print {$fhs[ substr $_, 2, 1 ]} $_; }

And I can't get below 1 minute.
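
For anyone wanting to reproduce that, here is a minimal sketch of the setup the one-liner assumes: @fhs has to hold ten open output handles before the loop starts. The helper name split_by_digit, the sample lines and the in-memory buffers are all made up for illustration; a real run would open ten files on disk instead.

```perl
use strict;
use warnings;

# Hypothetical helper: copy each line from $in to one of ten outputs,
# chosen by the single digit at offset 2 of the line.
sub split_by_digit {
    my( $in, $fhs ) = @_;
    while( my $line = <$in> ) {
        next if length( $line ) < 3;             # too short to hold the key
        my $digit = substr $line, 2, 1;
        next unless $digit =~ /\A[0-9]\z/;       # skip lines with no digit there
        print { $fhs->[ $digit ] } $line;        # note: no comma after the block
    }
    return;
}

# Made-up input: 25 lines whose third character cycles through 0..9.
my $input = join '', map { sprintf "ab%dxyz\n", $_ % 10 } 0 .. 24;
open my $in, '<', \$input or die "Cannot open input: $!";

# In-memory buffers stand in for the ten output files (an assumption).
my @buffers = ( '' ) x 10;
my @fhs;
for my $i ( 0 .. 9 ) {
    open $fhs[ $i ], '>', \$buffers[ $i ] or die "Cannot open buffer $i: $!";
}

split_by_digit( $in, \@fhs );
close $_ for @fhs;
```

With real files the loop does almost no work per line, so the run time should be dominated by disk throughput, which is why the 22 seconds surprised me.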

If you're managing to test each line against many (how many?) regexes, and still beat mine by 60%, I want a disk like yours.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Re^6: Splitting Apache Log Files
by cmm7825 (Novice) on Apr 27, 2010 at 05:36 UTC
    I'm not sure of the exact specs, but it's a production server. I assume it has RAID and SCSI hard disks.

      If precompiling the regexes makes such a big difference, you might try eliminating your inner loop over them by combining them into a single regex.

      What difference, if any, that makes will depend on how close to IO-bound you are, but starting the regex engine multiple times for each line is relatively expensive. Combining the regexes so that they capture the matched string, and then using the captured text to look up the appropriate filehandle, might achieve some savings.
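
Something along these lines, as a rough sketch only. Your actual regexes aren't shown in the thread, so @patterns, route_line and the sample log lines are invented, and in-memory buffers stand in for your output files. It also assumes the patterns are literal strings, so that the captured text is the same as the hash key; with real regex alternatives you would need another way to map the capture back to a handle.

```perl
use strict;
use warnings;

# Hypothetical literal patterns; substitute your own.
my @patterns = qw( /images/ /cgi-bin/ /blog/ );

# One alternation, compiled once, with a capture group around it.
my $alt      = join '|', map { quotemeta( $_ ) } @patterns;
my $combined = qr/($alt)/;

# Map each pattern's text to an output handle (in-memory for the demo).
my( %buffer, %fh_for );
for my $p ( @patterns ) {
    $buffer{ $p } = '';
    open $fh_for{ $p }, '>', \$buffer{ $p } or die "Cannot open buffer: $!";
}

# Route one line: a single regex match, then a hash lookup on the capture.
sub route_line {
    my( $line ) = @_;
    return 0 unless $line =~ $combined;
    print { $fh_for{ $1 } } $line;
    return 1;
}

route_line( $_ ) for
    qq{GET /images/logo.png 200\n},
    qq{GET /blog/post-1 200\n},
    qq{GET /favicon.ico 404\n},
    qq{GET /images/a.gif 304\n};

close $_ for values %fh_for;
```

Each line is matched once instead of once per pattern. Note that a line matching more than one pattern goes only to whichever alternative matches first, which may or may not be what your loop does now.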

