in reply to Re^3: Splitting Apache Log Files
in thread Splitting Apache Log Files

WOW, I never knew that. Thanks a lot; the execution time went down to 22 seconds.

Re^5: Splitting Apache Log Files
by BrowserUk (Patriarch) on Apr 26, 2010 at 21:02 UTC
    execution time went down to 22 seconds.

    You must have a pretty damn fast disk. SSD?

    I split a 500MB file into 10 files on the basis of a single digit at a fixed position in each line:

    while( <> ) { print {$fhs[ substr $_, 2, 1 ]} $_; }

    And I can't get below 1 minute.
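    Fleshed out, the test might look something like this (the @fhs setup and the output filenames are placeholders; only the loop above is the original):

    use strict;
    use warnings;

    # One output handle per digit 0-9 (the filenames are placeholders).
    my @fhs = map {
        open my $fh, '>', "split.$_.log" or die "split.$_.log: $!";
        $fh;
    } 0 .. 9;

    # Route each line by the single digit at fixed offset 2.
    while ( <> ) {
        print { $fhs[ substr $_, 2, 1 ] } $_;
    }

    close $_ for @fhs;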

    If you're managing to test each line against many (how many?) regexes, and still beat mine by 60%, I want a disk like yours.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I'm not sure of the exact specs... but it's a production server. I assume it has RAID and SCSI hard disks.

        If precompiling the regex makes such a big difference, you might consider eliminating your inner loop over the regexes entirely by combining them into a single regex.

        What difference, if any, it makes will depend on how close to IO-bound you are, but starting the regex engine multiple times for each line is relatively expensive. Combining the regexes so that they capture the matched string, and then using the captured text to look up the appropriate filehandle, might achieve some savings.
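        Something along these lines, with made-up site names standing in for your actual patterns:

        use strict;
        use warnings;

        # One output handle per site; the names are illustrative.
        my %fh_for;
        for my $site ( qw( www.foo.example www.bar.example ) ) {
            open my $fh, '>', "$site.log" or die "$site.log: $!";
            $fh_for{$site} = $fh;
        }

        # A single precompiled alternation that captures which site
        # matched, instead of one regex per site per line.
        my $alternation = join '|', map { quotemeta } keys %fh_for;
        my $re          = qr/($alternation)/;

        while ( <> ) {
            print { $fh_for{$1} } $_ if /$re/;
        }

        close $_ for values %fh_for;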


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re^5: Splitting Apache Log Files
by Marshall (Canon) on Apr 27, 2010 at 17:51 UTC
    As a simple I/O test, I played around with creating a 500MB file and then copying it to another file on my old WinXP machine.
    #!/usr/bin/perl -w
    use strict;

    my $begin = time();
    open (BIG, ">fiveHundredMB") or die "cannot open file for 500MB write";

    # Build a 1MB buffer of 256-byte lines, then write it 500 times.
    my $c255 = '*'x255;
    my $c256 = "$c255\n";
    my $oneK = "$c256"x4;
    my $oneMB = "$oneK"x1024;
    print BIG $oneMB for (1..500);
    close BIG;

    my $end = time();
    print "elapsed time for 500MB file is: ", $end-$begin, " seconds\n";
    __END__
    elapsed time for 500MB file is: 9 seconds

    I opened this file in my text editor and there are 2,048,000 lines of 256 chars = 524,288,000 bytes. Windows says 526,336,000 bytes at the command line. That is exactly 2,048,000 bytes more, one extra byte per line, which I take to be the CR that Windows text mode adds to each "\n". Either way, this is basically a ~500MB file.
    #!/usr/bin/perl -w
    use strict;

    my $begin = time();
    open (BIG, "<fiveHundredMB") or die "cannot open file for 500MB read";
    open (OUT, ">bigfile") or die "cannot open bigfile for write";

    # Straight line-by-line copy, no processing.
    while (<BIG>) {
        print OUT $_;
    }
    close OUT;

    my $end = time();
    print "elapsed time for 500MB file is: ", $end-$begin, " seconds\n";
    __END__
    prints:
    elapsed time for 500MB file is: 13 seconds
    Even at 22 seconds, the execution time seems slow, but that depends upon the number of regexes you are running per line of input and how many lines there are. I suspect your lines are far shorter than 1,024 bytes on average. If the performance is adequate for your use, I would stick a fork in it and call it done; I wouldn't worry about it. About 13-14 seconds (roughly 38 MB/s read plus 38 MB/s written) is as fast as a single HD can go without any processing of the data.