in reply to Re: How to improve speed of reading big files
in thread How to improve speed of reading big files

Thanks for your reply, it helped a lot.
I gained 10 seconds in the filterLog function and 10 more in mergeLogs (not counting the filterLog part).

Here is the final version of the code:

sub mergeLogs {
    my ($day, @files) = @_;
    my @lines;

    foreach my $file (@files) {
        my $fh = openLogFile($file);
        if (! defined $fh) {
            warn "$0: ignoring file $file\n";
            next;
        }
        warn "-> processing $file\n" if $opts{'verbose'} > 0;
        push @lines, grep { /Running|Dump|FromCB|Update/o && &filterLog } <$fh>;
        close $fh;
    }

    use Benchmark 'cmpthese';
    cmpthese( -1, {
        'ST'   => sub { ( sortLinesST(   \@lines ) )[ 0 ] },
        'GRT'  => sub { ( sortLinesGRT(  \@lines ) )[ 0 ] },
        'GRT2' => sub { ( sortLinesGRT2( \@lines ) )[ 0 ] },
        'GRT3' => sub { ( sortLinesGRT3( \@lines ) )[ 0 ] },
    } );
    exit;
}

sub filterLog {
    return 0 if exists $opts{'day'} && ! /^$opts{'day'}/o;

    if (! exists $opts{'server'}) {
        if (/\* Running on (\w+) -/) {
            $opts{'server'} = lc $1;
            return 0;
        }
    }

    if (exists $opts{'start-time'} || exists $opts{'stop-time'}) {
        if (/(\d{2}:\d{2}:\d{2}\.\d{3})/o) {
            return 0 if exists $opts{'start-time'} && $1 lt $opts{'start-time'};
            return 0 if exists $opts{'stop-time'}  && $1 gt $opts{'stop-time'};
        }
    }

    return 0 if exists $opts{'user'} && ! /[\(\[]\s*(?:$opts{'user'})/o;

    s/ {2,}/ /go;
    s/ ?: / /go;
    s/^((?:\S+ ){3}).+?\[?I\]?:/$1/o;
    s/ ACK (\w) / $1 ACK /o;

    warn $_ if $opts{'verbose'} > 3;
    return 1;
}

sub sortLinesST {
    my $href = shift;
    return [
        map  { $_->[0] }
        sort { $a->[1] cmp $b->[1] }
        map  { [ $_, /(\d{2}:\d{2}:\d{2}\.\d{3})/o ] } @$href
    ];
}

sub sortLinesGRT {
    my $href = shift;
    return [
        map { substr($_, 12) }
        sort
        map { /(\d{2}:\d{2}:\d{2}\.\d{3})/o; $1 . $_ } @$href
    ];
}

sub sortLinesGRT2 {
    my $href = shift;
    return [
        map { substr($_, 4) }
        sort
        map {
            /(\d{2}):(\d{2}):(\d{2})\.(\d{3})/o;
            pack ('NA*', ( $1*60 + $2 )*60 + "$3.$4" ) . $_
        } @$href
    ];
}

sub sortLinesGRT3 {
    my $href = shift;
    return [
        @$href[
            map { unpack "N", substr($_, -4) }
            sort
            map {
                $href->[$_] =~ /(\d{2}:\d{2}:\d{2}\.\d{3})/o;
                $1 . pack ("N", $_)
            } 0 .. $#$href
        ]
    ];
}

I ran some benchmarks to find the most appropriate sort technique; here are the results:

       Rate GRT2   ST GRT3  GRT
GRT2 30.2/s   -- -19% -42% -45%
ST   37.1/s  23%   -- -28% -32%
GRT3 51.9/s  72%  40%   --  -5%
GRT  54.6/s  81%  47%   5%   --

At first I thought the GRT3 sub, with its index sort, would be the fastest, but apparently it is not.
I also do not understand why the GRT2 sub is slower than the ST sub.

Re^3: How to improve speed of reading big files
by BrowserUk (Patriarch) on Sep 18, 2009 at 18:47 UTC
    At first I thought the GRT3 sub, with its index sort, would be the fastest, but apparently it is not.

    There is a famous quote (that I cannot actually remember :) which says, in effect, that programmers' intuition about where the bottlenecks in their code occur is usually wrong. I've found that this is never more true than with Perl.

    As sort is essentially only shuffling pointers and aliases around, whether those point to short strings or long ones makes little difference.
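
    For what it's worth, a quick sketch of that point (my own illustration, with arbitrary sizes; not from the thread): the same ten thousand keys are sorted bare and then dragging a 500-byte payload, and the rates come out close because most comparisons are decided within the first few characters:

    use strict;
    use warnings;
    use Benchmark 'cmpthese';

    # 10,000 random 12-char keys, bare ...
    my @short = map { sprintf '%012.3f', rand 86400 } 1 .. 10_000;
    # ... and the same keys, each dragging a 500-byte payload.
    my @long  = map { $_ . ( 'x' x 500 ) } @short;

    # Comparisons are (almost always) decided within the first few
    # characters, so the payload barely changes the sort rate.
    cmpthese( -1, {
        short => sub { my @s = sort @short },
        long  => sub { my @s = sort @long  },
    } );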

    When you append the full record to the sort key, and the first N characters of the records are often identical (as they will be if every record starts with the date 'yyyymmdd' and you are selecting by date), each comparison has to scan further into the string before it can decide the ordering. With a unique binary index appended to the key instead, a comparison goes at most 4 bytes beyond the key, which makes the index sort itself a little faster. However, you then have to do a full array-slicing operation to reorder the records, and that is comparatively more expensive than scanning a few extra bytes on each comparison.

    I also do not understand why the GRT2 sub is slower than the ST sub.

    It's all down to the number of Perl primitives, I guess. Four captures in the regex instead of one, plus the math, is enough to tip the balance in favour of the ST for lowish record counts (~6000 in my quick runs of your benchmark).

    However, your GRT2 implementation is flawed in several ways:

    pack ('NA*', ( $1*60 + $2 )*60 + "$3.$4" ) . $_

    1. You're concatenating the full record to the end of the pack'd key, so the 'A*' part of the template is redundant.

      That should either be pack ('NA*', ( $1*60 + $2 )*60 + "$3.$4", $_ )

      or pack ('N', ( $1*60 + $2 )*60 + "$3.$4" ) . $_

    2. More importantly, by using 'N' rather than 'd' as I did, you are effectively int-ing the seconds, which means that times falling within the same second will not be ordered correctly (see the sketch after this list).

      However, I understand why you probably did that. The IEEE 754 format is, in part, specifically designed so that the binary representation sorts correctly. But that does not account for the fact that the Intel native representation is byte-swapped (little-endian). So, in order to sort packed binary doubles correctly, you have to undo the byte-swapping.

    3. And that also highlights a second stupidity of mine that you've carried through.

      Why capture the seconds and milliseconds as separate pieces and then stick them back together, when you can capture them as a single piece?
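
    To see both problems concretely, here is a minimal sketch (my own illustration, values arbitrary): 'N' collapses two times that fall within the same second, while byte-reversed 'd' keys keep the fraction and string-sort in numeric order on a little-endian machine:

    use strict;
    use warnings;

    # Two times inside the same second: 10:15:30.123 and 10:15:30.456.
    my @secs = ( (10*60 + 15)*60 + 30.123, (10*60 + 15)*60 + 30.456 );

    # 'N' packs an unsigned 32-bit integer, silently dropping the
    # fraction, so both times collapse to the same 4-byte key:
    print "N keys identical\n"
        if pack( 'N', $secs[0] ) eq pack( 'N', $secs[1] );

    # 'd' keeps the fraction; reversing the bytes puts sign and
    # exponent first, so (for non-negative values, on little-endian
    # hardware) a plain string sort of the keys matches numeric order:
    my @sorted = map  { unpack 'd', scalar reverse $_ }
                 sort
                 map  { scalar reverse pack 'd', $_ }
                      ( 2.5, $secs[1], $secs[0], 1.75 );
    print "@sorted\n";    # 1.75 2.5 36930.123 36930.456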

    GRT2 is better coded as:

    sub sortLinesGRT2 {
        my $href = shift;
        return [
            map { unpack 'x[d]a*', $_ }
            sort
            map {
                /(\d{2}):(\d{2}):(\d{2}\.\d{3})/o;
                scalar( reverse pack ('d', ( $1*60 + $2 )*60 + $3 ) ) . $_
            } @$href
        ];
    }

    This corrects the errors and sorts correctly. Surprisingly, it also runs a bit quicker than your original, overtaking the ST at around 1500 records rather than 6000. But it is still way slower than the straightforward GRT implementation:

    C:\test>for /l %i in (1,1,5) do @795905 -L=1e%i
             Rate GRT2 GRT3   ST  GRT
    GRT2  15665/s   -- -26% -26% -45%
    GRT3  21060/s  34%   --   0% -26%
    ST    21060/s  34%   0%   -- -26%
    GRT   28550/s  82%  36%  36%   --

            Rate GRT2   ST GRT3  GRT
    GRT2  1687/s   -- -14% -29% -48%
    ST    1969/s  17%   -- -18% -39%
    GRT3  2390/s  42%  21%   -- -26%
    GRT   3251/s  93%  65%  36%   --

           Rate GRT2   ST GRT3  GRT
    GRT2  166/s   --  -1% -28% -47%
    ST    169/s   2%   -- -27% -46%
    GRT3  232/s  40%  38%   -- -26%
    GRT   312/s  88%  85%  34%   --

            Rate   ST GRT2 GRT3  GRT
    ST    13.4/s   -- -16% -39% -53%
    GRT2  16.0/s  20%   -- -27% -44%
    GRT3  22.0/s  64%  37%   -- -23%
    GRT   28.6/s 113%  78%  30%   --

    (warning: too few iterations for a reliable count)
    (warning: too few iterations for a reliable count)
    (warning: too few iterations for a reliable count)
    (warning: too few iterations for a reliable count)
             Rate   ST GRT2 GRT3  GRT
    ST    0.986/s   -- -28% -45% -56%
    GRT2   1.38/s  40%   -- -23% -39%
    GRT3   1.78/s  80%  29%   -- -21%
    GRT    2.26/s 129%  64%  27%   --

    Update: Finally, you are currently parsing the time twice: once in the filter sub and again for the GRT. If you prepended the sort key within the filter sub, you could avoid another complete pass over the records.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: How to improve speed of reading big files
by BrowserUk (Patriarch) on Sep 18, 2009 at 22:20 UTC

    If you have the time and inclination, try this and see how you fare. You might have to make a few tweaks: it compiles clean but is otherwise untested beyond a mental run-through, and my mind is notoriously unreliable :)

    sub mergeLogs {
        my ($day, @files) = @_;
        my @lines;

        foreach my $file (@files) {
            my $fh = openLogFile($file);
            warn( "$0: ignoring file $file\n" ), next unless defined $fh;
            warn "-> processing $file\n" if $opts{'verbose'} > 0;

            while( <$fh> ) {
                next unless /Running|Dump|FromCB|Update/o;
                next if exists $opts{'day'}  && ! /^$opts{'day'}/o;
                next if exists $opts{'user'} && ! /[\(\[]\s*(?:$opts{'user'})/o;

                $opts{'server'} = lc $1, next
                    if ! exists $opts{'server'} && /\* Running on (\w+) -/;

                ## capture the time in list context; skip lines without
                ## one, since every kept line needs the 12-char key below
                my ($time) = /(\d{2}:\d{2}:\d{2}\.\d{3})/o or next;
                next if exists $opts{'start-time'} && $time lt $opts{'start-time'};
                next if exists $opts{'stop-time'}  && $time gt $opts{'stop-time'};

                s/ {2,}/ /go;
                s/ ?: / /go;
                s/^((?:\S+ ){3}).+?\[?I\]?:/$1/o;
                s/ ACK (\w) / $1 ACK /o;

                warn $_ if $opts{'verbose'} > 3;

                ## prepend the time key now; save reparsing later
                push @lines, $time . $_;
            }
            close $fh;
        }

        ## Sort in-place to save memory shuffling
        ## (if it decides to play ball today)
        @lines = sort @lines;

        substr $_, 0, 12, '' for @lines;    ## Trim the prepended keys
        return \@lines;
    }


      I used the above code with some minor modifications: I removed the day at the beginning of each line so that the time comes first. The lines are then ready to be sorted, and no substr is needed!

      The results are:

      • before (the original code, no tweaks) --> 84 sec
      • new and improved code --> 17 sec

      The biggest gain (more than 40 sec) came from using a while loop on the file handle instead of reading the whole file in list context through grep.
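
      For anyone following along, a minimal sketch of the difference (my own illustration; the file name and pattern are placeholders):

      use strict;
      use warnings;

      # List context slurps every line into memory before filtering:
      #     push @lines, grep { /Running|Dump/ } <$fh>;

      # Reading line by line never holds more than the current line,
      # and rejected lines are discarded before anything is stored:
      open my $fh, '<', 'server.log' or die "server.log: $!";
      my @lines;
      while ( my $line = <$fh> ) {
          next unless $line =~ /Running|Dump/;
          push @lines, $line;
      }
      close $fh;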

      Thanks a lot!!!

        Chopping off the date instead of prepending, sorting, and then removing the time: cool!
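
        A minimal sketch of that trick, assuming lines laid out as 'yyyymmdd hh:mm:ss.mmm ...' (the exact layout is my guess):

        use strict;
        use warnings;

        my @lines = (
            "20090918 10:15:30.456 FromCB something\n",
            "20090918 10:15:30.123 Update something else\n",
        );

        # Chop the leading date off each line; the hh:mm:ss.mmm time
        # now starts the string, so a plain lexical sort orders by
        # time with no key to prepend or trim afterwards.
        s/^\d{8} // for @lines;
        print for sort @lines;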

