Chivalri has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I posted this question on Perl Guru, and someone suggested I might want to try here instead for a potentially better response.

I have written some Perl code dealing with large amounts of data. The script reads one line at a time from a flat file, then searches the line for any reference to information from another flat file. This creates a nested loop, and when one flat file is 300 MB+ and the other is 1 MB, it takes a while to run. Any suggestions for improving the performance of this code?

Thanks!

The offending code:
#%hash has been filled with some meta data in an array read in
#from the smaller flat file
#The first element is the name we are searching for.
#the rest are irrelevant to this loop
#
#fh_log is a file handle to the large 300mb+ flat file
#each line may contain references to 1 or more table names, and
#I need to capture all of them.
while($cur_line=<fh_log>)
{
   foreach $key (keys %hash)
   {
      $table = $hash{$key}[0];

      #If this query contains reference to this table
      if($cur_line=~ m/$table/i)
      {
         #count up how many times this table is referenced.
         $table_stats{$table}++;
         $table_refs++;    #total table references (for % calc)
      }
   }
}

Replies are listed 'Best First'.
Re: Perl Optimization
by BrowserUk (Patriarch) on Aug 09, 2008 at 00:45 UTC

    Invert your search logic. Instead of using your hash keys as an array, use the hash as a hash and benefit from its O(1) lookup abilities:

    my %hash = ...;
    while( $line = <fh_log> ) {
        for( split ' ', $line ) {
            if( exists $hash{ $_ } ) {
                $stats{ $_ }++;
                $total++;
            }
        }
    }

    Let's assume your 1 MB file results in 8,000 keys, and your 300 MB file contains 2.5 million lines. Your way, you perform 20 billion regex searches, each O(n) in the length of the line.

    If those lines are 128 characters and split into, say, 16 words, this way you perform 40 million O(1) hash lookups. That's (much) less than 1/500th the amount of work/time.

    If you can construct a regex to extract only likely tablename candidates from your lines, and can reduce the 16 to, say, 4, you could get that down to 1/2,000, or 0.05%.

    If the casing can vary between the table names in the hash keys and the table names in the big file, as implied by your use of /i, lower (or upper) case both before lookup.
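Putting those points together (splitting on whitespace, hashing, and lowercasing both sides before lookup), here is a minimal runnable sketch; the table names and input lines are made up for illustration:

```perl
use strict;
use warnings;

# Build the lookup hash from the (hypothetical) table names, lowercased once.
my %hash = map { lc($_) => 1 } qw(Employees ORDERS Invoices);

my ( %stats, $total );
while ( my $line = <DATA> ) {
    for my $word ( split ' ', $line ) {
        # Lowercase at lookup time too, so casing in the log doesn't matter.
        if ( exists $hash{ lc $word } ) {
            $stats{ lc $word }++;
            $total++;
        }
    }
}

printf "%s: %d\n", $_, $stats{$_} for sort keys %stats;

__DATA__
SELECT * FROM employees JOIN orders
UPDATE Employees SET name = 'x'
```

With this input, `employees` is counted twice and `orders` once, regardless of how either side was cased.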

    You should never iterate over the keys of a hash in order to search it, if there is any possibility of avoiding it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      That's a great idea which hadn't occurred to me. I was not in fact using precompiled regexes, but when I switched to them my code actually slowed down (from 25-30 seconds on my test data set to 3+ minutes). I'll give the search reversal a try and let you know how it turns out. Updated code using a precompiled regex:
      my %htable_re;

      #reading in the small file
      while($cur_line=<size_log>)
      {
          my @command=split(/;;;+/,$cur_line);
          $htable_re{$command[0]}=qr/\b($command[0])\b/i;
          #parse remainder of line...
      }

      #pattern match against the big file
      while(<fh_log>)
      {
          foreach $key (keys %htable_re)
          {
              #If this query contains reference to this table
              if(/$htable_re{$key}/)
              {
                  #count up how many times this table is referenced.
                  $table_stats{$key}++;
                  $table_refs++;
              }
          }
      }
Re: Perl Optimization
by ikegami (Patriarch) on Aug 09, 2008 at 00:22 UTC

    Is $table a compiled regexp? If it's not, you're compiling regexps over and over again for nothing. In fact, there's no reason to have multiple regexps.

    use Regexp::List qw( );

    my $re = Regexp::List->new(modifiers => 'i')
                         ->list2re( map lc($hash{$_}[0]), keys %hash );

    # or:
    # my ($re) = map qr/$_/i,
    #            join '|',
    #            map quotemeta(lc($hash{$_}[0])),
    #            keys %hash;

    while (my $cur_line = <$fh_log>) {
        while ($cur_line =~ /($re)/g) {
            my $table = lc($1);
            if ($tables{$table}) {
                # Count up how many times this table is referenced.
                $table_stats{$table}++;
                $table_refs++;
            }
        }
    }

    An alternative solution would be to extract the words in $cur_line and look for them in a hash.

    my %tables = map { lc($hash{$_}[0]) => 1 } keys %hash;

    while (my $cur_line = <$fh_log>) {
        while ($cur_line =~ /(\w+)/g) {
            my $table = lc($1);
            if ($tables{$table}) {
                # Count up how many times this table is referenced.
                $table_stats{$table}++;
                $table_refs++;
            }
        }
    }

    I suspect the first method is better.
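One way to test that suspicion on your own data is Perl's core Benchmark module. The sketch below compares the two approaches; the table names, line contents, and counts are all hypothetical, and the subs are simplified to return their counts so the results can be cross-checked:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical data: 1000 table names, 1000 identical log lines.
my @tables = map { "table$_" } 1 .. 1000;
my %tables = map { $_ => 1 } @tables;
my $alt    = join '|', map quotemeta, @tables;
my $re     = qr/\b($alt)\b/;

my @lines = ("select * from table42 join table999") x 1000;

# Method 1: one big alternation regex swept across each line.
sub by_regex {
    my %stats;
    for my $line (@lines) {
        $stats{$1}++ while $line =~ /$re/g;
    }
    return \%stats;
}

# Method 2: split each line into words and do O(1) hash lookups.
sub by_hash {
    my %stats;
    for my $line (@lines) {
        $tables{$_} and $stats{$_}++ for split ' ', $line;
    }
    return \%stats;
}

cmpthese( 10, { regex => \&by_regex, hash => \&by_hash } );
```

Both subs should produce identical counts; which one wins on speed depends on the Perl version (5.10's trie optimization helps the regex) and on how table-dense the lines are.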

Re: Perl Optimization
by GrandFather (Saint) on Aug 09, 2008 at 00:09 UTC

    Using a single precompiled regex may help:

    my $matchStr = join '|', map {$hash{$_}[0]} keys %hash;
    my $match = qr/($matchStr)/i;

    while (my $cur_line = <fh_log>) {
        #If this query contains reference to this table
        next if $cur_line !~ $match;

        #count up how many times this table is referenced.
        $table_stats{$1}++;
        $table_refs++;    #total table references (for % calc)
    }

    Perl reduces RSI - it saves typing

      Oops, I took (way) too long to write my post. But there are a couple of bugs in yours.

      1. $table_stats{$1}++; should be $table_stats{lc($1)}++;

      2. You only count one table per line, whereas the OP counts every table per line.

      Also, I used Regexp::List to improve performance when there are similar table names (fairly likely). That feature is already built into Perl 5.10, though.

Re: Perl Optimization
by jmcada (Acolyte) on Aug 09, 2008 at 00:24 UTC
    If you only care about the first match, try exiting the loop as soon as you find the match, that should save time. Also, if you know which keys will be found most often, you might want to try to match them first instead of relying on the return order of the 'keys' function.
    while($cur_line=<fh_log>) 
    { 
       foreach $key (keys %hash) 
       { 
          $table = $hash{$key}[0]; 
    
          #If this query contains reference to this table 
          if($cur_line=~ m/$table/i) 
          {
              #count up how many times this table is referenced.
              $table_stats{$table}++;
              $table_refs++;    #total table references (for % calc)
              last;             #added to exit inner loop upon match
          }
       }
    }
    
Re: Perl Optimization
by jethro (Monsignor) on Aug 09, 2008 at 00:29 UTC

    One optimization depends on what $table and the lines in fh_log look like. If you can pre-sort the strings you are looking for into a fixed number of buckets, you can check $cur_line for which buckets could possibly match it, and so avoid matching against all the unsuitable buckets.

    As an example, let's assume that $table always holds 10-digit numbers and each fh_log line consists of text with a few of those 10-digit numbers in between.

    Then you could put all the numbers beginning with 00 into bucket 00, all the numbers beginning with 01 into bucket 01, and so on. Then you just need to extract the numbers from every line and check them only against the numbers in the suitable bucket.

    The same optimization can be done if you always look for whole words, or for sequences that always contain a '-' (use the characters before and after the '-'), and in many other situations; you just need some structure in your data you can take advantage of.
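A minimal Perl sketch of the bucket idea, assuming (as in the example above) that the identifiers are 10-digit numbers; the table identifiers and log lines here are made up:

```perl
use strict;
use warnings;

# Hypothetical table identifiers: 10-digit numbers.
my @tables = qw(0012345678 0155554444 9912341234);

# Sort them into buckets keyed on the first two digits.
my %bucket;
push @{ $bucket{ substr $_, 0, 2 } }, $_ for @tables;

my %stats;
while ( my $line = <DATA> ) {
    # Extract every 10-digit number, then check only its own bucket.
    while ( $line =~ /\b(\d{10})\b/g ) {
        my $num        = $1;
        my $candidates = $bucket{ substr $num, 0, 2 } or next;
        $stats{$num}++ if grep { $_ eq $num } @$candidates;
    }
}

printf "%s: %d\n", $_, $stats{$_} for sort keys %stats;

__DATA__
log line with 0012345678 and 9912341234
another line with 0099999999 which is not a table
```

Each extracted number is compared only against the handful of candidates sharing its prefix, instead of against the whole table list; 0099999999 lands in bucket 00 but matches nothing there, so it is never counted.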