in reply to Perl Optimization

Invert your search logic. Instead of treating your hash keys as an array to loop over, use the hash as a hash and benefit from its O(1) lookup:

    my %hash = ...;
    while( $line = <fh_log> ) {
        for( split ' ', $line ) {
            if( exists $hash{ $_ } ) {
                $stats{ $_ }++;
                $total++;
            }
        }
    }

Let's assume your 1 MB file yields 8,000 keys and your 300 MB file contains 2.5 million lines. Your way, you perform 8,000 x 2.5 million = 20 billion regex searches, each of them O(n) in the length of a line from the big file.

If those lines are 128 characters and split into, say, 16 words, this way you perform 2.5 million x 16 = 40 million O(1) hash lookups. That's 1/500 as many operations, and each hash lookup is far cheaper than a regex scan of a whole line, so the saving in work/time is larger still.

If you can construct a regex to extract only likely table name candidates from your lines, and can reduce the 16 to, say, 4, you could get that down to 10 million lookups: 1/2,000, or 0.05%, of the original count.
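As a rough sketch of that candidate-extraction idea: the code below assumes the big file holds SQL-like queries in which a table name directly follows FROM, JOIN, INTO or UPDATE. That assumption, the sample table names, the sample lines and the count_tables helper are all made up for illustration. It will miss comma-separated table lists and will pick up the schema part of a qualified name, so treat it as a starting point only.

```perl
use strict;
use warnings;

# Hypothetical table names; in the real script these come from the small file.
my %hash = map { $_ => 1 } qw( orders customers order_items );

# Count references to known tables, looking only at the word that follows
# FROM / JOIN / INTO / UPDATE (an assumption about the log format).
sub count_tables {
    my ( $known, @lines ) = @_;
    my %stats;
    my $total = 0;
    for my $line ( @lines ) {
        # In list context, m//g with one capture group returns every capture.
        for my $candidate ( $line =~ /\b(?:FROM|JOIN|INTO|UPDATE)\s+(\w+)/gi ) {
            next unless exists $known->{ $candidate };
            $stats{ $candidate }++;
            $total++;
        }
    }
    return ( \%stats, $total );
}

# Hypothetical sample lines standing in for the big log file.
my ( $stats, $total ) = count_tables(
    \%hash,
    'SELECT * FROM orders WHERE id = 42',
    'SELECT c.name FROM customers c JOIN orders o ON o.cid = c.id',
);

printf "%-12s %d\n", $_, $stats->{ $_ } for sort keys %$stats;
print "total: $total\n";
```

The hash lookup stays the same as before; only the number of words fed to it per line shrinks.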

If the casing can vary between the table names in the hash keys and the table names in the big file, as implied by your use of /i, lower (or upper) case both before lookup.
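A minimal sketch of that case folding, with made-up table names and log lines standing in for your two files: fold the keys once when building the hash, and fold each line once before splitting it.

```perl
use strict;
use warnings;

# Hypothetical mixed-case table names standing in for the keys from the small file.
my @table_names = qw( Orders CUSTOMERS order_items );

# Lower-case the keys once, up front.
my %hash = map { ( lc($_) => 1 ) } @table_names;

my %stats;
my $total = 0;

# Hypothetical sample lines standing in for the big log file.
my @lines = (
    'select * from ORDERS',
    'update Customers set x = 1',
    'delete from orders',
);

for my $line ( @lines ) {
    # Lower-case the whole line once, then split it into words.
    for my $word ( split ' ', lc $line ) {
        next unless exists $hash{ $word };
        $stats{ $word }++;
        $total++;
    }
}

printf "%-12s %d\n", $_, $stats{ $_ } for sort keys %stats;
print "total: $total\n";
```

Note that the counts end up keyed by the lower-cased name; if the original spelling is needed for reporting, it could be stored as the hash value instead of 1.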

You should never iterate the keys of a hash when searching, if there is any possibility of not doing so.



Re^2: Perl Optimization
by Chivalri (Initiate) on Aug 11, 2008 at 20:26 UTC
    That's a great idea, which had not occurred to me. I was not in fact using precompiled regexes, but when I switched to them it seemed to slow down my code (from 25-30 secs on my test data set to 3+ minutes). I'll give the search reversal a try and let you know how it turns out. Updated code using precompiled regexes:
    my %htable_re;

    # reading in small file
    while( $cur_line = <size_log> ) {
        my @command = split( /;;;+/, $cur_line );
        $htable_re{ $command[0] } = qr/\b($command[0])\b/i;
        # parse remainder of line...
    }

    # pattern match against big file
    while( <fh_log> ) {
        foreach $key ( keys %htable_re ) {
            if( /$htable_re{$key}/ ) {    # if this query contains a reference to this table
                $table_stats{$key}++;     # count how many times this table is referenced
                $table_refs++;
            }
        }
    }