Chivalri has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I posted this question on Perl Guru, and someone suggested I might want to try here instead for a potentially better response.

I have written some Perl code dealing with large amounts of data. The script reads one line at a time from a flat file, then searches the line for any reference to information from another flat file. This creates a nested loop, and when one flat file is 300 MB+ and the other is 1 MB, it takes a while to run. Any suggestions for improving the performance of this code?

Thanks!

The offending code:
#%hash has been filled with some meta data in an array read in
#from the smaller flat file
#The first element is the name we are searching for.
#the rest are irrelevant to this loop
#
#fh_log is a file handle to the large 300mb+ flat file
#each line may contain references to 1 or more table names, and
#I need to capture all of them.
while($cur_line=<fh_log>)
{
   foreach $key (keys %hash)
   {
      $table = $hash{$key}[0];

      #If this query contains reference to this table
      if($cur_line=~ m/$table/i)
      {
         #count up how many times this table is referenced.
         $table_stats{$table}++;
         $table_refs++;    #total table references (for % calc)
      }
   }
}

Replies are listed 'Best First'.
Re: Perl Optimization
by BrowserUk (Patriarch) on Aug 09, 2008 at 00:45 UTC

    Invert your search logic. Instead of using your hash keys as an array, use the hash as a hash and benefit from its O(1) lookup abilities:

    my %hash = ...;
    while( $line = <fh_log> ) {
        for( split ' ', $line ) {
            if( exists $hash{ $_ } ) {
                $stats{ $_ }++;
                $total++;
            }
        }
    }

    Let's assume your 1 MB file results in 8,000 keys, and your 300 MB file contains 2.5 million lines. Your way, you perform 20 billion regex searches, each O(n) in the length of the line.

    If those lines are 128 characters and split into, say, 16 words, this way you perform 40 million O(1) hash lookups. That's (much) less than 1/500th the amount of work/time.

    If you can construct a regex to extract only likely tablename candidates from your lines, and can reduce the 16 to, say, 4, you could get that down to 1/2,000, or 0.05%.

    If the casing can vary between the table names in the hash keys and the table names in the big file, as implied by your use of /i, lower (or upper) case both before lookup.
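Putting those points together (splitting on whitespace, hashing, and lowercasing both sides before lookup), here is a minimal runnable sketch; the table names and input lines are made up for illustration:

```perl
use strict;
use warnings;

# Build the lookup hash from the (hypothetical) table names, lowercased once.
my %hash = map { lc($_) => 1 } qw(Employees ORDERS Invoices);

my ( %stats, $total );
while ( my $line = <DATA> ) {
    for my $word ( split ' ', $line ) {
        # Lowercase at lookup time too, so casing in the log doesn't matter.
        if ( exists $hash{ lc $word } ) {
            $stats{ lc $word }++;
            $total++;
        }
    }
}

printf "%s: %d\n", $_, $stats{$_} for sort keys %stats;

__DATA__
SELECT * FROM employees JOIN orders
UPDATE Employees SET name = 'x'
```

With this input, `employees` is counted twice and `orders` once, regardless of how either side was cased.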

    You should never iterate over the keys of a hash in order to search it, if there is any possibility of avoiding it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      That's a great idea which hadn't occurred to me. I was not in fact using precompiled regexes, but when I switched to them my code actually slowed down (from 25-30 seconds on my test data set to 3+ minutes). I'll give the search reversal a try and let you know how it turns out. Updated code using a precompiled regex:
      my %htable_re;

      #reading in the small file
      while($cur_line=<size_log>)
      {
          my @command=split(/;;;+/,$cur_line);
          $htable_re{$command[0]}=qr/\b($command[0])\b/i;
          #parse remainder of line...
      }

      #pattern match against the big file
      while(<fh_log>)
      {
          foreach $key (keys %htable_re)
          {
              #If this query contains reference to this table
              if(/$htable_re{$key}/)
              {
                  #count up how many times this table is referenced.
                  $table_stats{$key}++;
                  $table_refs++;
              }
          }
      }
Re: Perl Optimization
by ikegami (Patriarch) on Aug 09, 2008 at 00:22 UTC

    Is $table a compiled regexp? If it's not, you're compiling regexps over and over again for nothing. In fact, there's no reason to have multiple regexps.

    use Regexp::List qw( );

    my $re = Regexp::List->new(modifiers => 'i')
                         ->list2re( map lc($hash{$_}[0]), keys %hash );

    # or:
    # my ($re) = map qr/$_/i,
    #            join '|',
    #            map quotemeta(lc($hash{$_}[0])),
    #            keys %hash;

    while (my $cur_line = <$fh_log>) {
        while ($cur_line =~ /($re)/g) {
            my $table = lc($1);
            if ($tables{$table}) {
                # Count up how many times this table is referenced.
                $table_stats{$table}++;
                $table_refs++;
            }
        }
    }

    An alternative solution would be to extract the words in $cur_line and look for them in a hash.

    my %tables = map { lc($hash{$_}[0]) => 1 } keys %hash;

    while (my $cur_line = <$fh_log>) {
        while ($cur_line =~ /(\w+)/g) {
            my $table = lc($1);
            if ($tables{$table}) {
                # Count up how many times this table is referenced.
                $table_stats{$table}++;
                $table_refs++;
            }
        }
    }

    I suspect the first method is better.
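One way to test that suspicion on your own data is Perl's core Benchmark module. The sketch below compares the two approaches; the table names, line contents, and counts are all hypothetical, and the subs are simplified to return their counts so the results can be cross-checked:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical data: 1000 table names, 1000 identical log lines.
my @tables = map { "table$_" } 1 .. 1000;
my %tables = map { $_ => 1 } @tables;
my $alt    = join '|', map quotemeta, @tables;
my $re     = qr/\b($alt)\b/;

my @lines = ("select * from table42 join table999") x 1000;

# Method 1: one big alternation regex swept across each line.
sub by_regex {
    my %stats;
    for my $line (@lines) {
        $stats{$1}++ while $line =~ /$re/g;
    }
    return \%stats;
}

# Method 2: split each line into words and do O(1) hash lookups.
sub by_hash {
    my %stats;
    for my $line (@lines) {
        $tables{$_} and $stats{$_}++ for split ' ', $line;
    }
    return \%stats;
}

cmpthese( 10, { regex => \&by_regex, hash => \&by_hash } );
```

Both subs should produce identical counts; which one wins on speed depends on the Perl version (5.10's trie optimization helps the regex) and on how table-dense the lines are.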

Re: Perl Optimization
by GrandFather (Saint) on Aug 09, 2008 at 00:09 UTC

    Using a single precompiled regex may help:

    my $matchStr = join '|', map {$hash{$_}[0]} keys %hash;
    my $match = qr/($matchStr)/i;

    while (my $cur_line = <fh_log>) {
        #If this query contains reference to this table
        next if $cur_line !~ $match;

        #count up how many times this table is referenced.
        $table_stats{$1}++;
        $table_refs++;    #total table references (for % calc)
    }

    Perl reduces RSI - it saves typing

      Oops, I took (way) too long to write my post. But there are a couple of bugs in yours.

      1. $table_stats{$1}++; should be $table_stats{lc($1)}++;

      2. You only count one table per line, whereas the OP counts every table per line.

      Also, I used Regexp::List to improve performance when there are similar table names (fairly likely). That feature is already built into Perl 5.10, though.

Re: Perl Optimization
by jmcada (Acolyte) on Aug 09, 2008 at 00:24 UTC
    If you only care about the first match, try exiting the loop as soon as you find the match, that should save time. Also, if you know which keys will be found most often, you might want to try to match them first instead of relying on the return order of the 'keys' function.
    while($cur_line=<fh_log>) 
    { 
       foreach $key (keys %hash) 
       { 
          $table = $hash{$key}[0]; 
    
          #If this query contains reference to this table 
          if($cur_line=~ m/$table/i) 
          {
              #count up how many times this table is referenced.
              $table_stats{$table}++;
              $table_refs++;    #total table references (for % calc)
              last;             #added to exit inner loop upon match
          }
       }
    }
    
Re: Perl Optimization
by jethro (Monsignor) on Aug 09, 2008 at 00:29 UTC

    One optimization depends on what $table and the lines in fh_log look like. If you can pre-sort the strings you are looking for into a fixed number of buckets, you can check $cur_line for which buckets could possibly match it, and so avoid matching against all the unsuitable buckets.

    As an example, let's assume that $table always holds 10-digit numbers and each fh_log line consists of text with a few of those 10-digit numbers in between.

    Then you could put all the numbers beginning with 00 into bucket 00, all the numbers beginning with 01 into bucket 01, and so on. Then you just need to extract the numbers from every line and check them only against the numbers in the suitable bucket.

    The same optimization can be done if you always look for whole words, or for sequences that always contain a '-' (use the characters before and after the '-'), and in many other situations; you just need some structure in your data you can take advantage of.
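A minimal Perl sketch of the bucket idea, assuming (as in the example above) that the identifiers are 10-digit numbers; the table identifiers and log lines here are made up:

```perl
use strict;
use warnings;

# Hypothetical table identifiers: 10-digit numbers.
my @tables = qw(0012345678 0155554444 9912341234);

# Sort them into buckets keyed on the first two digits.
my %bucket;
push @{ $bucket{ substr $_, 0, 2 } }, $_ for @tables;

my %stats;
while ( my $line = <DATA> ) {
    # Extract every 10-digit number, then check only its own bucket.
    while ( $line =~ /\b(\d{10})\b/g ) {
        my $num        = $1;
        my $candidates = $bucket{ substr $num, 0, 2 } or next;
        $stats{$num}++ if grep { $_ eq $num } @$candidates;
    }
}

printf "%s: %d\n", $_, $stats{$_} for sort keys %stats;

__DATA__
log line with 0012345678 and 9912341234
another line with 0099999999 which is not a table
```

Each extracted number is compared only against the handful of candidates sharing its prefix, instead of against the whole table list; 0099999999 lands in bucket 00 but matches nothing there, so it is never counted.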