in reply to Unefficient Regexp 'Matching this or that'

You might consider using separate passes for each string to be found (or a loop over them). The following, with values suited to a large file I had kicking around, ran substantially faster than the alernation regex:

sub two_pass { my @matched; open (my $FH, "<$file"); while (my $rec = <$FH>) { if ( $rec =~ '500_000' ) { push @matched, $rec; } if ( $rec =~ '500_001' ) { push @matched, $rec; } } print "two pass: " . @matched; close $FH; }

Actually, it consistently runs faster than the single string sub, even if I force it to run first, last or in the middle. Whatever position, the timings of all three methods remain very consistent I don't have an explanation for that? I've noticed in the past that with the 5.10+ trie optimisation, alternation regexes are sometimes slower than they are under 5.8. There seems to be a "sweet spot" where it comes into its own: neither too few nor too many alternations and it flies, but below or above those limits and it actually takes longer.

For example, on 5.10, I get these (quite surprising) results consistently:

c:\test>junk31 836952.log Benchmark: timing 1 iterations of AAtwo_pass, BBtwo_string, CCone_string ... two pass: 10 AAtwo_pass: 6 wallclock secs ( 5.69 usr + 0.78 sys = 6.47 CPU) @ 0 +.15/s (n=1) (warning: too few iterations for a reliable count) two string: 10 BBtwo_string: 15 wallclock secs (14.34 usr + 0.73 sys = 15.07 CPU) @ + 0.07/s (n=1) (warning: too few iterations for a reliable count) one string: 7 CCone_string: 15 wallclock secs (14.76 usr + 0.76 sys = 15.52 CPU) @ + 0.06/s (n=1) (warning: too few iterations for a reliable count)

But on 5.8 I get these rather more understandable results:

c:\test>\perl32\bin\perl junk31.pl 836952.log Benchmark: timing 1 iterations of AAtwo_pass, BBtwo_string, CCone_string ... two pass: 10 AAtwo_pass: 8 wallclock secs ( 6.60 usr + 0.69 sys = 7.29 CPU) @ 0 +.14/s (n=1) (warning: too few iterations for a reliable count) two string: 10 BBtwo_string: 53 wallclock secs (52.99 usr + 0.70 sys = 53.70 CPU) @ + 0.02/s (n=1) (warning: too few iterations for a reliable count) one string: 7 CCone_string: 7 wallclock secs ( 5.60 usr + 0.92 sys = 6.52 CPU) @ + 0.15/s (n=1) (warning: too few iterations for a reliable count)

Maybe it shoudl be possible to disable the trie optimisation for thos cases of alternation where it doesn't benefit.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Replies are listed 'Best First'.
Re^2: Unefficient Regexp 'Matching this or that'
by moritz (Cardinal) on May 19, 2010 at 10:28 UTC
    Maybe it shoudl be possible to disable the trie optimisation for thos cases of alternation where it doesn't benefit.
    Maybe you can disable trie optimizations by setting ${^RE_TRIE_MAXBUF} = 0?
    Perl 6 - links to (nearly) everything that is Perl 6.

      That seems to work. !5 seconds instead of 53 for the alternation is much more reasonable. Still doesn't beat the two pass method, but there is probably a break point somewhere.

      c:\test>junk31 836952.log Benchmark: timing 1 iterations of AAtwo_pass, BBtwo_string, CCone_string ... two pass: 10 AAtwo_pass: 7 wallclock secs ( 5.60 usr + 0.83 sys = 6.43 CPU) @ 0 +.16/s (n=1) (warning: too few iterations for a reliable count) two string: 10 BBtwo_string: 15 wallclock secs (14.35 usr + 0.94 sys = 15.29 CPU) @ + 0.07/s (n=1) (warning: too few iterations for a reliable count) one string: 7 CCone_string: 16 wallclock secs (14.76 usr + 0.73 sys = 15.49 CPU) @ + 0.06/s (n=1) (warning: too few iterations for a reliable count)

      I'm still puzzled why the two pass takes less time than the one string, but the numbers are right and I can't anything major wrong with the benchmark?