Tanktalus has asked for the wisdom of the Perl Monks concerning the following question:

A discussion in the CB earlier today with some monks whose identities I don't all recall (except for one high-ranking regular whose nick I won't bother with, as it's immaterial) raised the question: how do you match "bar" or "baz"? I piped up that, for maintainability reasons, I preferred /bar/ or /baz/. The regular said that he prefers /bar|baz/, and said something about it being faster. Which I thought was odd.

So I went about trying to Benchmark it, and I've been failing miserably to get anything reasonable. The only result I think I can trust is that /bar/ or /baz/ blows /bar|baz/ out of the water. But since I keep getting funny output, with one of the tests being wildly inconsistent, I thought I'd put this to smarter monks than I, hopefully to be shown the error of my ways. I did one more test prior to posting this, and I think I have orthogonal evidence that /bar/ or /baz/ actually is faster than /bar|baz/: use re 'debug' on both, and just see how many steps the regexp engine goes through. For the former, it goes through nearly no steps. For the latter, I lost track. That tells me that the former is faster for constant values. Probably different if it's /$bar/, but one step at a time, please ;-)
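Here is one way to make that comparison concrete (a rough sketch: the re 'debug' trace format varies between perl versions, so treat the line counts as a signal, not a measurement):

```perl
# Run a match under "use re 'debug'" in a child perl and count the
# lines of trace the engine emits. $^X is the currently running perl.
sub debug_trace_lines {
    my ($code) = @_;
    my $out   = qx{$^X -Mre=debug -e "$code" 2>&1};
    my @lines = split /\n/, $out;
    return scalar @lines;
}

my $or_steps  = debug_trace_lines(q{('b' x 250 . 'baz') =~ /bar/ or ('b' x 250 . 'baz') =~ /baz/});
my $alt_steps = debug_trace_lines(q{('b' x 250 . 'baz') =~ /bar|baz/});
print "or: $or_steps trace lines; alt: $alt_steps trace lines\n";
```

Comparing the two trace lengths gives a rough feel for how much work each form makes the engine do, without the timing noise that Benchmark introduces.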

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $pre  = 250;
my $post = 250;

my $baz_match = ('b' x $pre) . 'baz' . ('b' x $post);
my $bar_match = ('b' x $pre) . 'bar' . ('b' x $post);
my $no_match  = ('b' x $pre) . 'bza' . ('b' x $post);

my $i;
cmpthese( 750000, {
    using_or_match    => sub { $i += ($baz_match =~ /bar/ or $baz_match =~ /baz/) ? 1 : 0 },
    using_or_nomatch  => sub { $i += ($no_match  =~ /bar/ or $no_match  =~ /baz/) ? 1 : 0 },
    using_alt_match   => sub { $i += ($baz_match =~ /bar|baz/) ? 1 : 0 },
    using_alt_nomatch => sub { $i += ($no_match  =~ /bar|baz/)  ? 1 : 0 },
});
print $i, $/;

I tried with various numbers: 50_000, 250_000, and, as above, 750_000. The addition of successes was partly to provide evidence that I was matching the right number of times, and partly because I was worried that the perl compiler was optimising away the using_or_* methods. Even then, I suspect that it is optimising something away. For 250_000, I get numbers like this:

(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
                        Rate using_alt_nomatch using_alt_match using_or_nomatch using_or_match
using_alt_nomatch   125628/s                --            -53%             -98%           -99%
using_alt_match     265957/s              112%              --             -97%           -98%
using_or_nomatch   8333333/s             6533%           3033%               --           -33%
using_or_match    12500000/s             9850%           4600%              50%             --
500000

There just has to be something wrong with that using_or_match. When I bump it up to 750_000, that number just gets more ridiculous (although it gets rid of one of the warnings). However, I'm not entirely sure how to counteract it.

Now, I realise that my overall tests are probably suspect, but having run into this, without knowing how to get around it, I'm not even sure how to proceed benchmarking the difference.

Replies are listed 'Best First'.
Re: Benchmarking regex alternation
by diotalevi (Canon) on Jan 30, 2007 at 03:13 UTC

    /literal/ or /other-literal/ is going to be faster than /literal|other-literal/ because the engine can use the Boyer-Moore optimization to search the string. You'll want to look at how much it's able to skip when guessing the start of the match.
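    To illustrate the skipping idea (a toy sketch only; perl's real fixed-string search is a Fast Boyer-Moore implementation in C, not this), here is a Boyer-Moore-Horspool search in Perl that shifts the search window by more than one character on a mismatch:

```perl
# Toy Boyer-Moore-Horspool search: on a mismatch, look at the haystack
# character aligned with the needle's last position and skip as far as
# the "bad character" table allows.
sub bmh_index {
    my ($haystack, $needle) = @_;
    my $n = length $haystack;
    my $m = length $needle;
    return -1 if $m == 0 || $m > $n;

    # Bad-character table: how far we may shift when the character
    # under the needle's last position is $c.
    my %shift;
    $shift{ substr $needle, $_, 1 } = $m - 1 - $_ for 0 .. $m - 2;

    my $i = 0;
    while ($i <= $n - $m) {
        return $i if substr($haystack, $i, $m) eq $needle;
        my $c = substr $haystack, $i + $m - 1, 1;
        $i += exists $shift{$c} ? $shift{$c} : $m;   # skip ahead
    }
    return -1;
}
```

With a needle like 'baz' in a long run of 'b's, the window advances two characters per probe instead of one, which is roughly the saving this family of optimizations buys over a naive scan.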

    Separately, in perl <5.10, the time required to fail an alternation scales linearly with the size of the regexp. This is still true with your explicit /foo/ or /bar/. In 5.10, the alternation uses a trie and it's clearly a winner. I'd like to be able to tell you how it scales but it's less than O(N). To be future compatible, you'll prefer alternation when combining lists of literals and in some cases for some non-literals (I can't enumerate those for you).
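    The trie idea can be sketched in a few lines of Perl (a toy illustration, not the engine's TRIE code): each start position walks the shared structure once, so literal branches with common prefixes are tested together rather than one alternative at a time.

```perl
# Toy trie matcher for a list of literal strings. Shared prefixes are
# stored (and tested) once, which is the heart of 5.10's TRIE optimisation.
sub build_trie {
    my $root = {};
    for my $word (@_) {
        my $node = $root;
        $node = $node->{$_} ||= {} for split //, $word;
        $node->{-end} = 1;    # mark an accepting state
    }
    return $root;
}

sub trie_match {    # true if any word occurs anywhere in $str
    my ($root, $str) = @_;
    for my $start (0 .. length($str) - 1) {
        my $node = $root;
        for my $pos ($start .. length($str) - 1) {
            $node = $node->{ substr $str, $pos, 1 } or last;
            return 1 if $node->{-end};
        }
    }
    return 0;
}
```

For /bar|baz/ the trie has a single shared 'b'→'a' path that forks only at the last character, so a failing position is abandoned after at most three lookups no matter how many 'ba…' words are in the list.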

    The interpolation of variables is a red herring and should be costed separately. See /o is dead, long live qr//! for the scoop on that.
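    For completeness, the usual way to cost the interpolation separately is to build the pattern once with qr// and reuse the compiled regexp (a small sketch; the word list is made up):

```perl
# Build the alternation once, match many times. quotemeta guards
# against metacharacters sneaking in from the word list.
my @words   = qw(bar baz);
my $pattern = join '|', map { quotemeta } @words;
my $re      = qr/$pattern/;    # compiled here, not at every match

my $hits = 0;
for my $line ('bbbarbb', 'bbbzab', 'xbazx') {
    $hits++ if $line =~ $re;   # no per-match recompilation
}
print "$hits\n";               # 2 of the 3 lines contain a word
```

This keeps the pattern-construction cost out of the matching loop entirely, which is the point of the qr// approach over /o.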

    Further, this is much easier to reason about if you forget the idea of benchmarking this stuff. It's all fast enough that you'll need to be especially careful to promote the testable costs above the noise. If you have knowledge of the optimizations in use in each case, you can potentially design your tests around things that do and don't use them.

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

      One would have to benchmark to be sure, but I think that for two plain-text strings /foo/ or /bar/ is going to be faster, with longer strings being even more efficient. But as you scale up the number of alternations, there will come a point where the TRIE/Aho-Corasick optimisation in perl 5.10 should win. In all honesty, I did do some benchmarking to see where you win with alternation, and I was surprised at how many fixed strings you need: around 20, with four-letter words in a 500-byte string made up of lowercase letters. I need to look into why that is and see if it can be improved.
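      A harness along these lines (a sketch: the word list and string are arbitrary, and the crossover point will vary by machine and perl version) is one way to probe where the alternation starts to win:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Twenty 4-character words that cannot occur in a lowercase-letter
# string (they contain digits), so every match fails -- the case
# where the scaling difference shows up.
my @words    = map { sprintf 'wd%02d', $_ } 1 .. 20;
my $str      = join '', map { ('a' .. 'z')[int rand 26] } 1 .. 500;
my $alt_re   = do { my $alt = join '|', @words; qr/$alt/ };
my @word_res = map { qr/\Q$_\E/ } @words;

cmpthese( 20_000, {
    # 20 separate fixed-string matches, short-circuiting on success
    chained_or  => sub { my $m = 0; $m ||= $str =~ $_ for @word_res; $m },
    # one big alternation over the same words
    alternation => sub { $str =~ $alt_re },
});
```

Scaling the `1 .. 20` up and down shows where, on a given perl, the single alternation overtakes the chain of individual matches.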

      And there shouldn't really be much of a scaling factor as you add more words to the trie. Naturally there actually is, due to issues like cache pollution and things like that. To be honest, IMO the trie code is not running as fast as it could. I've been thinking that I made some bad design decisions that, if changed, could make a significant improvement in performance at the expense of some memory. Nevertheless, the difference in how the old engine and new engine handle alternation can be quite dramatic. I haven't made anything worse as far as I can tell, and in most cases it's measurably better.

      Update: Corion asked me to expand on what I meant by bad design decisions. So here goes: one key issue is that I used a compression algorithm for representing the transition table of the trie. The idea is that it gives constant-time lookup without the waste of a normal state-transition table, which in the case of a trie is normally quite high, often nearly worst case at length(@w) * @a elements (@w being the words, and @a being the alphabet). Unfortunately, however, this compression raises the cost per lookup quite a bit, and for small alternations (say, a few short sequences) the end result is quite a hit in per-operation cost for what is in the end a negligible space saving. I should have made a non-compressed form for small tables and avoided the per-operation cost. In most cases it wouldn't make a difference in terms of memory, but it would show a nice speedup.

      The other optimisation concerns character-set issues and Unicode. It's easy to get into the mode of thinking of Unicode as codepoints, and I'm not so sure that I should have. The consequence is that you can end up with an extremely large alphabet. What I should have done is treat utf8 as a byte stream and do some monkey business to deal with non-Unicode high-bit characters. This would have made the per-character price cheaper for Unicode, simplified the logic in the matching code (with a resulting improvement in general run time), and reduced the need for table compression when dealing with large character sets, as the internal alphabet size would never exceed 256.

      ---
      $world=~s/war/peace/g

        Ok, yes, I admit that it's possible to benchmark this stuff, but I do so only with the caveat that the alternatives have to take into account the number of non-regexp opcodes dispatched by the /foo/ or /bar/ form, what version of perl you're on, etc. I expect it to be an especially fiddly bit of business, and I don't trust it unless it comes complete with a perlguts explanation of each part that's being used or not used.

        ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: Benchmarking regex alternation
by GrandFather (Saint) on Jan 30, 2007 at 01:47 UTC

    Looks to me like the result is real. Adding a few more test cases is interesting:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $pre  = 250;
    my $post = 250;

    my $baz_match = ('b' x $pre) . 'baz' . ('b' x $post);
    my $bar_match = ('b' x $pre) . 'bar' . ('b' x $post);
    my $no_match  = ('b' x $pre) . 'bza' . ('b' x $post);

    print "using_or_index    ", using_or_index (),    "\n";
    print "using_or_noindex  ", using_or_noindex (),  "\n";
    print "using_or_match    ", using_or_match (),    "\n";
    print "using_or_nomatch  ", using_or_nomatch (),  "\n";
    print "using_alt_match   ", using_alt_match (),   "\n";
    print "using_alt_nomatch ", using_alt_nomatch (), "\n";
    print "using_set_match   ", using_set_match (),   "\n";
    print "using_set_nomatch ", using_set_nomatch (), "\n";

    cmpthese( -1, {
        using_or_index    => \&using_or_index,
        using_or_noindex  => \&using_or_noindex,
        using_or_match    => \&using_or_match,
        using_or_nomatch  => \&using_or_nomatch,
        using_alt_match   => \&using_alt_match,
        using_alt_nomatch => \&using_alt_nomatch,
        using_set_match   => \&using_set_match,
        using_set_nomatch => \&using_set_nomatch,
    });

    sub using_or_index {
        (-1 != index $baz_match, 'bar') or (-1 != index $baz_match, 'baz') ? 1 : 0
    }
    sub using_or_noindex {
        (-1 != index $no_match, 'bar') or (-1 != index $no_match, 'baz') ? 1 : 0
    }
    sub using_or_match    { ($baz_match =~ /bar/ or $baz_match =~ /baz/) ? 1 : 0 }
    sub using_or_nomatch  { ($no_match  =~ /bar/ or $no_match  =~ /baz/) ? 1 : 0 }
    sub using_alt_match   { ($baz_match =~ /bar|baz/) ? 1 : 0 }
    sub using_alt_nomatch { ($no_match  =~ /bar|baz/) ? 1 : 0 }
    sub using_set_match   { ($baz_match =~ /ba[rz]/)  ? 1 : 0 }
    sub using_set_nomatch { ($no_match  =~ /ba[rz]/)  ? 1 : 0 }

    Prints:

    using_or_index    1
    using_or_noindex  0
    using_or_match    1
    using_or_nomatch  0
    using_alt_match   1
    using_alt_nomatch 0
    using_set_match   1
    using_set_nomatch 0
                          Rate using_alt_nomatch using_alt_match using_or_nomatch using_or_noindex using_set_nomatch using_or_match using_or_index using_set_match
    using_alt_nomatch   9692/s                --            -22%             -99%             -99%              -99%           -99%           -99%            -99%
    using_alt_match    12385/s               28%              --             -98%             -98%              -98%           -98%           -98%            -99%
    using_or_nomatch  667281/s             6785%           5288%               --              -4%               -5%           -17%           -19%            -26%
    using_or_noindex  696846/s             7090%           5527%               4%               --               -1%           -13%           -16%            -22%
    using_set_nomatch 701884/s             7142%           5567%               5%               1%                --           -13%           -15%            -22%
    using_or_match    804828/s             8204%           6399%              21%              15%               15%             --            -2%            -10%
    using_or_index    825420/s             8417%           6565%              24%              18%               18%             3%             --             -8%
    using_set_match   898573/s             9172%           7155%              35%              29%               28%            12%             9%              --

    My guess is that the regex $str =~ /foo/ devolves to something pretty much like index $str, 'foo', which can be pretty quick!

    The surprise is how fast using a character set with a common prefix match is. That result also goes a long way to validating the benchmark.


    DWIM is Perl's answer to Gödel

      the regex $str =~ /foo/ devolves to something pretty much like index $str, 'foo'

      Yes, exactly. It'll be a little slower than an index, but only because there is a longer code path from the regex opcode to the actual FBM search than from the index opcode.

      The surprise is how fast using a character set with a common prefix match is. That result also goes a long way to validating the benchmark.

      And it illustrates a new optimisation in 5.10:

                            Rate using_alt_match using_or_nomatch using_alt_nomatch using_or_match
      using_alt_match   412710/s              --             -29%              -30%           -36%
      using_or_nomatch  580306/s             41%               --               -1%            -9%
      using_alt_nomatch 585631/s             42%               1%                --            -9%
      using_or_match    640118/s             55%              10%                9%             --
      1631660

      Basically 5.10 is smart enough to convert /baz|bar/ into something close to /ba[rz]/. It's not quite as fast when written as an alternation due to current implementation details, but it is very close.
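      The two patterns are logically equivalent, which is easy to confirm by brute force (a quick sanity check, not a benchmark): try every 3-character string over a small alphabet against both.

```perl
# Exhaustively check that /baz|bar/ and /ba[rz]/ accept exactly the
# same 3-character strings over the alphabet {a,b,r,z}. Only the
# engine's internal matching strategy differs, not the language matched.
my @alpha = qw(a b r z);
for my $x (@alpha) {
    for my $y (@alpha) {
        for my $z (@alpha) {
            my $s   = "$x$y$z";
            my $alt = ($s =~ /baz|bar/) ? 1 : 0;
            my $set = ($s =~ /ba[rz]/)  ? 1 : 0;
            die "patterns disagree on '$s'" unless $alt == $set;
        }
    }
}
print "patterns agree on all 64 strings\n";
```

Since the matched language is identical, any speed difference between the two spellings is purely a property of how the engine compiles them.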

      ---
      $world=~s/war/peace/g

        I wonder about /other|bar|baz|other/: with a lot of alternations, does the probability of common prefixes decrease with the number of alternations, or does it try to find common prefixes among close neighbours?

        In the last 2-3 months on p5p, the non-linearity of the regexp engine was mentioned a few times. The problem with respect to the internal utf8 representation is that at a given byte position you cannot really say what the closest "character" position is without knowing a previous correct one (the start, for example, is assumed ok). Is that the only cause of non-linearity? Could you use markers to always keep a correct "last" character position? (Cases with lots of backtracking could benefit from this, no? Or is all that taken care of already in some smart way?)

        Actually, the "keeping all markers" trick reminds me for some reason of packrat parsing. Have you looked at packrat parsing in the context of regexes? (Wikipedia has links on the subject, and to the original Haskell implementation of the algorithm.) The algorithm seems to limit the worst-case time behaviour of pathological NFA (non-POSIX) regexes to something roughly linear in the regex size.

        thanks --stephan

        update: corrected typo if to is