kiat has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

What's bad with a regex that has lots of alternative match patterns like the one below:

if ($string =~ /(new|old|number|start|simple|cross|heavy|die|exit)/i)
And relatedly, what is the more elegant way of accomplishing that sort of things?

Thanks in advance :)

Replies are listed 'Best First'.
Re: Alternative matches
by PodMaster (Abbot) on Oct 06, 2004 at 10:02 UTC
    Nothing "bad", but there is room for improvement.
    use Regex::PreSuf; die presuf(split /\|/, 'new|old|number|start|simple|cross|heavy|die|ex +it'),$/; __END__ (?:cross|die|exit|heavy|n(?:ew|umber)|old|s(?:imple|tart))
    A simple benchmark
    use Benchmark 'cmpthese'; my @words = qw[ new old number start simple cross heavy die exit ]; my $string = join ' 0\4/f ', map( { rand $_ } 1 .. 60), map { $words[ +rand @words ] } 1 .. 20; cmpthese( -3, { straight => sub { my $matches = () = $string =~ /new|old|number|start|simple|cro +ss|heavy|die|exit/g; return; }, optimial => sub { my $matches = () = $string =~ /(?:cross|die|exit|heavy|n(?:ew| +umber)|old|s(?:imple|tart))/g; return; }, }); __END__ C:\>perl re.bench.pl Rate straight optimial straight 1948/s -- -15% optimial 2288/s 17% -- C:\>perl re.bench.pl Rate straight optimial straight 1994/s -- -14% optimial 2322/s 16% -- C:\>perl re.bench.pl Rate straight optimial straight 1963/s -- -16% optimial 2324/s 18% --

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: Alternative matches
by gaal (Parson) on Oct 06, 2004 at 09:16 UTC
    Who said it's bad?

    (If you're constructing the regexp programmatically, take care to quotemeta the parts.)

      Hm...I had the impression it was something to be avoided because the long list of alternatives could slow things down...
        There are probably more and less efficient possible orderings of all the elements in your set, and very probably you can transform the regexp into something with shared prefixes and stuff, but to do that correctly you also need to know/make assumptions about the distribution of your input. For general purpose programming and in most cases, just assembling something like what's in your original post is fine.

        Of course, if you do need to match many timesm or if the alternate lists grow huge, you will need to optimize things.

        Last time I checked, the list was a _lot_ faster than searching every item alone.
        Boris
Re: Alternative matches
by TedPride (Priest) on Oct 06, 2004 at 09:32 UTC
    If you have a lot of terms to search for, you can probably speed things up by creating a lowercase copy of the text and searching it case sensitive.
      I tested this just to make sure, and searching a string of letters 25K long 1000 times, there was a savings of approx 2 seconds (21 vs 23) by making a lowercase copy. The same difference could also be seen searching 5K text 5000 times.
        Thanks, I shall remember that :)
        I tested this just to make sure . . . there was a savings of approx 2 seconds (21 vs 23)

        Usually we like to see Benchmark code to back up this type of assertion. Using a the setup from PodMaster's (i.e. generating a random $string to test against), I get the following results, which seem to validate your results:

        $ perl match Rate orig lc orig 1492/s -- -7% lc 1608/s 8% -- $ perl match Rate orig lc orig 1506/s -- -7% lc 1620/s 8% -- $ perl match Rate orig lc orig 1492/s -- -8% lc 1627/s 9% --

        And, in case you want to test this for yourself (or modify it to be more accurate to what you're doing), here's the code:

        use strict; use Benchmark qw(cmpthese); my $pattern = '(new|old|number|start|simple|cross|heavy|die|exit)'; my @words = qw( new Old Number start Simple Cross heavy die Exit ); my $string = join ' 0\4/f ', map( { rand $_ } 1 .. 60), map { $words[ rand @words ] } 1 .. 20; cmpthese (-2, { orig => sub { $string =~ /$pattern/i }, lc => sub { lc $string =~ /$pattern/ }, } );
Re: Alternative matches
by Jasper (Chaplain) on Oct 06, 2004 at 10:33 UTC
    I've used Regexp::Optimizer in the past for long lists to do the character class thing like above, and it works fine. Unless you're using _very_ long lists to match, in which case constructing the regexp can take a long time (with some deep recursion thrown in).

    One other thing to note is that for the simple complete word case you give, it's probably a good idea to anchor the words with \b($string_of_words)\b or ^($string_of_words)$.
Re: Alternative matches
by foss_city (Novice) on Oct 06, 2004 at 10:41 UTC
    Isn't index() often faster for this sort of fixed-string search? If so, would it be better to chain a series of calls to index() with the || operator? (Except that since you want case-insensitivity, you would want to lc($string) first.) Just pondering . . . .
      No.
      1) You wouldn't want to chain so many calls to index (its like writing $array[0]=3; $array[1]=4; $array[1]=5;)

      2) Notice the use of capturing parenthises? It means that kiat wants to know what word he matched. To do that with index gets ugly real fast, for example

      my @words = qw[ new old number start simple cross heavy die exit ]; my $string = q~whatever the NEw old you're trying to match~; my $lcstring = lc $string; my $ret = 0; if( do { for my $val( @words ){ my $r = index $lcstring, $val; if( $r > -1 ){ $ret = substr $string, $r, length $val; last; } } $ret; } ){ warn "the \$1 (aka ret) is => $ret\n\n"; } __END__ the $1 (aka ret) is => NEw
      Granted you could turn that do{} into a function, but I don't think its worth the effort (even if there may be a slight speed boost).

      MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
      I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
      ** The third rule of perl club is a statement of fact: pod is sexy.

        Oops. i missed the capturing parentheses . . . . *blushes* However, outside of that, i'm not clear on why one wouldn't want to do something like:
        if ( index($string, 'new') || index($string, 'old') || index($string, 'number') || # etc. ) { # etc. }
        except that it does seem ugly to me, now i look at it. :-) Could you elaborate, please?
        Err...

        $r = index $lcstring, $val;; $ret = substr $string, $r, length $val;
        Seems like a very obfuscated way of saying $ret = $val if -1 < index $lcstring, $val;
        No?