Alternative matches

kiat has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Alternative matches
by PodMaster (Abbot) on Oct 06, 2004 at 10:02 UTC

use Regex::PreSuf;
die presuf(split /\|/, 'new|old|number|start|simple|cross|heavy|die|ex
+it'),$/;
__END__
(?:cross|die|exit|heavy|n(?:ew|umber)|old|s(?:imple|tart))
[download]

use Benchmark 'cmpthese';

my @words = qw[ new old number start simple cross heavy die exit ];
my $string = join ' 0\4/f ', map( { rand $_ } 1 .. 60), map { $words[ 
+rand @words ] } 1 .. 20;

cmpthese( -3, {
    straight => sub {
        my $matches = () = $string =~ /new|old|number|start|simple|cro
+ss|heavy|die|exit/g;
        return;
    },
    optimial => sub {
        my $matches = () = $string =~ /(?:cross|die|exit|heavy|n(?:ew|
+umber)|old|s(?:imple|tart))/g;
        return;
    },
});

__END__
C:\>perl re.bench.pl
           Rate straight optimial
straight 1948/s       --     -15%
optimial 2288/s      17%       --

C:\>perl re.bench.pl
           Rate straight optimial
straight 1994/s       --     -14%
optimial 2322/s      16%       --

C:\>perl re.bench.pl
           Rate straight optimial
straight 1963/s       --     -16%
optimial 2324/s      18%       --
[download]

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.

[reply]
[d/l]
[select]

Re: Alternative matches
by gaal (Parson) on Oct 06, 2004 at 09:16 UTC

(If you're constructing the regexp programmatically, take care to quotemeta the parts.)

[reply]
[d/l]

Re^2: Alternative matches

by kiat (Vicar) on Oct 06, 2004 at 09:22 UTC

Hm...I had the impression it was something to be avoided because the long list of alternatives could slow things down...

[reply]

Re^3: Alternative matches

by gaal (Parson) on Oct 06, 2004 at 09:36 UTC

Of course, if you do need to match many timesm or if the alternate lists grow huge, you will need to optimize things.

[reply]

Re^4: Alternative matches

by kiat (Vicar) on Oct 06, 2004 at 09:44 UTC

Re^3: Alternative matches

by borisz (Canon) on Oct 06, 2004 at 09:27 UTC

Boris

[reply]

Re: Alternative matches
by TedPride (Priest) on Oct 06, 2004 at 09:32 UTC

If you have a lot of terms to search for, you can probably speed things up by creating a lowercase copy of the text and searching it case sensitive.

[reply]

Re^2: Alternative matches

by TedPride (Priest) on Oct 06, 2004 at 09:49 UTC

I tested this just to make sure, and searching a string of letters 25K long 1000 times, there was a savings of approx 2 seconds (21 vs 23) by making a lowercase copy. The same difference could also be seen searching 5K text 5000 times.

[reply]

Re^3: Alternative matches

by kiat (Vicar) on Oct 06, 2004 at 10:54 UTC

Thanks, I shall remember that :)

[reply]

Re^3: Alternative matches

by revdiablo (Prior) on Oct 06, 2004 at 17:07 UTC

I tested this just to make sure . . . there was a savings of approx 2 seconds (21 vs 23)

Usually we like to see Benchmark code to back up this type of assertion. Using a the setup from PodMaster's (i.e. generating a random $string to test against), I get the following results, which seem to validate your results:

$ perl match
       Rate orig   lc
orig 1492/s   --  -7%
lc   1608/s   8%   --

$ perl match
       Rate orig   lc
orig 1506/s   --  -7%
lc   1620/s   8%   --

$ perl match
       Rate orig   lc
orig 1492/s   --  -8%
lc   1627/s   9%   --
[download]

And, in case you want to test this for yourself (or modify it to be more accurate to what you're doing), here's the code:

use strict;
use Benchmark qw(cmpthese);

my $pattern = '(new|old|number|start|simple|cross|heavy|die|exit)';

my @words = qw( new Old Number start Simple Cross heavy die Exit );
my $string = join ' 0\4/f ',
             map( { rand $_ } 1 .. 60),
             map { $words[ rand @words ] }
             1 .. 20;

cmpthese (-2, {
                orig => sub {    $string =~ /$pattern/i },
                lc   => sub { lc $string =~ /$pattern/  },
              }
         );
[download]

[reply]
[d/l]
[select]

Re: Alternative matches
by Jasper (Chaplain) on Oct 06, 2004 at 10:33 UTC

\b($string_of_words)\b

^($string_of_words)$

[reply]
[d/l]
[select]

Re: Alternative matches
by foss_city (Novice) on Oct 06, 2004 at 10:41 UTC

Isn't index() often faster for this sort of fixed-string search? If so, would it be better to chain a series of calls to index() with the || operator? (Except that since you want case-insensitivity, you would want to lc($string) first.) Just pondering . . . .

[reply]

Re^2: Alternative matches

by PodMaster (Abbot) on Oct 06, 2004 at 11:18 UTC

$array[0]=3; $array[1]=4; $array[1]=5;

2) Notice the use of capturing parenthises? It means that kiat wants to know what word he matched. To do that with index gets ugly real fast, for example

my @words = qw[ new old number start simple cross heavy die exit ];
my $string = q~whatever the NEw old you're trying to match~;
my $lcstring = lc $string;
my $ret = 0;
if( do {
        for my $val( @words ){
            my $r = index $lcstring, $val;
            if( $r > -1 ){
                $ret = substr $string, $r, length $val;
                last;
            }
        }
        $ret;
    } ){
    warn "the \$1 (aka ret) is => $ret\n\n";
}
__END__
the $1 (aka ret) is => NEw
[download]

do{}

[reply]
[d/l]
[select]

Re^3: Alternative matches

by foss_city (Novice) on Oct 06, 2004 at 11:45 UTC

    if ( index($string, 'new')    ||
        index($string, 'old')    ||
        index($string, 'number') ||
        # etc.
       ) { # etc. }
[download]

[reply]
[d/l]

Re^4: Alternative matches

by PodMaster (Abbot) on Oct 06, 2004 at 11:57 UTC

Re^5: Alternative matches

by Eimi Metamorphoumai (Deacon) on Oct 06, 2004 at 12:48 UTC

Some notes below your chosen depth have not been shown here

Re^5: Alternative matches

by foss_city (Novice) on Oct 06, 2004 at 12:39 UTC

Some notes below your chosen depth have not been shown here

Re^3: Alternative matches

by Jasper (Chaplain) on Oct 06, 2004 at 12:24 UTC

$r = index $lcstring, $val;;
$ret = substr $string, $r, length $val;
[download]

$ret = $val if -1 < index $lcstring, $val;

[reply]
[d/l]
[select]

Re^4: Alternative matches

by PodMaster (Abbot) on Oct 06, 2004 at 12:29 UTC