in reply to regex is too long

I don't think that index must be slower than a regex. It seems to me that it should be much faster as it should be a much simpler function, and regexen don't do bulk alternation very quickly either. Of course you may be able to represent 100 simple text strings in fewer than 50 sub-expressions in a hand-optimized regex, so it all depends on how much work you want to have to put into this, and how often you're going to change the list. Perhaps a benchmark is in order here. Now I expect the relative speeds will vary quite a bit as the size of the string to search, the number of words in the match list, and the frequency of matches changes, but this may tell us something nonetheless (or not ;-).

It looks like index is the winner in this simple test, but I trust that wiser monks will let me know if and how I'm producing bad results!

First results when there is no match found:

Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10 + CPU seconds... BigRE: 11 wallclock secs (10.56 usr + 0.00 sys = 10.56 CPU) @ 11 +690.90/s (n=123491) Index: 11 wallclock secs (10.61 usr + 0.00 sys = 10.61 CPU) @ 43 +921.29/s (n=465961) OptRE: 11 wallclock secs (10.59 usr + 0.00 sys = 10.59 CPU) @ 17 +239.59/s (n=182619) m// loop: 11 wallclock secs (10.50 usr + 0.00 sys = 10.50 CPU) @ 68 +25.71/s (n=71670)
Then results when there is always a match found:
Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10 + CPU seconds... BigRE: 11 wallclock secs (10.39 usr + 0.00 sys = 10.39 CPU) @ 12 +548.26/s (n=130389) Index: 10 wallclock secs (10.53 usr + 0.00 sys = 10.53 CPU) @ 56 +956.89/s (n=599813) OptRE: 11 wallclock secs (10.53 usr + 0.00 sys = 10.53 CPU) @ 17 +720.63/s (n=186616) m// loop: 11 wallclock secs (10.48 usr + 0.00 sys = 10.48 CPU) @ 78 +73.90/s (n=82550)

Update: Damn, I knew I was missing something (probably still am...) - thanks dws! I don't keep spam around so here's the results for a chunk of 12k of text-only e-mail (as spam tends to be long-winded). I skipped the hand-optimized regex as I used nonsense words in the next case to ensure no match would be found and I didn't feel like thinking about it too hard. In both the following cases there are still just 7 words in the match list.

So, for no match:

Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10 + CPU seconds... BigRE: 10 wallclock secs (10.59 usr + 0.00 sys = 10.59 CPU) @ 77 +.59/s (n=822) Index: 11 wallclock secs (10.50 usr + 0.00 sys = 10.50 CPU) @ 14 +27.58/s (n=14991) m// loop: 11 wallclock secs (10.47 usr + 0.00 sys = 10.47 CPU) @ 68 +6.50/s (n=7187)
With matches:
Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10 + CPU seconds... BigRE: 10 wallclock secs (10.63 usr + 0.00 sys = 10.63 CPU) @ 20 +25.69/s (n=21523) Index: 11 wallclock secs (10.47 usr + 0.00 sys = 10.47 CPU) @ 19 +019.11/s (n=199092) OptRE: 11 wallclock secs (10.51 usr + 0.00 sys = 10.51 CPU) @ 27 +02.33/s (n=28415) m// loop: 11 wallclock secs (10.45 usr + 0.00 sys = 10.45 CPU) @ 95 +93.84/s (n=100294)

And the test code:

use strict; use warnings; use Benchmark; my @matchlist = qw( this that these those some other too); my $string = 'here we have a typical sentance which has a match near t +he othr end'; my $bigregex = join '|', map quotemeta($_), @matchlist; my $byhandre = 'th(is|at|ese|ose)|some|other|too'; timethese(shift, { 'Index', 'SimpleIndex', 'm// loop', 'Match', 'BigRE', 'BigRE', 'OptRE', 'ByHand', } ); sub SimpleIndex { for (@matchlist) { return 1 if(index($string, $_) != $[-1 ); } 0; } sub Match { for (@matchlist) { return 1 if( $string =~ /$_/io ); } 0; } sub BigRE { return 1 if( $string =~ /$bigregex/io ); 0; } sub ByHand { return 1 if( $string =~ /$byhandre/io ); 0; }

--
I'd like to be able to assign to an luser

Replies are listed 'Best First'.
Re: Re: regex is too long
by dws (Chancellor) on May 08, 2001 at 22:21 UTC
    How do the methods compare if the target string is longer? Say, about the size of an average piece of SPAM.

      I've thought about this... To make this work I'll have to chunk the file because I don't want to keep the whole email in memory -- I'd rather stay lean. I could chunk the thing into 100 line pieces, but this makes the code messier.