comment on

I don't think that index must be slower than a regex. It seems to me that it should be much faster as it should be a much simpler function, and regexen don't do bulk alternation very quickly either. Of course you may be able to represent 100 simple text strings in fewer than 50 sub-expressions in a hand-optimized regex, so it all depends on how much work you want to have to put into this, and how often you're going to change the list. Perhaps a benchmark is in order here. Now I expect the relative speeds will vary quite a bit as the size of the string to search, the number of words in the match list, and the frequency of matches changes, but this may tell us something nonetheless (or not ;-).

It looks like index is the winner in this simple test, but I trust that wiser monks will let me know if and how I'm producing bad results!

First results when there is no match found:

Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10
+ CPU seconds...
     BigRE: 11 wallclock secs (10.56 usr +  0.00 sys = 10.56 CPU) @ 11
+690.90/s (n=123491)
     Index: 11 wallclock secs (10.61 usr +  0.00 sys = 10.61 CPU) @ 43
+921.29/s (n=465961)
     OptRE: 11 wallclock secs (10.59 usr +  0.00 sys = 10.59 CPU) @ 17
+239.59/s (n=182619)
  m// loop: 11 wallclock secs (10.50 usr +  0.00 sys = 10.50 CPU) @ 68
+25.71/s (n=71670)
[download]

Then results when there is always a match found:

Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10
+ CPU seconds...
     BigRE: 11 wallclock secs (10.39 usr +  0.00 sys = 10.39 CPU) @ 12
+548.26/s (n=130389)
     Index: 10 wallclock secs (10.53 usr +  0.00 sys = 10.53 CPU) @ 56
+956.89/s (n=599813)
     OptRE: 11 wallclock secs (10.53 usr +  0.00 sys = 10.53 CPU) @ 17
+720.63/s (n=186616)
  m// loop: 11 wallclock secs (10.48 usr +  0.00 sys = 10.48 CPU) @ 78
+73.90/s (n=82550)
[download]

Update: Damn, I knew I was missing something (probably still am...) - thanks dws! I don't keep spam around so here's the results for a chunk of 12k of text-only e-mail (as spam tends to be long-winded). I skipped the hand-optimized regex as I used nonsense words in the next case to ensure no match would be found and I didn't feel like thinking about it too hard. In both the following cases there are still just 7 words in the match list.

So, for no match:

Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10
+ CPU seconds...
     BigRE: 10 wallclock secs (10.59 usr +  0.00 sys = 10.59 CPU) @ 77
+.59/s (n=822)
     Index: 11 wallclock secs (10.50 usr +  0.00 sys = 10.50 CPU) @ 14
+27.58/s (n=14991)
  m// loop: 11 wallclock secs (10.47 usr +  0.00 sys = 10.47 CPU) @ 68
+6.50/s (n=7187)
[download]

With matches:

Benchmark: running BigRE, Index, OptRE, m// loop, each for at least 10
+ CPU seconds...
     BigRE: 10 wallclock secs (10.63 usr +  0.00 sys = 10.63 CPU) @ 20
+25.69/s (n=21523)
     Index: 11 wallclock secs (10.47 usr +  0.00 sys = 10.47 CPU) @ 19
+019.11/s (n=199092)
     OptRE: 11 wallclock secs (10.51 usr +  0.00 sys = 10.51 CPU) @ 27
+02.33/s (n=28415)
  m// loop: 11 wallclock secs (10.45 usr +  0.00 sys = 10.45 CPU) @ 95
+93.84/s (n=100294)
[download]

And the test code:

use strict;
use warnings;
use Benchmark;

my @matchlist = qw( this that these those some other too);
my $string = 'here we have a typical sentance which has a match near t
+he othr end';
my $bigregex = join '|', map quotemeta($_), @matchlist;
my $byhandre = 'th(is|at|ese|ose)|some|other|too';

timethese(shift, {
        'Index',    'SimpleIndex',
        'm// loop', 'Match',
        'BigRE',    'BigRE',
        'OptRE',    'ByHand',
                } );

sub SimpleIndex {
  for (@matchlist) { 
    return 1 if(index($string, $_) != $[-1 ); 
  }
  0;
}

sub Match {
  for (@matchlist) { 
    return 1 if( $string =~ /$_/io );  
  }
  0;
}

sub BigRE {
  return 1 if( $string =~ /$bigregex/io );
  0;
}

sub ByHand {
  return 1 if( $string =~ /$byhandre/io );
  0;
}
[download]

--
I'd like to be able to assign to an luser

In reply to Re: regex is too long by Albannach
in thread regex is too long by jhanna

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.