Re^2: pattern matching with large regex

Most of the regex strings are constant; a few hundred may contain simple constructs like alternation and character classes: (f?oo|bar|baz|etc)[\w\-]*\.[0-9]{3,} We only extract the data if it matches. As many have suggested, I benchmarked a typical case with the actual data, and unless something is wrong with the benchmark, the difference is extreme:

my %cases = (
  'one_large'  => sub { if($text=~/(stuff?)m0r3(?:[^:]*\.)?($big_string)/i){my $match="$1:$2"}},
  'many_small' => sub { for(@strings){ if($text=~/(stuff?)m0r3(?:[^:]*\.)?($_)/i){my $match="$1:$2"}}},
);

print 
'$text       = ', length $text,       " characters\n",
'$big_string = ', length $big_string, " characters\n", 
'@strings    = ', scalar @strings,    " items\n\n";

cmpthese( 0, \%cases);

Results:

$text       = 4578 characters
$big_string = 210724 characters
@strings    = 10634 items

             Rate many_small  one_large
many_small 1.05/s         --      -100%
one_large   630/s     60089%         --
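
For anyone who wants to reproduce the comparison without the original data, here is a minimal, self-contained sketch. The synthetic @strings, the sample $text, and the idea that $big_string is simply the strings joined with | are my assumptions, not details from the post. The subs return the match (unlike the benchmark above) so the two cases can be checked against each other.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic stand-ins for the real data (assumptions, not the poster's data).
my @strings    = map { sprintf 'item%04d', $_ } 1 .. 500;
my $big_string = join '|', @strings;    # assumed construction of the big pattern
my $text       = ('filler text ' x 50) . 'stufm0r3abc.item0250 tail';

# One match attempt against the combined alternation.
sub one_large {
    my $match;
    $match = "$1:$2" if $text =~ /(stuff?)m0r3(?:[^:]*\.)?($big_string)/i;
    return $match;
}

# One match attempt per string; like the original loop, it never exits early.
sub many_small {
    my $match;
    for my $s (@strings) {
        $match = "$1:$2" if $text =~ /(stuff?)m0r3(?:[^:]*\.)?($s)/i;
    }
    return $match;
}

# Run each case for at least one CPU second and print the comparison table.
cmpthese( -1, { one_large => \&one_large, many_small => \&many_small } );
```

The rates will differ from the numbers above, since the data is much smaller, but the shape of the comparison should be similar.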

Re^3: pattern matching with large regex by Tanktalus (Canon) on Aug 13, 2005 at 23:23 UTC
Not having any of the data that you're working with, all I can do is offer suggestions that may or may not help; I can't test them first to find out whether I should have kept my mouth shut. ;-) So, I'm just curious what happens when you a) use a regexp optimiser from CPAN to "optimise" $big_string (of course, proving that the optimisation didn't break anything would be a bit painful), and b) pre-compile your @strings, e.g.:

print
'$text       = ', length $text,       " characters\n",
'$big_string = ', length $big_string, " characters\n",
'@strings    = ', scalar @strings,    " items\n\n";

my $big_regexp = Regexp::Optimizer->new()->optimize($big_string);
my @small_regexps = map { qr/$_/i } @strings;

my %cases = (
  'one_large'  => sub { if($text=~/(stuff?)m0r3(?:[^:]*\.)?($big_regexp)/i){my $match="$1:$2"}},
  'many_small' => sub { for(@small_regexps){ if($text=~/(stuff?)m0r3(?:[^:]*\.)?($_)/i){my $match="$1:$2"}}},
);

cmpthese( 0, \%cases);
Re^4: pattern matching with large regex by Anonymous Monk on Aug 14, 2005 at 07:43 UTC
Pre-compiling @strings had no effect. Inherent laziness prevents me from optimising $big_string since it's plenty fast.
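
One possible reason pre-compiling made no difference: when a qr// object is interpolated into a larger pattern, Perl still has to compile the combined pattern, and it does so again each time the interpolated part changes. A rough sketch of the alternative, compiling the whole pattern once per string, is below. The data and the sub names are made up for illustration, and I have not measured this against the real data.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the real data.
my @strings = map { sprintf 'item%04d', $_ } 1 .. 500;
my $text    = ('filler text ' x 50) . 'stufm0r3abc.item0250 tail';

# Fragments compiled on their own: the outer pattern is still rebuilt in the loop.
my @fragments = map { qr/$_/i } @strings;

# Whole pattern compiled once per string, so the loop only has to match.
my @full = map { qr/(stuff?)m0r3(?:[^:]*\.)?($_)/i } @strings;

sub interpolated {
    my $match;
    for my $re (@fragments) {
        $match = "$1:$2" if $text =~ /(stuff?)m0r3(?:[^:]*\.)?($re)/i;
    }
    return $match;
}

sub precompiled {
    my $match;
    for my $re (@full) {
        $match = "$1:$2" if $text =~ $re;
    }
    return $match;
}

print interpolated(), "\n", precompiled(), "\n";
```

Both subs should find the same match; whether the second is noticeably faster would need to be benchmarked.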
Re^3: pattern matching with large regex by lidden (Curate) on Aug 13, 2005 at 21:46 UTC
In your 'one_large' example you get the first match. In 'many_small' you get the last one. Try adding a `last` when you get a match in the for loop and see what happens.
Re^4: pattern matching with large regex by Anonymous Monk on Aug 13, 2005 at 22:10 UTC
Nice catch, but `last` won't help here because a match will be the exception. Most of the time we check it all and fail to match, but in production `last` definitely belongs there.
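
To make the first-versus-last point concrete, here is a small sketch with made-up data (the text, the two strings, and the sub names are all illustrative). Without `last` the loop reports whichever string in the list matched last; with `last` it reports the first string in the list that matches; the single alternation reports the leftmost match in the text, which need not be either of those.

```perl
use strict;
use warnings;

# Made-up data: both strings can match, at different places in the text.
my $text    = 'stuffm0r3x.foo and stufm0r3bar';
my @strings = qw(bar foo);

# No early exit: a later hit overwrites an earlier one.
sub last_in_list {
    my $match;
    for my $s (@strings) {
        $match = "$1:$2" if $text =~ /(stuff?)m0r3(?:[^:]*\.)?($s)/i;
    }
    return $match;
}

# Early exit: stop at the first string that matches.
sub first_in_list {
    my $match;
    for my $s (@strings) {
        if ( $text =~ /(stuff?)m0r3(?:[^:]*\.)?($s)/i ) {
            $match = "$1:$2";
            last;
        }
    }
    return $match;
}

# Single alternation: the regex engine returns the leftmost match in $text.
sub one_large {
    my $alternation = join '|', @strings;
    return $text =~ /(stuff?)m0r3(?:[^:]*\.)?($alternation)/i ? "$1:$2" : undef;
}

print "last in list:  ", last_in_list(),  "\n";   # stuff:foo
print "first in list: ", first_in_list(), "\n";   # stuf:bar
print "one large:     ", one_large(),     "\n";   # stuff:foo
```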


	PerlMonks