Optimize a pluggable regex matching mechanism

dredd has asked for the wisdom of the Perl Monks concerning the following question:

Here's one for regex ninjas.

Let's look at this example, a simple grep:

my $pat = shift;

while (<>) {
    print if /$pat/;
}
[download]

As you certainly all know, Perl likes to build the machine for a regex when it compiles the program. For regexes with variables in them, it rebuilds the machine every time it uses the regex. So here, Perl rebuilds the machine every time through the loop, which makes this program really slow.

You certainly also know that this program can be optimized with the o-flag:

my $pat = shift;

while (<>) {
    print if /$pat/o;
}
[download]

This tells Perl that $pat never changes. Perl will compile the machine for /$pat/ the first time it uses the regex and then remember it for later.

The situation is a bit more complicated if you want to match more than one pattern, e.g.:

my @pats = ('fo*', 'ba.', 'w+3');

while (<>) {
    foreach $pat (@pats) {
        print if /$pat/;
    }
}
[download]

Obviously the /../o trick won't work here, because then only the first pattern would be compiled. But this program can be made much faster by joining all patterns into one:

my @pats = ('fo*', 'ba.', 'w+3');
my $pat = join('|', @pats);

while (<>) {
    print if /$pat/o;
}
[download]

So far, so good. Now for my problem. Let's asume we have a little plugin system and plugins can register functions to be called when a given regex matches a line.

Our program will store pairs of </pattern/i, funcref> in a hash %patterns and then basically do something like this:

for my $line (@lines) {
    for my $pattern (keys(%patterns)) {
        if (my @params = ($line =~ $pattern)) {
            my $func = $patterns{$pattern};
            if (defined($func)) {
                $func->(@params);
            }
        }
    }
}
[download]

Again, this is very slow, because Perl needs to rebuild the machine for the regex every time through the loop, for every line. But the problem is: I can't use the trick shown above (join all patterns into one string with '|') because then I can't decide which pattern in the string matched and so I don't know which function to call.

I tried a lot of different things without success, so now I hope for the expertise of the Perl Monks. How could this mechanism be optimized? Is there any way at all? Perhaps my approach is total bullshit and there's a much better one to do this? I'm looking forward to your ideas!

(Examples taken from http://perl.plover.com/Regex/article.html

Comment on Optimize a pluggable regex matching mechanism Select or Download Code

Replies are listed 'Best First'.
Re: Optimize a pluggable regex matching mechanism by ikegami (Patriarch) on Nov 09, 2009 at 22:46 UTC
For regexes with variables in them, it rebuilds the machine every time it uses the regex. No it doesn't. If the contents of the interpolated variable doesn't change, the pattern isn't recompiled. `$ perl -Mre=debug -e'/$_/ for 1..2' 2>&1 \| grep Compiling Compiling REx "1" Compiling REx "2" $ perl -Mre=debug -e'$x="foo"; /$x/ for 1..2' 2>&1 \| grep Compiling Compiling REx "foo"` [download] You certainly also know that this program can be optimized with the o-flag: The following would be fine since `$pat` is the same for every pass of the loop: `my $pat = 'id\\d+'; while (<>) { print if /$pat/; }` [download] You could also use qr// instead of /o. `my $pat = qr/id\d+/; while (<>) { print if /$pat/; }` [download] It's even simpler to write patterns using qr// than as strings. (Note the extra slash above.) Obviously the /../o trick won't work here, Yet another reason to use qr//. `my @pats = ( qr/fo/, qr/ba./, qr/w+3/ ); while (<>) { for my $pat (@pats) { print if /$pat/; } }` [download] But the problem is: I can't use the trick shown above (join all patterns into one string with '\|')* No need for it `my @handlers = ( { pattern => qr/.../, callback => sub { ... } }, { pattern => qr/.../, callback => sub { ... } }, { pattern => qr/.../, callback => sub { ... } }, ); for my $line (@lines) { for my $handler (@handlers) { if (my @params = ($line =~ $handler->{$pattern})) { my $func = $handler->{callback}; $func->(@params) if $func; } } }` [download] By using an array instead of hash, you even get the bonus of being able to place the most commonly matched handlers first.	[reply] [d/l] [select]
Re: Optimize a pluggable regex matching mechanism by moritz (Cardinal) on Nov 09, 2009 at 22:24 UTC
You can use the experimental `(?{...})` code groups in the generated regex: `use strict; use warnings; use 5.010; use re 'eval'; my %h = ( '^foo' => sub { say 'foo' }, '^bar' => sub { say 'bar' }, ); my $what; my $re = join '\|', map { "$_(?{\$what = '$_'})" } keys %h; if ('foo' =~ m/$re/o) { $h{$what}->(); }` [download] But be sure to read the warnings in perlre. Note that this requires the keys not to contain ' or \, but you could easily give them identifier labels instead, or properly escape $_ before interpolation. (5.010 only required for say). Perl 6 - links to (nearly) everything that is Perl 6.	[reply] [d/l] [select]
Re: Optimize a pluggable regex matching mechanism by AnomalousMonk (Archbishop) on Nov 09, 2009 at 23:04 UTC
Consider also Regexp::Assemble and the following code example taken (more or less) verbatim from it: `>perl -wMstrict -MRegexp::Assemble -le "my $r = Regexp::Assemble->new->track(1)->add(qw(foo? bar{2} [Rr]at)); for my $w (qw(this food is rather barren)) { if ($w =~ /$r/) { print qq{'$w' matched by /}, $r->source($^R), q{/}; } else { print qq{'$w' unmatched}; } } " 'this' unmatched 'food' matched by /foo?/ 'is' unmatched 'rather' matched by /[Rr]at/ 'barren' matched by /bar{2}/` [download] I haven't checked, but my guess is that this or a related method or methods from the module will give you a fairly well optimized solution.	[reply] [d/l]
Re: Optimize a pluggable regex matching mechanism (/o practically deprecated) by LanX (Saint) on Nov 10, 2009 at 08:25 UTC
For regexes with variables in them, it rebuilds the machine every time it uses the regex. IIRC it's long since this is not the case anymore. Perl's checking a regex-cache to see if it has already compiled the pattern. That's why the /o flag is almost useless now (Beware: it can produce ugly side effects if not used with care) (Though I couldn't find it stated in the perldocs :-( ... I think I should have read it in Friedl's regex "bible") As Ikegami already suggested, stick with qr is certainly the cleanest approach to control compilation time, even with nested regexes, so better just forget /o! š Cheers Rolf UPDATE: (1) or better only try to remember it as a possible optimization AFTER profiling shows a bottleneck!	[reply]
Re^2: Optimize a pluggable regex matching mechanism (/o practically deprecated) by ikegami (Patriarch) on Nov 10, 2009 at 16:02 UTC
Perl's checking a regex-cache to see if it has already compiled the pattern. No, the cache is not nearly so general. Each op simply remembers the last pattern it compiled. `>perl -Mre=debug -e"/$_/ for qw( a a b a )" 2>&1 \| find "Compiling" Compiling REx "a" Compiling REx "b" Compiling REx "a"` [download] `>perl -Mre=debug -e"/a/; /a/;" 2>&1 \| find "Compiling" Compiling REx "a" Compiling REx "a"` [download] Nothing's stopping you from making your own such cache, though. `my $compiled_pat = $compiled_pats{$pat} \|\|= qr/$pat/;` [download]	[reply] [d/l] [select]
Re^3: Optimize a pluggable regex matching mechanism (/o practically deprecated) by LanX (Saint) on Nov 10, 2009 at 16:52 UTC
For completeness, the following code stresses out that the last compilation per code position is reused if possible. `perl -Mre=debug -e '@x=qw/a b a a c a/; while (($a,$b,@x)=@x) { /$a/;/ +$b/ }' 2>&1\|grep Compiling Compiling REx "a" Compiling REx "b" Compiling REx "a" Compiling REx "c"` [download] Each op remembers the last pattern it compiled. Yeah, that's kind of what I expected (was too lazy for details), but a general cache would be indeed real overkill. Both ways there is not much use left for /o, IMHO the perlre and perlretut should be updated with at least footnote... Cheers Rolf	[reply] [d/l]