Here's one for regex ninjas.

Let's look at this example, a simple grep:

my $pat = shift; while (<>) { print if /$pat/; }

As you certainly all know, Perl likes to build the machine for a regex when it compiles the program. For regexes with variables in them, it rebuilds the machine every time it uses the regex. So here, Perl rebuilds the machine every time through the loop, which makes this program really slow.

You certainly also know that this program can be optimized with the o-flag:

my $pat = shift; while (<>) { print if /$pat/o; }

This tells Perl that $pat never changes. Perl will compile the machine for /$pat/ the first time it uses the regex and then remember it for later.
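A minimal sketch of what "never changes" means in practice. The helper name matches_frozen is made up for illustration and is not part of the original example. If the variable does change after the first use, the /o regex should keep matching against the first pattern:

```perl
use strict;
use warnings;

# Hypothetical helper: with /o, the pattern is interpolated and
# compiled only the first time this match operator is executed.
sub matches_frozen {
    my ($pat, $text) = @_;
    return $text =~ /$pat/o ? 1 : 0;
}

my $first  = matches_frozen('foo', 'foo');   # compiles /foo/ and remembers it
my $second = matches_frozen('bar', 'bar');   # still tests 'bar' against /foo/

print "first=$first second=$second\n";
```

The second call fails even though 'bar' obviously matches /bar/, because the new value of $pat is never looked at again.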

The situation is a bit more complicated if you want to match more than one pattern, e.g.:

my @pats = ('fo*', 'ba.', 'w+3'); while (<>) { foreach my $pat (@pats) { print if /$pat/; } }

Obviously the /../o trick won't work here, because then only the first pattern would be compiled. But this program can be made much faster by joining all patterns into one:

my @pats = ('fo*', 'ba.', 'w+3'); my $pat = join('|', @pats); while (<>) { print if /$pat/o; }

So far, so good. Now for my problem. Let's assume we have a little plugin system, and plugins can register functions to be called when a given regex matches a line.

Our program will store pairs of </pattern/i, funcref> in a hash %patterns and then basically do something like this:

for my $line (@lines) {
    for my $pattern (keys(%patterns)) {
        if (my @params = ($line =~ $pattern)) {
            my $func = $patterns{$pattern};
            if (defined($func)) {
                $func->(@params);
            }
        }
    }
}

Again, this is very slow, because Perl needs to rebuild the machine for the regex every time through the loop, for every line. But the problem is: I can't use the trick shown above (joining all patterns into one string with '|'), because then I can't tell which pattern in the string matched, and so I don't know which function to call.
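To make the difficulty concrete, here is a small sketch. The two patterns and their handler subs are hypothetical stand-ins for whatever plugins would register. The joined regex matches fine, but the joined string is not a key of %patterns, so there is nothing to look the function up with:

```perl
use strict;
use warnings;

# Hypothetical registry with the same shape as %patterns above.
my %patterns = (
    'fo+' => sub { return 'foo handler' },
    'ba.' => sub { return 'bar handler' },
);

# Join all patterns into one alternation (sorted for a stable order).
my $joined = join '|', sort keys %patterns;

# The combined regex tells us THAT something matched ...
my $matched = 'a bar' =~ /$joined/ ? 1 : 0;

# ... but not WHICH pattern, so the lookup comes back empty.
my $func = $patterns{$joined};

print "matched=$matched, handler ",
      (defined $func ? 'found' : 'unknown'), "\n";
```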

I tried a lot of different things without success, so now I hope for the expertise of the Perl Monks. How could this mechanism be optimized? Is there any way at all? Perhaps my approach is total bullshit and there's a much better way to do this? I'm looking forward to your ideas!

(Examples taken from http://perl.plover.com/Regex/article.html)


In reply to Optimize a pluggable regex matching mechanism by dredd
