in reply to Efficient regex matching with qr//; Can I do better?

... There are in the order of 20,000 texts and 50,000-100,000 patterns,

In general it's much faster to match one large regex than many regexes many times.

That means you could try to assemble a regex of $x original regexes into one.

Now it seems you have to know which regex matched, which means you have to distinguish them. In perl 5.10.0 or above you can use named captures. If you can't require such a new perl version, you can try something like this instead:

our $which_matched; sub assemble_regex { my %regexes = @_; return join '|', map { q[(?:$regexes{$_})(?{\$which_matched='$_'})]} keys %regexes; }

This assumes that keys in %regexes don't contain single quotes and trailing backslashes.

If many of the patterns are constant strings, consider upgrading to perl 5.10.0 - it greatly speeds up matching of many constant alternatives.

Replies are listed 'Best First'.
Re^2: Efficient regex matching with qr//; Can I do better?
by kruppy (Initiate) on Jul 14, 2008 at 12:17 UTC
    Ok, so I managed to get version 5.10 on to my local machine at least. But I am not really sure how one would construct such large, aggregated, regexes. I didn't get any further than this:
    if ($text =~ /(?<p1>\b$pattern1\b)|(?<p2>\b$pattern2\b)/) { foreach my $foo (keys %+) {print $foo.','.$+{$foo}."\n";} }
    but then of course I only get one of the matches (even if both match).

    Could you please give me a little example of such a construct? The documentation didn't help me much, I'm afraid.

    Many thanks, Ola
      If you want the alternations to match at the same starting position, you might be able to fiddle something together with look-ahead groups (not sure it works), but generally that doesn't work very well.

      You could try to match once, reset pos to the previous starting position, remove the regex that caused the match and retry again. But I don't think that's very efficient.

      If you don't want to match at the same position, you can use the /g modifier in a while loop to match multiple times.

        So you're suggesting something like
        while ($text =~ /(\b$pattern1\b)|(\b$pattern2\b)/g) { # Do something with $+ }
        , or? But then I'm unable to match at the same position. I have probably misunderstood you somehow because this solution doesn't need named captures as far as I can tell.

        IF this is what you actually did mean, what do you think about some iterative solution where you remove the matched patterns and match again until you find no more? I have absolutely no idea whether that would speed up things in the end, though...