
If you want the alternations to match at the same starting position, you might be able to fiddle something together with look-ahead groups (not sure it works), but generally that doesn't work very well.

You could try to match once, reset pos to the previous starting position, remove the regex that caused the match, and try again. But I don't think that's very efficient.

If you don't want to match at the same position, you can use the /g modifier in a while loop to match multiple times.
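
As a minimal sketch of that /g loop (the sample text and the two patterns are made up for illustration), each iteration resumes where the previous match ended, so matches never overlap or share a starting position:

```perl
use strict;
use warnings;

# Hypothetical sample data; the patterns are placeholders.
my $text     = "foo bar foo baz";
my $pattern1 = qr/foo/;
my $pattern2 = qr/ba\w/;

my @hits;
# In scalar context, /g continues from pos($text) on every iteration.
while ($text =~ /\b($pattern1|$pattern2)\b/g) {
    # $-[0] is the start offset of the whole match, $1 the matched text.
    push @hits, [ $-[0], $1 ];
}

printf "%2d: %s\n", @$_ for @hits;
```

Here the loop should report one hit per word, each at a different offset.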

Re^4: Efficient regex matching with qr//; Can I do better?
by kruppy (Initiate) on Jul 14, 2008 at 14:20 UTC
    So you're suggesting something like this?

    while ($text =~ /(\b$pattern1\b)|(\b$pattern2\b)/g) {
        # Do something with $+
    }

    But then I'm unable to match at the same position. I have probably misunderstood you somehow because this solution doesn't need named captures as far as I can tell.

    If this is what you actually meant, what do you think about an iterative solution where you remove the matched patterns and match again until you find no more? I have absolutely no idea whether that would speed things up in the end, though...
      I have probably misunderstood you somehow because this solution doesn't need named captures as far as I can tell.

      It only needs named captures if you want to know which one of the regexes matched.
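
      A rough sketch of how named captures report which alternative matched, assuming Perl 5.10 or later. The %patterns hash, its keys and the sample text are invented for the example:

```perl
use strict;
use warnings;
use 5.010;    # named captures and %+ need Perl 5.10 or later

# Hypothetical patterns, keyed by the name used to report them.
my %patterns = (
    number => qr/\d+/,
    word   => qr/[a-z]+/,
);

# Build one alternation with a named group per pattern.
my $alternation = join '|',
    map { sprintf '(?<%s>\b%s\b)', $_, $patterns{$_} } sort keys %patterns;
my $re = qr/$alternation/;

my $text = "abc 42 def";
my @found;
while ($text =~ /$re/g) {
    # %+ only lists groups that actually captured something.
    my ($name) = keys %+;
    push @found, "$name=$+{$name}";
}

say for @found;
```

      Since only one alternative can take part in a given match, %+ holds exactly one key per iteration here.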

      But then I'm unable to match at the same position.

      Yes, that's true. If you really need it, and only have a relatively small number of matches, you can do something along these lines:

      my $re      = assemble_regex(\%hash);
      my $old_pos = pos;
      # /g is needed so that the match uses and updates pos
      while (m/\G$re/gc) {
          pos($_) = $old_pos;    # reset match position
          # ... extract name of matched regex in $matched_re here ...
          delete $hash{$matched_re};
          # re-generate regex, without the one that previously matched:
          $re = assemble_regex(\%hash);
      }

      Note that it'll be rather expensive to build the regex many times, so only do this if you have a relatively low number of matches.
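
      For a self-contained version of the same idea, here is one possible sketch. assemble_regex is a hypothetical helper (the thread never defines it), and the patterns and sample text are made up; it collects every pattern that matches at one fixed position by rebuilding the regex after each hit:

```perl
use strict;
use warnings;
use 5.010;    # for named captures and %+

# Hypothetical helper: join the named patterns into one alternation.
sub assemble_regex {
    my ($patterns) = @_;
    my $alt = join '|',
        map { sprintf '(?<%s>%s)', $_, $patterns->{$_} } sort keys %$patterns;
    return qr/$alt/;
}

# Made-up patterns; two of them can match at offset 0 of the text.
my %hash = (
    short => qr/foo/,
    long  => qr/foo\w+/,
    other => qr/bar/,
);

my $text  = "foobar";
my $start = 0;          # the position every attempt is anchored to
my @matched;

while (%hash) {
    my $re = assemble_regex(\%hash);    # rebuilt on every pass (costly)
    pos($text) = $start;                # reset match position
    last unless $text =~ /\G$re/gc;
    my ($name) = keys %+;               # the one group that captured
    push @matched, $name;
    delete $hash{$name};                # drop it and try the rest
}

say "matched at $start: @matched";
```

      The regex is recompiled once per hit, which is why this only pays off when few patterns match at any one position.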

        "It only needs named captures if you want to know which one of the regexes matched."

        Sorry if I'm slow, but can't I just as well find that out with the $+ variable as in my previous post? Or is that less efficient than using these named captures?