in reply to Alternations and anchors

The regex engine has many optimizations and heuristics to speed things up, you are just falling back to "normal" speed.

> Is there some weirdness with alternations and anchors?

It's a feature called "trie optimization"¹ which only works for literal characters and can't optimize meta symbols like anchors.

> (I could rewrite ^aaa|^bbb|^ccc as ^(aaa|bbb|ccc) , but it might be that only some of the inputs are anchored.)

So make two or-lists one with preceding anchor, one without and or them. Both will be trie-optimized and this should be max speed again.

Like:

((?:^(?:aaa|bbb|ccc))|(?:ddd|eee|ccc))

*untested!!!*

(?:...) is for grouping without matching capturing.

(Not sure if the one around the anchored part is necessary... I don't think so, just playing safe)

Cheers Rolf
(addicted to the Perl Programming Language :)
see Wikisyntax for the Monastery

¹) compare Re^2: Looking for a cleaner regex ( trie since 5.10 ! )

Update

Tested in perldebugger

perl -de0 ... DB<11> @matches = ( "aaa x ddd y ccc" =~ /(^ (?:aaa|bbb|ccc) | (?:dd +d|eee|ccc) )/xg ); DB<12> print "@matches" aaa ddd ccc DB<13>

Replies are listed 'Best First'.
Re^2: Alternations and anchors
by jwkrahn (Abbot) on Mar 31, 2025 at 01:41 UTC
    (?:...) is for grouping without matching.

    (?:...) is for grouping without capturing.

    Naked blocks are fun! -- Randal L. Schwartz, Perl hacker
Re^2: Alternations and anchors (trie optimization)
by Chuma (Scribe) on Apr 02, 2025 at 16:33 UTC

    Thanks! I guess that explains it. I'm not sure why ((?:^(?:aaa|bbb|ccc))|(?:ddd|eee|ccc)) can be optimised but not (?:^aaa|ddd), but it is what it is.

      > I'm not sure why ((?:^(?:aaa|bbb|ccc))|(?:ddd|eee|ccc)) can be optimised but not (?:^aaa|ddd),

      As I said:

      Because "aaa" is a string of literal characters but "^aaa" isn't. "^" is a meta symbol for anchoring. The literal character is "\^", but that's not what you want.

      If you need better explanation how trie works, please follow the links I provided.

      The long answer

      It certainly could be implemented for surrounding anchors too, just by using the workaround I showed.

      But it's normally a bargain between code complexity and trade-off.

      Perl's source is already suffering from covering too many edge cases which are only of interest for a small minority of users. And they shout the loudest when backwards compatibility is broken.

      At the same time the code is getting increasingly complicated to maintain.

      For instance: you could volunteer to implement a solution which creates multiple trie alterations surrounded by different anchors. (Not trivial to test)

      Then someone with a commit bit has to decide if it's worth the resulting trouble to test and maintain that code in eternity.

      In the end it's strategically easier to provide a CPAN module covering this edge case.

      Usage would show if it's of wider interest or just pleasing you and a handful of others.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

        Usage would show if it's of wider interest or just pleasing you and a handful of others.

        Without knowing more about the OPs specific use case, i certainly don't know if this attempt at optimization is actually required or if we're dealing with an X/Y problem here.

        For example, if the input dataset is refered often, but hardly ever changes, caching the parsed results might decrease wait time orders of magnitude more in the long run than trying to optimize the parsing time. If new or changed data is also not used immediately, throw in a bit of pre-computing, and we're basically down to the time it requires to read a file from disk (and even then there are ways to optimize it, for example by pre-caching in RAM).

        PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP
        Also check out my sisters artwork and my weekly webcomics