Alternations and anchors

Chuma has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Alternations and anchors by GrandFather (Saint) on Mar 30, 2025 at 23:56 UTC
I'll let others focus on bench marking and foibles of the regex engine. I note though that you do more work than needed. Consider: use warnings; use strict; my $fileText = <<TEXT; Dear monks, I'm trying to search for several regexes in some long files. To speed +things up, I tried first checking a combined regex, to see if any of them matches + the line. This seems to speed things up, at least when the regexes are just plai +n words. But: If I try input regexes which are anchored ("^word"), then suddenl +y it's much slower! Is there some weirdness with alternations and anchors? Or + did I make some obvious mistake? (I could rewrite ^aaa\|^bbb\|^ccc as ^(aaa\|bbb\|ccc), but it might be tha +t only some of the inputs are anchored.) Like so: TEXT my @argList = qw(regexes that ^I); my $comb = join ('\|', @argList); print ">>> OP version\n"; open my $infile, "<", \$fileText; while (my $line = <$infile>) { next if $line !~ $comb; for my $target(@argList) { next if $line !~ $target; print "Matched '$target' in: $line"; } } print "\n>>> Grandfather version\n"; open $infile, "<", \$fileText; while (my $line = <$infile>) { print "Matched '$1' in: $line" while $line =~ /($comb)/g; } [download] Prints: >>> OP version Matched 'regexes' in: I'm trying to search for several regexes in some + long files. To speed things up, Matched '^I' in: I'm trying to search for several regexes in some long + files. To speed things up, Matched '^I' in: I tried first checking a combined regex, to see if an +y of them matches the line. Matched 'regexes' in: This seems to speed things up, at least when the + regexes are just plain words. Matched 'regexes' in: But: If I try input regexes which are anchored ( +"^word"), then suddenly it's Matched 'that' in: (I could rewrite ^aaa\|^bbb\|^ccc as ^(aaa\|bbb\|ccc), +but it might be that only >>> Grandfather version Matched 'I' in: I'm trying to search for several regexes in some long +files. To speed things up, Matched 'regexes' in: I'm trying to search for several regexes in some + long files. To speed things up, Matched 'I' in: I tried first checking a combined regex, to see if any + of them matches the line. Matched 'regexes' in: This seems to speed things up, at least when the + regexes are just plain words. Matched 'regexes' in: But: If I try input regexes which are anchored ( +"^word"), then suddenly it's Matched 'that' in: (I could rewrite ^aaa\|^bbb\|^ccc as ^(aaa\|bbb\|ccc), +but it might be that only [download] Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond	[reply] [d/l] [select]
Re: Alternations and anchors (trie optimization) by LanX (Saint) on Mar 31, 2025 at 01:16 UTC
The regex engine has many optimizations and heuristics to speed things up, you are just falling back to "normal" speed. > Is there some weirdness with alternations and anchors? It's a feature called "trie optimization"š which only works for literal characters and can't optimize meta symbols like anchors. > (I could rewrite `^aaa\|^bbb\|^ccc` as `^(aaa\|bbb\|ccc)` , but it might be that only some of the inputs are anchored.) So make two or-lists one with preceding anchor, one without and or them. Both will be trie-optimized and this should be max speed again. Like: `((?:^(?:aaa\|bbb\|ccc))\|(?:ddd\|eee\|ccc))` ~~untested!!!~~ `(?:...)` is for grouping without ~~matching~~ capturing. (Not sure if the one around the anchored part is necessary... I don't think so, just playing safe) Cheers Rolf _{(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery} š) compare Re^2: Looking for a cleaner regex ( trie since 5.10 ! ) Update Tested in perldebugger `perl -de0 ... DB<11> @matches = ( "aaa x ddd y ccc" =~ /(^ (?:aaa\|bbb\|ccc) \| (?:dd +d\|eee\|ccc) )/xg ); DB<12> print "@matches" aaa ddd ccc DB<13>` [download]	[reply] [d/l] [select]
Re^2: Alternations and anchors by jwkrahn (Abbot) on Mar 31, 2025 at 01:41 UTC
`(?:...)` is for grouping without matching. `(?:...)` is for grouping without capturing. Naked blocks are fun! -- Randal L. Schwartz, Perl hacker	[reply]
Re^3: Alternations and anchors by LanX (Saint) on Mar 31, 2025 at 01:51 UTC
Thanks, that's what I meant. Corrected. Cheers Rolf _{(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery}	[reply]
Re^2: Alternations and anchors (trie optimization) by Chuma (Scribe) on Apr 02, 2025 at 16:33 UTC
Thanks! I guess that explains it. I'm not sure why `((?:^(?:aaa\|bbb\|ccc))\|(?:ddd\|eee\|ccc))` can be optimised but not `(?:^aaa\|ddd)`, but it is what it is.	[reply] [d/l] [select]
Re^3: Alternations and anchors (trie optimization) by LanX (Saint) on Apr 02, 2025 at 16:50 UTC
> I'm not sure why ((?:^(?:aaa\|bbb\|ccc))\|(?:ddd\|eee\|ccc)) can be optimised but not (?:^aaa\|ddd), As I said: Because "aaa" is a string of literal characters but "^aaa" isn't. "^" is a meta symbol for anchoring. The literal character is "\^", but that's not what you want. If you need better explanation how trie works, please follow the links I provided. The long answer It certainly could be implemented for surrounding anchors too, just by using the workaround I showed. But it's normally a bargain between code complexity and trade-off. Perl's source is already suffering from covering too many edge cases which are only of interest for a small minority of users. And they shout the loudest when backwards compatibility is broken. At the same time the code is getting increasingly complicated to maintain. For instance: you could volunteer to implement a solution which creates multiple trie alterations surrounded by different anchors. (Not trivial to test) Then someone with a commit bit has to decide if it's worth the resulting trouble to test and maintain that code in eternity. In the end it's strategically easier to provide a CPAN module covering this edge case. Usage would show if it's of wider interest or just pleasing you and a handful of others. Cheers Rolf _{(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery}	[reply]
Re^4: Alternations and anchors (trie optimization) by cavac (Prior) on Apr 03, 2025 at 11:44 UTC
Re^5: Alternations and anchors (trie optimization) by LanX (Saint) on Apr 03, 2025 at 12:09 UTC
Some notes below your chosen depth have not been shown here

Update

The long answer