Chuma has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I'm trying to search for several regexes in some long files. To speed things up, I tried first checking a combined regex, to see if any of them matches the line. Like so:

my $comb=join('|',@ARGV); while($line=<$infile>){ if($line=~$comb){ for $target(@ARGV){ if($line=~$target){ # do a thing }}}}

This seems to speed things up, at least when the regexes are just plain words. But: If I try input regexes which are anchored ("^word"), then suddenly it's much slower! Is there some weirdness with alternations and anchors? Or did I make some obvious mistake?

(I could rewrite ^aaa|^bbb|^ccc as ^(aaa|bbb|ccc), but it might be that only some of the inputs are anchored.)

Replies are listed 'Best First'.
Re: Alternations and anchors
by GrandFather (Saint) on Mar 30, 2025 at 23:56 UTC

    I'll let others focus on bench marking and foibles of the regex engine. I note though that you do more work than needed. Consider:

    use warnings; use strict; my $fileText = <<TEXT; Dear monks, I'm trying to search for several regexes in some long files. To speed +things up, I tried first checking a combined regex, to see if any of them matches + the line. This seems to speed things up, at least when the regexes are just plai +n words. But: If I try input regexes which are anchored ("^word"), then suddenl +y it's much slower! Is there some weirdness with alternations and anchors? Or + did I make some obvious mistake? (I could rewrite ^aaa|^bbb|^ccc as ^(aaa|bbb|ccc), but it might be tha +t only some of the inputs are anchored.) Like so: TEXT my @argList = qw(regexes that ^I); my $comb = join ('|', @argList); print ">>> OP version\n"; open my $infile, "<", \$fileText; while (my $line = <$infile>) { next if $line !~ $comb; for my $target(@argList) { next if $line !~ $target; print "Matched '$target' in: $line"; } } print "\n>>> Grandfather version\n"; open $infile, "<", \$fileText; while (my $line = <$infile>) { print "Matched '$1' in: $line" while $line =~ /($comb)/g; }

    Prints:

    >>> OP version Matched 'regexes' in: I'm trying to search for several regexes in some + long files. To speed things up, Matched '^I' in: I'm trying to search for several regexes in some long + files. To speed things up, Matched '^I' in: I tried first checking a combined regex, to see if an +y of them matches the line. Matched 'regexes' in: This seems to speed things up, at least when the + regexes are just plain words. Matched 'regexes' in: But: If I try input regexes which are anchored ( +"^word"), then suddenly it's Matched 'that' in: (I could rewrite ^aaa|^bbb|^ccc as ^(aaa|bbb|ccc), +but it might be that only >>> Grandfather version Matched 'I' in: I'm trying to search for several regexes in some long +files. To speed things up, Matched 'regexes' in: I'm trying to search for several regexes in some + long files. To speed things up, Matched 'I' in: I tried first checking a combined regex, to see if any + of them matches the line. Matched 'regexes' in: This seems to speed things up, at least when the + regexes are just plain words. Matched 'regexes' in: But: If I try input regexes which are anchored ( +"^word"), then suddenly it's Matched 'that' in: (I could rewrite ^aaa|^bbb|^ccc as ^(aaa|bbb|ccc), +but it might be that only
    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Alternations and anchors (trie optimization)
by LanX (Saint) on Mar 31, 2025 at 01:16 UTC
    The regex engine has many optimizations and heuristics to speed things up, you are just falling back to "normal" speed.

    > Is there some weirdness with alternations and anchors?

    It's a feature called "trie optimization"¹ which only works for literal characters and can't optimize meta symbols like anchors.

    > (I could rewrite ^aaa|^bbb|^ccc as ^(aaa|bbb|ccc) , but it might be that only some of the inputs are anchored.)

    So make two or-lists one with preceding anchor, one without and or them. Both will be trie-optimized and this should be max speed again.

    Like:

    ((?:^(?:aaa|bbb|ccc))|(?:ddd|eee|ccc))

    *untested!!!*

    (?:...) is for grouping without matching capturing.

    (Not sure if the one around the anchored part is necessary... I don't think so, just playing safe)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

    ¹) compare Re^2: Looking for a cleaner regex ( trie since 5.10 ! )

    Update

    Tested in perldebugger

    perl -de0 ... DB<11> @matches = ( "aaa x ddd y ccc" =~ /(^ (?:aaa|bbb|ccc) | (?:dd +d|eee|ccc) )/xg ); DB<12> print "@matches" aaa ddd ccc DB<13>

      (?:...) is for grouping without matching.

      (?:...) is for grouping without capturing.

      Naked blocks are fun! -- Randal L. Schwartz, Perl hacker

      Thanks! I guess that explains it. I'm not sure why ((?:^(?:aaa|bbb|ccc))|(?:ddd|eee|ccc)) can be optimised but not (?:^aaa|ddd), but it is what it is.

        > I'm not sure why ((?:^(?:aaa|bbb|ccc))|(?:ddd|eee|ccc)) can be optimised but not (?:^aaa|ddd),

        As I said:

        Because "aaa" is a string of literal characters but "^aaa" isn't. "^" is a meta symbol for anchoring. The literal character is "\^", but that's not what you want.

        If you need better explanation how trie works, please follow the links I provided.

        The long answer

        It certainly could be implemented for surrounding anchors too, just by using the workaround I showed.

        But it's normally a bargain between code complexity and trade-off.

        Perl's source is already suffering from covering too many edge cases which are only of interest for a small minority of users. And they shout the loudest when backwards compatibility is broken.

        At the same time the code is getting increasingly complicated to maintain.

        For instance: you could volunteer to implement a solution which creates multiple trie alterations surrounded by different anchors. (Not trivial to test)

        Then someone with a commit bit has to decide if it's worth the resulting trouble to test and maintain that code in eternity.

        In the end it's strategically easier to provide a CPAN module covering this edge case.

        Usage would show if it's of wider interest or just pleasing you and a handful of others.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        see Wikisyntax for the Monastery