in reply to Recursive sub-pattern spectacularly slow, what's wrong? (me, or just this use case, or a bug?)

(now it's not related to the PWC task), w.r.t. "horrible amount of backtracking", I think these 2 patterns

/(.(?1)(*F))/ /(.*)(*F)/

should backtrack about the same amount. Yet one is 4 orders of magnitude slower (with sample sizes and hardware I'm using) and clearly exponential. The other stays linear. Sorry I didn't re-run it with more "REPEATS" and am leaving it untidy, it's just "mad science" for laughs of course i.e. I now see recursive pattern wasn't a good idea to use for the task.

1.4 +-------------------------------------------------+ | + + + + + + | | A| | | | | | A| | A | | | | A | | | | A | 1.2 |-+ +-| | A | | A | | | | A | | | | A | | A | | | | A | | | 1 |-+ A +-| | A | | | | A | | A | | | | A | | A | | | | A | | A | 0.8 |-+ +-| | A | | A | | | | A | | A | | | | AA | | | | A | 0.6 |-+ A +-| | | | AA | | | | A | | A | | A | | | | A | | A | | | 0.4 |-+ A +-| | A | | A | | AA | | | | AA | | A | | | | AAA | | A | | A | 0.2 |-+ A +-| | A | | A | | AAA | | AA | | A | | AA | | AA | | AAA | | AAA | |A + + + + + + | 0 +-------------------------------------------------+ 1000 2000 3000 4000 5000 6000 7000 8000 1.2 +-------------------------------------------------+ | + + + + + + | | | | | | | | A| | A| | A | | A | | A | | A | | A | | A | 1 |-+ AA +-| | A | | A | | A | | A | | A | | AA | | | | AA | | A | | A | | A | 0.8 |-+ A +-| | A | | AA | | | | A | | AA | | A | | A | | AA | | A | | A | | A | | A | 0.6 |-+ A +-| | A | | A | | A | | A | | A | | A | | AA | | A | | A | | A | | A | | A | 0.4 |-+ A +-| | A | | A | | A | | A | | A | | A | | A | | A | | A | | AA | | | 0.2 |-+ A +-| | A | | A | |A | | | | | | | | | | | | | | | | | | + + + + + + | 0 +-------------------------------------------------+ 1e+07 2e+07 3e+07 4e+07 5e+07 6e+07 7e+07 8e+07

####

use strict; use warnings; use Benchmark qw/ timeit :hireswallclock /; use Chart::Gnuplot; use File::Spec::Functions 'catfile'; use File::Basename 'dirname'; my $GP = catfile dirname( $^X ), '../../c/bin/gnuplot.exe'; die unless -x $GP; # assume Strawberry Perl use constant REPEATS => 1; STDOUT-> autoflush( 1 ); { my ( @x, @y, @datasets ); for my $x ( map 1e6 * $_, 10 .. 80 ) { print "$x "; my $s = 'a' x $x; my $t = timeit REPEATS, sub { $s =~ /(.*)(*F)/ }; my $y = $t-> [ 1 ] / $t-> [ -1 ]; push @x, $x; push @y, $y; } print "\n\n"; push @datasets, Chart::Gnuplot::DataSet-> new( xdata => \@x, ydata => \@y, style => 'points', ); my $chart = Chart::Gnuplot-> new( gnuplot => $GP, terminal => 'dumb size 60, 80', ); $chart-> plot2d( @datasets ); } { my ( @x, @y, @datasets ); for my $x ( map 1e2 * $_, 10 .. 80 ) { print "$x "; my $s = 'a' x $x; my $t = timeit REPEATS, sub { $s =~ /(.(?1)(*F))/ }; my $y = $t-> [ 1 ] / $t-> [ -1 ]; push @x, $x; push @y, $y; } print "\n\n"; push @datasets, Chart::Gnuplot::DataSet-> new( xdata => \@x, ydata => \@y, style => 'points', ); my $chart = Chart::Gnuplot-> new( gnuplot => $GP, terminal => 'dumb size 60, 80', ); $chart-> plot2d( @datasets ); }
  • Comment on Re: Recursive sub-pattern spectacularly slow, what's wrong? (me, or just this use case, or a bug?)
  • Select or Download Code

Replies are listed 'Best First'.
Re^2: Recursive sub-pattern spectacularly slow, what's wrong? (me, or just this use case, or a bug?)
by LanX (Saint) on Sep 26, 2025 at 14:39 UTC
    I think these 2 patterns

    /(.(?1)(*F))/ /(.*)(*F)/

    should backtrack about the same amount.

    Nope, seems like the first has quadratic and the second one linear complexity.
    use v5.12; use warnings; say my $az = join "","a".."z"; $az x= 100; my $c; my $show=0; sub COUNT { $c++; say "Step $c matches '",$2 // "","'" if $show; } $show=0; for my $l (1..24) { say "=== Length Str: ",$l; my $str = substr $az,0,$l; $c = 0; $str =~ /( (.) (?{COUNT}) (?1) (*FAIL))/x; say "$l : $c = ", $l*($l+1)/2; $c=0; $str =~ /( (?: (.) (?{COUNT}) )* (*FAIL) )/x; say "$l : $c = ", 2 *$l-1; }
    output
    # perl /home/lanx/perl/pm/recursive_regex.pl abcdefghijklmnopqrstuvwxyz === Length Str: 1 1 : 0 = 1 1 : 1 = 1 === Length Str: 2 2 : 3 = 3 2 : 3 = 3 === Length Str: 3 3 : 6 = 6 3 : 5 = 5 === Length Str: 4 4 : 10 = 10 4 : 7 = 7 === Length Str: 5 5 : 15 = 15 5 : 9 = 9 === Length Str: 6 6 : 21 = 21 6 : 11 = 11 === Length Str: 7 7 : 28 = 28 7 : 13 = 13 === Length Str: 8 8 : 36 = 36 8 : 15 = 15 === Length Str: 9 9 : 45 = 45 9 : 17 = 17 === Length Str: 10 10 : 55 = 55 10 : 19 = 19 === Length Str: 11 11 : 66 = 66 11 : 21 = 21 === Length Str: 12 12 : 78 = 78 12 : 23 = 23 === Length Str: 13 13 : 91 = 91 13 : 25 = 25 === Length Str: 14 14 : 105 = 105 14 : 27 = 27 === Length Str: 15 15 : 120 = 120 15 : 29 = 29 === Length Str: 16 16 : 136 = 136 16 : 31 = 31 === Length Str: 17 17 : 153 = 153 17 : 33 = 33 === Length Str: 18 18 : 171 = 171 18 : 35 = 35 === Length Str: 19 19 : 190 = 190 19 : 37 = 37 === Length Str: 20 20 : 210 = 210 20 : 39 = 39 === Length Str: 21 21 : 231 = 231 21 : 41 = 41 === Length Str: 22 22 : 253 = 253 22 : 43 = 43 === Length Str: 23 23 : 276 = 276 23 : 45 = 45 === Length Str: 24 24 : 300 = 300 24 : 47 = 47
    Reason: Perl regex are highly pre-optimized by using heuristics.

    HTH

    - LanX (Log In Timed out)

      > Reason: Perl regex are highly pre-optimized by using heuristics.

      Actually it's a post-optimization.

      After the first full backtrack (from "a" to end/FAIL) the 2nd regex reports

      • "Detected a super-linear match, switching on caching..."

      and runs from "b" to FAIL.

      After that the cached results are later used to see that no match is possible and backtracking is stopped. That's why the formula is n+(n-1)

      So it's not that the 1st regex with the recursion is very bad¹, it's just that the second is very clever.

      Hint: use re "debug" to see the details.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

      ¹) there still might be a memory leak somewhere, but who is using recursion to this extremes anyway?

        I was curious what a "detected a super-linear match" means and asked Duck-AI. Explanation below. (LLMs are good in condensing knowledge as long as you review the output like I did. see also https://stackoverflow.com/questions/12841970/how-can-i-recognize-an-evil-regex)

        Super-linear Match A super-linear match in the context of regular expressions refers to a situation where the time complexity of matching a pattern against a string exceeds linear time, often approaching quadratic or worse. This typically occurs due to certain patterns in the regular expression that can lead to excessive backtracking.

        Characteristics of Super-Linear Matches

        • Backtracking: When a regex engine encounters ambiguous patterns, it may try multiple paths to find a match. This can lead to exponential time complexity in the worst case, especially with nested quantifiers (e.g., .* followed by another .*).
        • Nested Quantifiers: Patterns like (a|b)*c can cause super-linear behavior when matched against strings that contain many as and bs, as the engine may explore all combinations of a and b before concluding that there is no match.
        • Long Input Strings: The longer the input string, the more pronounced the super-linear behavior can become, especially if the regex is poorly constructed.

        Example

        Consider the regex pattern ^(a|b)*c$ and the input string aaaaaaab. The regex engine may try to match the string by considering all possible combinations of a and b, leading to a significant increase in processing time.

        Conclusion

        To avoid super-linear matches, it's essential to design regular expressions carefully, minimizing the use of nested quantifiers and ambiguous patterns. This can help ensure that the matching process remains efficient and operates within linear time complexity.

        Basically its trying to detect that the complexity is more ("super") than linear and turns on caching.

        Since the situation from starting from "c" without prior match is the same like with a prior "b" or "ab" the cached result of a FAIL can be used to shortcut the search when starting from "c", "d" and so on.

        Of course you could also apply the same mechanism with recursive regexes, but as I said the use cases are rare.

        Cheers

        - LanX (LogIn timing out again)

Re^2: Recursive sub-pattern spectacularly slow, what's wrong? (me, or just this use case, or a bug?)
by LanX (Saint) on Sep 24, 2025 at 23:09 UTC
    I could think of various reasons:
    • recursive patterns are not optimized for going deep
    • but rather meant for parsing structures like code or data with at best hundreds of nesting levels
    • like with subs every recursion means storing a certain amount of data on the stack
    • branching doesn't happen that often when parsing a grammar, bc there is only a small amount of opening "brackets" to follow deeper
    • "normal" regexes are optimized by heuristics to take shortcuts, i.e. .* is not naively implemented going step by step
    My suggestion would be that you use re 'debug' to see whats happening.

    Especially I'd try to make sure that it's really the same amount of backtracking in both cases

    update

    FWIW: my OS is killing the process when I attempt to have 1e7 recursions.

    DB<3> ("a"x 1e6) =~ / (. (?: (?1) | ) ) /x; say length $1 1000000 DB<4> ("a"x 1e7) =~ /(.*)/; say length $1 10000000 DB<5> ("a"x 1e7) =~ / (. (?: (?1) | ) ) /x; say length $1 Killed

    My guess: memory problems on the stack.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery