jonneve has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Are there limitations to the regex matching capabilities in Perl? I came across a case today where a string wouldn't match, and I couldn't figure out why. I finally tried removing part of the source file and it matched! So it seems as though the size of the source file and/or the size of the match is limited somehow. The match would have been about 50 kb. The expression was the following :
s/class \s+ (?<class_name> \w+) \s* (\: (\s* \w*)? \s* (?<ancestor> \w+))? \s* $braces //x
where $braces was defined as :
$braces = qr/(?<braces>\{ ([^\{\}] | (?&braces))*? \} )/x;
It matched correctly if I simplified it to the following:
s/class \s+ (?<class_name> \w+) \s* (\: (\s* \w*)? \s* (?<ancestor> \w+))? \s* \{ (?<class_body> [\s\S]+ ) \} \s*; //xm
That wasn't acceptable however, because by removing the recursion, I was no longer able to match the braces correctly and it therefore failed if there were two sets of braces following each other (it greedily matched right up to the last closing brace. So basically, my question is what the limitations are that caused this problem. I realize that I'm pushing things to the limit and that at this point, it's probably better, just from a performance point of view, to split things up into smaller expressions by doing a bit of basic parsing by hand first (that's what I ended up doing, and it works fine). I did waste a good 2 hours trying to figure out why the regex wasn't working though, so I'd kind of like to know for next time what the limitations are... Thanks in advance, Jonathan

Replies are listed 'Best First'.
Re: Perl regex limitations (32k)
by tye (Sage) on Aug 06, 2014 at 00:36 UTC

    Yeah, at least some of the time, * becomes {0,32767}.

    use re 'debug'; my $braces = qr/(?<braces>\{ ([^\{\}] | (?&braces))*? \} )/x; Compiling REx "(?<braces>\{ ([^\{\}] | (?&braces))*? \} )" Final program: 1: OPEN1 'braces' (3) 3: EXACT <{> (5) 5: MINMOD (6) 6: CURLYX[0] {0,32767} (29) 8: OPEN2 (10) 10: BRANCH (22) 11: ANYOF[\0-z|~-\377][{unicode_all}] (26) 22: BRANCH (FAIL) 23: GOSUB1[-22] (26) 26: CLOSE2 (28) 28: WHILEM[2/1] (0) 29: NOTHING (30) 30: EXACT <}> (32) 32: CLOSE1 'braces' (34) 34: END (0) anchored "{" at 0 floating "}" at 1..2147483647 (checking floating) mi +nlen 2 Freeing REx: "(?<braces>\{ ([^\{\}] | (?&braces))*? \} )"

    - tye        

      A single character can make a huge difference in how large of a match your regex is able to make (it likely also makes your regex more efficient):

      use re 'debug'; my $braces = qr/(?<braces>\{ ([^\{\}]* | (?&braces))*? \} )/x; # ^ Compiling REx "(?<braces>\{ ([^\{\}]* | (?&braces))*? \} )" Final program: 1: OPEN1 'braces' (3) 3: EXACT <{> (5) 5: MINMOD (6) 6: CURLYX[0] {0,32767} (30) 8: OPEN2 (10) 10: BRANCH (23) 11: STAR (27) 12: ANYOF[\0-z|~-\377][{unicode_all}] (0) 23: BRANCH (FAIL) 24: GOSUB1[-23] (27) 27: CLOSE2 (29) 29: WHILEM[2/1] (0) 30: NOTHING (31) 31: EXACT <}> (33) 33: CLOSE1 'braces' (35) 35: END (0) anchored "{" at 0 floating "}" at 1..2147483647 (checking floating) mi +nlen 2 Freeing REx: "(?<braces>\{ ([^\{\}]* | (?&braces))*? \} )"

      You still have an instance of {0,32767}, but it means you can match 32K instances of braces that are separated by non-brace runs where each run can be huge (note how the new * was translated to STAR w/o 32K mentioned and not to CURLYX).

      Your prior regex would have to recurse into the 'braces' construct for every single new character matched. Since it would only do that upto 32K times, you could only match about 32KB of text.

      Update: I should have added a + not a *.

      - tye        

      Warnings enabled, the 32k problem should have manifested itself:

      Complex regular subexpression recursion limit (32766) exceeded
      To work around this problem (a warning remains), one may double the repeat quantifiers, e.g.:
      my $braces = qr/(?<braces> { (?: (?: [^{}] | (?&braces) )*+ )* } )/x;

      perlre has the following example on handling nested parens:

      my $parens = qr/(\((?:[^()]++|(?-1))*+\))/; if (/foo $parens \s+ \+ \s+ bar $parens/x) { # do something here... }

        Thanks for all the replies! I tried adding the following test to see if I can get it working (as I said, I have now rewritten it to do the parsing by hand, but I would like to figure it out for next time):
        $braces = qr/(?<braces>\{ ([^\{\}]++ | (?&braces))*+ \} )/x; if ($h =~ s/class \s+ (?<class_name> \w+) \s* (\: (\s* \w*)? \s* (?<an +cestor> \w+))? \s* $braces//x) { print "hi\n"; }
        The match is never made however... The code I'm matching against is a C++ header file, and I'm looking for class definitions, formatted like this :
        class MyClassName : public MyAncestorClass { ... ... };
        This expression works fine for smaller classes or for this same problem class if there are no braces inside the class definition. In this class however, I have the following declarations (C++ Builder code...) :
        __property TNotifyEvent OnMonaDocSaveToDB = {read=FOnMonaDocSaveToDB +, write=FOnMonaDocSaveToDB};
        If I remove that property declaration, it matches, if I leave it, it fails... Jonathan
        I just checked, there is no warning when I execute my script using "perl -w"...
Re: Perl regex limitations
by Laurent_R (Canon) on Aug 05, 2014 at 20:54 UTC
    I do not know if there is any limit, but, in terms of size of the match, the following Perl one-liner has a regex that happily matches the full text (4.17 MB) of a French translation of the Bible:
    $ cat Bible.txt | perl -e 'local $/; $c = <>; $d = $1 if $c =~/(Gen.+ +tous!)/s; print length $d;' 4167072 $ wc Bible.txt 32324 714470 4167075 Bible.txt
Re: Perl regex limitations
by oiskuu (Hermit) on Aug 05, 2014 at 22:39 UTC

    With complicated regular expressions like that, limiting the amount of back-tracking becomes an important concern. This may be addressed with possessive quantifiers: (?>...), *+, etc. Consider the following version:

    use warnings; $braces = qr/(?<braces> { (?: [^{}] | (?&braces) )*+ } )/x; s/class \s+ (?<class_name> \w+) \s* (\: (\s* \w*)? \s* (?<ancestor> \w+))?+ \s* $braces //x;
    Note the possessive *+ and ?+. But this probably does not solve your problem. Are you sure the braces were properly balanced in your input? Could there have been some escaped or quoted brace(s) that failed to match up?

Re: Perl regex limitations (m/\G.../gc)
by tye (Sage) on Aug 06, 2014 at 16:03 UTC

    If I were trying to do this, I'd start by matching constructs that could contain braces that are not significant (quotes, comments). And I'd match piecemeal rather than trying to build one big, complex regex.

    # load source code into $_ while( m{['"{}]|//|/*|\bclass\s+}g ) { my $c = substr( $_, pos($_)-1, 1 ); if( "'" eq $c ) { m{\G(?:[^\\']+|\\.)*'}gc or fail( \$_, "Unclosed char" ); } elsif( '"' eq $c ) { m{\G(?:[^\\"]+|\\.)*"}gc or fail( \$_, "Unclosed string" ); } elsif( '/' eq $c ) { m{\G[^\n]*}gc; } elsif( '*' eq $c ) { m{*/}gc or fail( \$_, "Unclosed comment" ); } elsif( '{' eq $c ) { $nest++; } elsif( '}' eq $c ) { $nest--; } elsif( 's' eq $c ) { my $p = pos(); m{\G\w+}gc or fail( \$_, "'class' not followed by ID" ); my $name = substr( $_, $p, pos()-$p ); if( m[\G\s*:\s*(?:\w+\s+)?(\w+)\s+{]gc ) { my $ancestor = $1; $nest++; } elsif( ! m[\s*{}gc ) { fail( \$_, "No { after 'class $name'" ); } } }

    - tye        

      It may also be worth abandoning pure regular expressions in favor of a grammar-based parser. Parse::RecDescent comes to mind immediately, but there's also Regexp:Grammars if you'd like to stick to (spiced-up) regular expression after all.