in reply to Perl regex limitations

Yeah, at least some of the time, * becomes {0,32767}.

use re 'debug'; my $braces = qr/(?<braces>\{ ([^\{\}] | (?&braces))*? \} )/x; Compiling REx "(?<braces>\{ ([^\{\}] | (?&braces))*? \} )" Final program: 1: OPEN1 'braces' (3) 3: EXACT <{> (5) 5: MINMOD (6) 6: CURLYX[0] {0,32767} (29) 8: OPEN2 (10) 10: BRANCH (22) 11: ANYOF[\0-z|~-\377][{unicode_all}] (26) 22: BRANCH (FAIL) 23: GOSUB1[-22] (26) 26: CLOSE2 (28) 28: WHILEM[2/1] (0) 29: NOTHING (30) 30: EXACT <}> (32) 32: CLOSE1 'braces' (34) 34: END (0) anchored "{" at 0 floating "}" at 1..2147483647 (checking floating) mi +nlen 2 Freeing REx: "(?<braces>\{ ([^\{\}] | (?&braces))*? \} )"

- tye        

Replies are listed 'Best First'.
Re^2: Perl regex limitations (*)
by tye (Sage) on Aug 06, 2014 at 04:13 UTC

    A single character can make a huge difference in how large of a match your regex is able to make (it likely also makes your regex more efficient):

    use re 'debug'; my $braces = qr/(?<braces>\{ ([^\{\}]* | (?&braces))*? \} )/x; # ^ Compiling REx "(?<braces>\{ ([^\{\}]* | (?&braces))*? \} )" Final program: 1: OPEN1 'braces' (3) 3: EXACT <{> (5) 5: MINMOD (6) 6: CURLYX[0] {0,32767} (30) 8: OPEN2 (10) 10: BRANCH (23) 11: STAR (27) 12: ANYOF[\0-z|~-\377][{unicode_all}] (0) 23: BRANCH (FAIL) 24: GOSUB1[-23] (27) 27: CLOSE2 (29) 29: WHILEM[2/1] (0) 30: NOTHING (31) 31: EXACT <}> (33) 33: CLOSE1 'braces' (35) 35: END (0) anchored "{" at 0 floating "}" at 1..2147483647 (checking floating) mi +nlen 2 Freeing REx: "(?<braces>\{ ([^\{\}]* | (?&braces))*? \} )"

    You still have an instance of {0,32767}, but it means you can match 32K instances of braces that are separated by non-brace runs where each run can be huge (note how the new * was translated to STAR w/o 32K mentioned and not to CURLYX).

    Your prior regex would have to recurse into the 'braces' construct for every single new character matched. Since it would only do that upto 32K times, you could only match about 32KB of text.

    Update: I should have added a + not a *.

    - tye        

Re^2: Perl regex limitations (32k)
by oiskuu (Hermit) on Aug 06, 2014 at 12:21 UTC

    Warnings enabled, the 32k problem should have manifested itself:

    Complex regular subexpression recursion limit (32766) exceeded
    To work around this problem (a warning remains), one may double the repeat quantifiers, e.g.:
    my $braces = qr/(?<braces> { (?: (?: [^{}] | (?&braces) )*+ )* } )/x;

    perlre has the following example on handling nested parens:

    my $parens = qr/(\((?:[^()]++|(?-1))*+\))/; if (/foo $parens \s+ \+ \s+ bar $parens/x) { # do something here... }

      Thanks for all the replies! I tried adding the following test to see if I can get it working (as I said, I have now rewritten it to do the parsing by hand, but I would like to figure it out for next time):
      $braces = qr/(?<braces>\{ ([^\{\}]++ | (?&braces))*+ \} )/x; if ($h =~ s/class \s+ (?<class_name> \w+) \s* (\: (\s* \w*)? \s* (?<an +cestor> \w+))? \s* $braces//x) { print "hi\n"; }
      The match is never made however... The code I'm matching against is a C++ header file, and I'm looking for class definitions, formatted like this :
      class MyClassName : public MyAncestorClass { ... ... };
      This expression works fine for smaller classes or for this same problem class if there are no braces inside the class definition. In this class however, I have the following declarations (C++ Builder code...) :
      __property TNotifyEvent OnMonaDocSaveToDB = {read=FOnMonaDocSaveToDB +, write=FOnMonaDocSaveToDB};
      If I remove that property declaration, it matches, if I leave it, it fails... Jonathan
      I just checked, there is no warning when I execute my script using "perl -w"...

        On the off chance that you've stumbled upon an undiscovered bug, the considerate route to take is to provide a testcase. This may involve some tedious work reducing a self-contained test unit to a minimum. That, of course, after you've verified the behavior with the latest perl.