samtregar has asked for the wisdom of the Perl Monks concerning the following question:

Shalom, Monks. I've searched for an answer to this question, but my searches have come up empty. I've found a bug in my XML::Validator::Schema module and I think it's due to a limit in Perl's regular expression engine, but my attempts to confirm this have failed. XML::Validator::Schema uses regexes to validate XML Schema content models. I'm getting a failure when validating a file containing around 160,000 elements. The failure is happening on the 32,768th element, which is certainly a suspicious number indeed!

Inside XML::Validator::Schema, something like this is happening:

my $s = join('', (("<object>") x 160_000)) die("Failed validation!") unless $s =~ /^(?:<object>)+$/;

I say "something like this" because it's a lot more complicated. The regex is dynamicly built and compliled with qr// and the string is built up from the incoming XML stream.

When I try to verify the limit by hand I can't get the same results:

$ perl -e '$s = join("", (("<object>") x 160_000)); print "ok\n" if + $s =~ /^(?:<object>)+$/' ok

What am I missing? Why does XML::Validator::Schema suffer from a limit of 32,768 repetitions on + when the same thing works on the command line?

Thanks,
-sam

PS: I'm working with Perl 5.6.1, if that makes a difference.

Replies are listed 'Best First'.
Re: Is there a hard limit on + in a regex?
by Roy Johnson (Monsignor) on Jul 09, 2004 at 18:46 UTC
    Understanding Regular Expressions says
    Items governed by * (and *?) are optional not only once, but repeatedly forever (well, to be pedantic, Perl currently has an internal limit of 32K repeats for parenthetical items).

    We're not really tightening our belts, it just feels that way because we're getting fatter.
      Alright, that sucks but I guess there's no changing it. So why did my command-line attempt to verify this fail?

      -sam

        It depends which of the nodes CURLYX, WHILEM, CURLYM, CURLYN, CURLY, STAR, PLUS were used by the regular expression compiler. This is a measure of how "naughty" you expression is so using anything besides (?: exact string )+ is edging into a realm where + has a limit.
      Uh, just pointing out that the article you're referring to is talking about perl 5.002. The author even talks about running it on his 175 MHz DEC Alpha and having the code take 4 hours.
        True, but I haven't found any perldelta that says it has changed.

        We're not really tightening our belts, it just feels that way because we're getting fatter.
Re: Is there a hard limit on + in a regex?
by tinita (Parson) on Jul 09, 2004 at 18:39 UTC
    what i get with 5.8.3 on linux:
    Complex regular subexpression recursion limit (32766) exceeded at regex_limit.pl line 5.
    are you using two different versions here maybe?
    update:
    with your exact command line example i'm getting "ok" as well, sorry...
      No, both the XML::Validator::Schema failure and the command-line success are happening with the same Perl, a custom-built copy of 5.6.1 on i686-linux.

      -sam

        are you really sure that is the exact regex? by making the regex just one little bit more complex i can get the error i stated above. e.g. (?:<object+>)+ instead of (?:<object>)+
Re: Is there a hard limit on + in a regex?
by BrowserUk (Patriarch) on Jul 09, 2004 at 19:35 UTC

    You said that the real regex is more complex. Does it include an explicit quantifier?

    + seems happy to match 160,000 repetitions on either 5.6.1 or 5.8.4, but it doesn't like explicit quantifier greater than 32766.

    P:\test>perl5.6.1 -Mstrict -wle "my $s= '<object>' x 160_000; print 'Ok' if $s =~ m[^(?:<object>)+$]" Ok P:\test>perl5.8.4 -Mstrict -wle "my $s= '<object>' x 160_000; print 'Ok' if $s =~ m[^(?:<object>)+$]" Ok P:\test>perl5.8.4 -Mstrict -wle "my $s= '<object>' x 160_000; print 'Ok' if $s =~ m[^(?:<object>){1600 +00}$]" Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/^(?:<object>){ <-- HERE 160000}$/ at -e line 1 +.

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon

      Your first two examples set the quantifier as {1,REG_INFTY} because they were simple enough to allow it (exact matches, no capturing, etc). You cannot explicitly name quantifiers higher than REG_INFTY and now I quote some comments from the source.

      The default size for REG_INFTY is I16_MAX, which is the same as SHORT_MAX (see perl.h). Unfortunately I16 isn't necessarily 16 bits (see handy.h). On the Cray C90, sizeof(short)==4 and hence I16_MAX is ((1<<31)-1), while on the Cray T90, sizeof(short)==8 and I16_MAX is ((1<<63)-1). To limit stack growth to reasonable sizes, supply a smaller default. --Andy Dougherty 11 June 1998

      Perhaps you can change this default locally?

        Dunno. I grepped for I16_MAX and found it is set to INT16_MAX. So I grepped for that, and couldn't find where that is defined?

        Anyhow, it appears that what I am seeing on Win32 is of little interest.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
      No, it doesn't.

      -sam

Re: Is there a hard limit on + in a regex?
by Joost (Canon) on Jul 09, 2004 at 19:31 UTC
    Hmmm, both this code:
    #!/usr/bin/perl my $s = "<object>" x 1600_000; #made higher to be sure die("Failed validation!") unless $s =~ /^(?:<object>)+$/; print "ok\n";
    And your command line example run fine (well, both print "ok") on perl, v5.8.3 built for i386-linux-thread-multi - that's the debian "testing" package, and perl, v5.8.4 built for i386-linux-thread-multi (that's debian "unstable")

      Same for me. It's the equivalent construct inside XML::Validator::Schema which isn't working past 32,768 iterations.

      -sam

        Can you give a bit of working code (edit: I mean, code that give that error) that demonstrates the problem (using XML::Validator::Schema if you have to)?

        As noted here, there are limits, but maybe you can work around it.

        Also, these paragraphs in http://search.cpan.org/src/NWCLARK/perl-5.8.4/hints/machten.sh claim that you can compile perl with higher limits on regex repeats, irrespective of the architecture defaults:

        # Set reg_infty -- the maximum allowable number of repeats in regular # expressions such as /a{1,$max_repeats}/, and the maximum number of # times /a*/ will match. Setting this too high without having a stack # large enough to accommodate deep recursion in the regular expression # engine allows perl to crash your Mac due to stack overrun if it # encounters a pathological regular expression. The default is a # compromise between capability and required stack size (see below). # You may override the default value from the Configure command-line # like this: # # Configure -Dreg_infty=16368 ... reg_infty=${reg_infty:-2047} # If you want to have many perl processes active simultaneously -- # processing CGI forms -- for example, you should opt for a small stac +k. # For safety, you should set reg_infty no larger than the correspondin +g # value given in this table: # # Stack size reg_infty value supported # ---------- ------------------------- # 128k 2**8-1 (256) # 256k 2**9-1 (511) # 512k 2**10-1 (1023) # 1M 2**11-1 (2047) # ... # 16M 2**15-1 (32767) (perl's default value) # This script selects a safe stack size based on the value of reg_inft +y # specified above. However, you may choose to take a risk and set # stack size lower: pathological regular expressions are rare in real- +world # programs. But be aware that, if perl does encounter one, it WILL # crash your system. Do not set stack size lower than 96k unless # you want perl's installation tests ( make test ) to crash your syste +m. # # You may override the default value from the Configure command-line # by specifying the required size in kilobytes like this: # # Configure -Dstack_size=96
Re: Is there a hard limit on + in a regex?
by hv (Prior) on Jul 10, 2004 at 13:16 UTC

    Iteration of complex patterns is achieved through recursion. The recursion limit is REG_INFTY, typically set to 32K when compiling but sometimes set lower on platforms with small default stack sizes.

    The intention is that this restriction will be removed in a future version of perl, by replacing the use of the C stack with an independent resizable stack. Then you'll be able to iterate as far as you have memory available.

    (By way of example, the limit is currently too high on linux, as evidenced by core dumps:

    zen% perl -wle '$_="abc" x 33000; print "ok" if /^(a|bc)+$/' Segmentation fault (core dumped) zen%
    .)

    Hugo