Re: Is there a hard limit on + in a regex?
by Roy Johnson (Monsignor) on Jul 09, 2004 at 18:46 UTC
|
Items governed by * (and *?) are optional not only once,
but repeatedly forever (well, to be pedantic, Perl currently
has an internal limit of 32K repeats for parenthetical items).
We're not really tightening our belts, it just feels that way because we're getting fatter.
| [reply] [d/l] |
|
|
Alright, that sucks but I guess there's no changing it. So why did my command-line attempt to verify this fail?
-sam
| [reply] |
|
|
It depends which of the nodes CURLYX, WHILEM, CURLYM, CURLYN, CURLY, STAR, PLUS were used by the regular expression compiler. This is a measure of how "naughty" you expression is so using anything besides (?: exact string )+ is edging into a realm where + has a limit.
| [reply] |
|
|
Uh, just pointing out that the article you're referring to is talking about perl 5.002. The author even talks about running it on his 175 MHz DEC Alpha and having the code take 4 hours.
| [reply] |
|
|
| [reply] |
Re: Is there a hard limit on + in a regex?
by tinita (Parson) on Jul 09, 2004 at 18:39 UTC
|
what i get with 5.8.3 on linux:
Complex regular subexpression recursion limit (32766) exceeded at regex_limit.pl line 5.
are you using two different versions here maybe?
update:
with your exact command line example i'm getting "ok" as well, sorry... | [reply] [d/l] |
|
|
No, both the XML::Validator::Schema failure and the command-line success are happening with the same Perl, a custom-built copy of 5.6.1 on i686-linux.
-sam
| [reply] |
|
|
are you really sure that is the exact regex? by making the regex just one little bit
more complex i can get the error i stated above. e.g. (?:<object+>)+ instead
of (?:<object>)+
| [reply] [d/l] [select] |
|
|
|
|
Re: Is there a hard limit on + in a regex?
by BrowserUk (Patriarch) on Jul 09, 2004 at 19:35 UTC
|
You said that the real regex is more complex. Does it include an explicit quantifier?
+ seems happy to match 160,000 repetitions on either 5.6.1 or 5.8.4, but it doesn't like explicit quantifier greater than 32766.
P:\test>perl5.6.1 -Mstrict -wle
"my $s= '<object>' x 160_000; print 'Ok' if $s =~ m[^(?:<object>)+$]"
Ok
P:\test>perl5.8.4 -Mstrict -wle
"my $s= '<object>' x 160_000; print 'Ok' if $s =~ m[^(?:<object>)+$]"
Ok
P:\test>perl5.8.4 -Mstrict -wle
"my $s= '<object>' x 160_000; print 'Ok' if $s =~ m[^(?:<object>){1600
+00}$]"
Quantifier in {,} bigger than 32766 in regex;
marked by <-- HERE in m/^(?:<object>){ <-- HERE 160000}$/ at -e line 1
+.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
| [reply] [d/l] |
|
|
Your first two examples set the quantifier as {1,REG_INFTY} because they were simple enough to allow it (exact matches, no capturing, etc). You cannot explicitly name quantifiers higher than REG_INFTY and now I quote some comments from the source. The default size for REG_INFTY is I16_MAX, which is the same as SHORT_MAX (see perl.h). Unfortunately I16 isn't necessarily 16 bits (see handy.h). On the Cray C90, sizeof(short)==4 and hence I16_MAX is ((1<<31)-1), while on the Cray T90, sizeof(short)==8 and I16_MAX is ((1<<63)-1). To limit stack growth to reasonable sizes, supply a smaller default. --Andy Dougherty 11 June 1998 Perhaps you can change this default locally?
| [reply] [d/l] |
|
|
Dunno. I grepped for I16_MAX and found it is set to INT16_MAX. So I grepped for that, and couldn't find where that is defined?
Anyhow, it appears that what I am seeing on Win32 is of little interest.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
| [reply] |
|
|
| [reply] |
Re: Is there a hard limit on + in a regex?
by Joost (Canon) on Jul 09, 2004 at 19:31 UTC
|
#!/usr/bin/perl
my $s = "<object>" x 1600_000; #made higher to be sure
die("Failed validation!") unless $s =~ /^(?:<object>)+$/;
print "ok\n";
And your command line example run fine (well, both print "ok") on perl, v5.8.3 built for i386-linux-thread-multi - that's the debian "testing" package, and perl, v5.8.4 built for i386-linux-thread-multi (that's debian "unstable")
| [reply] [d/l] [select] |
|
|
Same for me. It's the equivalent construct inside XML::Validator::Schema which isn't working past 32,768 iterations.
-sam
| [reply] |
|
|
Can you give a bit of working code (edit: I mean, code that give that error) that demonstrates the problem (using XML::Validator::Schema if you have to)?
As noted here, there are limits, but maybe you can work around it.
Also, these paragraphs in http://search.cpan.org/src/NWCLARK/perl-5.8.4/hints/machten.sh
claim that you can compile perl with higher limits on regex repeats, irrespective of the architecture defaults:
# Set reg_infty -- the maximum allowable number of repeats in regular
# expressions such as /a{1,$max_repeats}/, and the maximum number of
# times /a*/ will match. Setting this too high without having a stack
# large enough to accommodate deep recursion in the regular expression
# engine allows perl to crash your Mac due to stack overrun if it
# encounters a pathological regular expression. The default is a
# compromise between capability and required stack size (see below).
# You may override the default value from the Configure command-line
# like this:
#
# Configure -Dreg_infty=16368 ...
reg_infty=${reg_infty:-2047}
# If you want to have many perl processes active simultaneously --
# processing CGI forms -- for example, you should opt for a small stac
+k.
# For safety, you should set reg_infty no larger than the correspondin
+g
# value given in this table:
#
# Stack size reg_infty value supported
# ---------- -------------------------
# 128k 2**8-1 (256)
# 256k 2**9-1 (511)
# 512k 2**10-1 (1023)
# 1M 2**11-1 (2047)
# ...
# 16M 2**15-1 (32767) (perl's default value)
# This script selects a safe stack size based on the value of reg_inft
+y
# specified above. However, you may choose to take a risk and set
# stack size lower: pathological regular expressions are rare in real-
+world
# programs. But be aware that, if perl does encounter one, it WILL
# crash your system. Do not set stack size lower than 96k unless
# you want perl's installation tests ( make test ) to crash your syste
+m.
#
# You may override the default value from the Configure command-line
# by specifying the required size in kilobytes like this:
#
# Configure -Dstack_size=96
| [reply] [d/l] |
|
|
|
|
Re: Is there a hard limit on + in a regex?
by hv (Prior) on Jul 10, 2004 at 13:16 UTC
|
Iteration of complex patterns is achieved through recursion. The recursion limit is REG_INFTY, typically set to 32K when compiling but sometimes set lower on platforms with small default stack sizes.
The intention is that this restriction will be removed in a future version of perl, by replacing the use of the C stack with an independent resizable stack. Then you'll be able to iterate as far as you have memory available.
(By way of example, the limit is currently too high on linux, as evidenced by core dumps:
zen% perl -wle '$_="abc" x 33000; print "ok" if /^(a|bc)+$/'
Segmentation fault (core dumped)
zen%
.)
Hugo | [reply] [d/l] |