flummoxer has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I can't figure out why perl dies with the error "Substitution loop at ..." once a string gets to be about 1 GB long; it works fine on smaller strings.
#!/usr/bin/perl -w
use strict;

my $string;
while ( <> ) {
    $string .= $_;
}
$string =~ s/\s+//g;
print length($string), "\n";
then make a file and boom
perl -e 'for ( 0 .. 100000000 ) { print "1234567890\n" }' >! /scratch/100m
./evince_oddness /scratch/100m
Substitution loop at ./evince_oddness line 10, <> line 100000001.
It works fine on strings of around 700 MB, which has me baffled, as I'm not a regex initiate, err, acolyte?

I do know this is an inefficient way to read the file; I'm actually using File::Slurp, but I wanted to simplify the problem.

Oh, and it happens with both 5.10.0 and 5.8.8, on 64-bit Linux and 32-bit Solaris, with vendor-compiled and self-compiled perls, so I don't think I just built a wonky perl.

Replies are listed 'Best First'.
Re: problem with long strings and regex
by ig (Vicar) on Apr 02, 2009 at 07:10 UTC

    An example of where the "Substitution loop" error is emitted from pp_hot.c:

    I32 maxiters;
    ...
    strend = s + len;
    slen = RX_MATCH_UTF8(rx)
                ? utf8_length((U8*)s, (U8*)strend)
                : len;
    maxiters = 2 * slen + 10;   /* We can match twice at each position, once
                                   with zero-length, second time with
                                   non-zero. */
    ...
    if (iters++ > maxiters)
        DIE(aTHX_ "Substitution loop");

    So, maxiters is a 32-bit signed integer, set to twice the length of the string plus a few (for a UTF-8 string the length is counted in characters rather than bytes).

    The maximum value of an I32 is +2,147,483,647, so when your string gets to a little over 1 GB in length, the maxiters variable overflows and you get nonsense.

      Wow, what a great answer! Knowing there's a limit is good; knowing where I might tinker with the Perl source is great.
Re: problem with long strings and regex
by codeacrobat (Chaplain) on Apr 02, 2009 at 06:45 UTC
    What is the point of doing it in one blow? Memory stress testing?
    use strict;

    my $length = 0;
    while (<>) {
        chomp;    # did you want to get rid of newlines? otherwise use tr///
        $length += length;
    }
    print $length, "\n";

      My datasets are constantly growing larger; I ran into this and wondered whether it was a problem in the regex engine or a limit on string length.
      It happened because I wanted to read in a file of newline-separated strings that represent one contiguous entity (DNA), and I thought the quickest way was:
      use File::Slurp qw( slurp );
      my $str = slurp('file');
      $str =~ s/\n//g;
      Maybe there's a faster way?
Re: problem with long strings and regex
by Anonymous Monk on Apr 02, 2009 at 06:38 UTC