cxreg has asked for the wisdom of the Perl Monks concerning the following question:

I encountered a strange regexp which was causing my mod_perl program to thrash, and have created a minimal case which causes it to do so:
my $x = 1; while(1) { my $text = "a" x $x; $text =~ s/(\w+){$x}//; print "x = $x ; OK\n"; $x++; }
As you can see, it starts to get really slow around 20 and gets exponentially worse from there. Hast thou any idea wtf is going on?

Replies are listed 'Best First'.
Re: regexp hang?
by Aristotle (Chancellor) on Aug 22, 2002 at 05:18 UTC

    It's backtracking like crazy. The \w+ gobbles up as many characters as it can, but then the {$x} quantifier forces it to backtrack one character in order to form two (\w+) sequences. That does not yield enough sequences to satisfy the {$x}, so it backtracks another character. The newly freed character is then gobbled up by the second group. We still don't have {$x} groups. Now the second group backtracks and concedes a character to allow for the matching of a third group. That's still not $x groups. So we backtrack.. and so on.. and so forth..

    Make that \w{$x}, which is what you obviously wanted, and you'll see it works fine.

    Makeshifts last the longest.

      actually, thats not what I wanted. The original regexp was like this:
      $intro =~ s/((\w+\W*){200})(.*)/$1/si;
      Which is "keep first 200 words and discard the rest", and I was assigning $3 to another variable. Any idea why it thrashes?

        No, that doesn't match "200 words"; you'd want \W+ not \W* for that. If you had at least 200 words then it will match rather quickly. Otherwise it can take a very long time backtracking and having \W* match zero-length bits in the middles of words trying to work back until it can match 200 partial words.

                - tye (but my friends call me "Tye")
        This is odd. I have no immediate idea; you avoided what the camel book demonstrates as the a*a* pitfall, I think, since you ask for delimiting \W characters. I'm not sure, but I think forcing the \W to match using +, and not accepting zero matches using *, would fix things. I also propose you anchor the pattern. You can also catch a free speed bonus by not capturing the inner brackets (note that this puts the rest into $2 rather than $3). s/^((?:\w+\W+){200})(.*)/$1/si Update: right, tye++ confirms my intuition.

        Makeshifts last the longest.

        tye has already told you why you could see slowness caused by backtracking. Also, I would write "seperate $string in two parts: the first 200 words ($intro) and the rest ($rest)" like this:

        ($rest = $string) =~ s/\A((?:\w+\W+){200})//s and $intro = $1;

        (Assuming your string starts with a word character and you don't mind the extra non-word character(s) after the 200th word.)

        — Arien

        Edit: If you don't care about changing the value of $string you could obviously leave out the copy to $rest.

      Ah, hrm. I do see what you mean, though this is surprisingly the first time I've ever encountered anything like this in 5 years of happy Perling. I'll have to rethink my approach. Thanks for your reply!