in reply to regexp hang?

It's backtracking like crazy. The \w+ gobbles up as many characters as it can, but then the {$x} quantifier forces it to backtrack one character in order to form two (\w+) sequences. That does not yield enough sequences to satisfy the {$x}, so it backtracks another character. The newly freed character is then gobbled up by the second group. We still don't have {$x} groups. Now the second group backtracks and concedes a character to allow for the matching of a third group. That's still not $x groups. So we backtrack.. and so on.. and so forth..

Make that \w{$x}, which is what you obviously wanted, and you'll see it works fine.

Makeshifts last the longest.

Replies are listed 'Best First'.
Re: Re: regexp hang?
by cxreg (Novice) on Aug 22, 2002 at 05:22 UTC
    actually, thats not what I wanted. The original regexp was like this:
    $intro =~ s/((\w+\W*){200})(.*)/$1/si;
    Which is "keep first 200 words and discard the rest", and I was assigning $3 to another variable. Any idea why it thrashes?

      No, that doesn't match "200 words"; you'd want \W+ not \W* for that. If you had at least 200 words then it will match rather quickly. Otherwise it can take a very long time backtracking and having \W* match zero-length bits in the middles of words trying to work back until it can match 200 partial words.

              - tye (but my friends call me "Tye")
      This is odd. I have no immediate idea; you avoided what the camel book demonstrates as the a*a* pitfall, I think, since you ask for delimiting \W characters. I'm not sure, but I think forcing the \W to match using +, and not accepting zero matches using *, would fix things. I also propose you anchor the pattern. You can also catch a free speed bonus by not capturing the inner brackets (note that this puts the rest into $2 rather than $3). s/^((?:\w+\W+){200})(.*)/$1/si Update: right, tye++ confirms my intuition.

      Makeshifts last the longest.

      tye has already told you why you could see slowness caused by backtracking. Also, I would write "seperate $string in two parts: the first 200 words ($intro) and the rest ($rest)" like this:

      ($rest = $string) =~ s/\A((?:\w+\W+){200})//s and $intro = $1;

      (Assuming your string starts with a word character and you don't mind the extra non-word character(s) after the 200th word.)

      — Arien

      Edit: If you don't care about changing the value of $string you could obviously leave out the copy to $rest.

Re: Re: regexp hang?
by cxreg (Novice) on Aug 22, 2002 at 05:27 UTC
    Ah, hrm. I do see what you mean, though this is surprisingly the first time I've ever encountered anything like this in 5 years of happy Perling. I'll have to rethink my approach. Thanks for your reply!