
Actually, that's not what I wanted. The original regexp was like this:
$intro =~ s/((\w+\W*){200})(.*)/$1/si;
Which is "keep first 200 words and discard the rest", and I was assigning $3 to another variable. Any idea why it thrashes?

(tye)Re: regexp hang?
by tye (Sage) on Aug 22, 2002 at 05:54 UTC

    No, that doesn't match "200 words"; you'd want \W+ not \W* for that. If you had at least 200 words then it will match rather quickly. Otherwise it can take a very long time backtracking and having \W* match zero-length bits in the middles of words trying to work back until it can match 200 partial words.
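The partial-word behaviour is easier to see with a small count. The following is only a sketch: `first_words` is a hypothetical helper, and the count of 3 stands in for the original 200 so that the backtracking stays cheap.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the text matched as the
# first $n "words", where $sep is the separator pattern, '\W*' or '\W+'.
sub first_words {
    my ($text, $n, $sep) = @_;
    return $text =~ /^((?:\w+$sep){$n})/s ? $1 : undef;
}

my $text = "one two";    # only two real words

for my $sep ('\W*', '\W+') {
    my $got = first_words($text, 3, $sep);
    printf "%s: %s\n", $sep, defined $got ? "matched '$got'" : "no match";
}
# \W* matches 'one two' by splitting it into "one ", "tw" and "o";
# \W+ reports no match, because there are not three separated words.
```

With 200 repetitions and a long input, the number of ways to split words like this is what makes the `\W*` version so slow when there are not enough real words.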

            - tye (but my friends call me "Tye")
Re^3: regexp hang?
by Aristotle (Chancellor) on Aug 22, 2002 at 05:42 UTC
    This is odd. I have no immediate idea; you avoided what the Camel book demonstrates as the a*a* pitfall, I think, since you ask for delimiting \W characters. I'm not sure, but I think forcing the \W to match using +, and not accepting zero matches using *, would fix things. I also propose you anchor the pattern. You can also catch a free speed bonus by not capturing the inner brackets (note that this puts the rest into $2 rather than $3):

    s/^((?:\w+\W+){200})(.*)/$1/si

    Update: right, tye++ confirms my intuition.
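A quick way to check the group renumbering is to run both patterns in list context on a short string. This is a sketch with the count reduced to 3 and a made-up sample string; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "one two three four five";

# Original shape: the inner group captures too, so the remainder is group 3.
my @orig  = $text =~ /((\w+\W*){3})(.*)/s;

# Anchored, with \W+ and a non-capturing inner group: the remainder is group 2.
my @fixed = $text =~ /^((?:\w+\W+){3})(.*)/s;

print scalar(@orig),  " groups, rest = '$orig[2]'\n";    # 3 groups, rest = 'four five'
print scalar(@fixed), " groups, rest = '$fixed[1]'\n";   # 2 groups, rest = 'four five'
```

In the original pattern, group 2 only holds the last repetition of the inner group ("three " here), which is why capturing it costs time without being useful.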

    Makeshifts last the longest.

Re(3): regexp hang?
by Arien (Pilgrim) on Aug 22, 2002 at 06:02 UTC

    tye has already told you why you could see slowness caused by backtracking. Also, I would write "separate $string into two parts: the first 200 words ($intro) and the rest ($rest)" like this:

    ($rest = $string) =~ s/\A((?:\w+\W+){200})//s and $intro = $1;

    (Assuming your string starts with a word character and you don't mind the extra non-word character(s) after the 200th word.)
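Wrapped in a small sub, the same idiom can be tried out with a lower word count. This is a sketch under the assumptions stated above; `split_intro` and its `$n` parameter are hypothetical names, not part of the original one-liner.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the one-liner: returns ($intro, $rest).
# $intro stays undef when the string holds fewer than $n separated words.
sub split_intro {
    my ($string, $n) = @_;
    my ($intro, $rest);
    ($rest = $string) =~ s/\A((?:\w+\W+){$n})//s and $intro = $1;
    return ($intro, $rest);
}

my ($intro, $rest) = split_intro("one two three four five", 3);
print "intro: '$intro'\n";   # intro: 'one two three '
print "rest:  '$rest'\n";    # rest:  'four five'

# Too few words: the substitution fails, so $rest is the whole string.
my ($none, $all) = split_intro("just two", 3);
print defined $none ? "intro: '$none'\n" : "no intro, rest: '$all'\n";
```

Note that the trailing separator stays attached to the intro, as mentioned above; trimming it would need an extra step.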

    — Arien

    Edit: If you don't care about changing the value of $string you could obviously leave out the copy to $rest.