in reply to Why does a Perl 5.6 regex run a lot slower on Perl 5.8?

I am the person to blame. I made the change in the regex engine that is causing the problem you're facing. Let me explain:
The conclusion is that all regular expressions written like this: $text =~ /(.*?)<whatever>/ take a thousand times longer on 5.8.0. The same expressions written as $text =~ /^(.*?)<whatever>/, which obviously mean the same thing (look for the first occurrence of <whatever> and save the text preceding it in the corresponding variables), have the same performance across these two versions.
Sadly, that is not true, and that is exactly what I had to change in the source of perl. You say that /(.*)X/ and /^(.*)X/ mean the same thing, but that is a half-truth. Consider this case: "xxyyyRyyy" =~ /(.*)R\1/. If, as you state, the leading ^ is implied, the regex fails, because "xxyyy" cannot be found after the "R" as my regex requires. Only by not anchoring that regex can it ever match ($1 is "yyy").
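
To make that case concrete, here is a minimal sketch that runs the pattern both ways. The variable names are mine, and I am assuming a perl with the corrected behaviour (5.8 or later); on such a perl the unanchored form should capture "yyy" and the anchored form should not match at all.

```perl
use strict;
use warnings;

my $text = "xxyyyRyyy";

# Unanchored: the engine is free to start the match at any offset,
# so it can begin at offset 2 and capture "yyy" on both sides of the "R".
my $unanchored = $text =~ /(.*)R\1/ ? $1 : undef;

# Anchored: the capture has to start at offset 0, so the only candidate
# that reaches the "R" is "xxyyy", which never reappears after it.
my $anchored = $text =~ /^(.*)R\1/ ? $1 : undef;

print "unanchored: ", (defined $unanchored ? "\$1 = '$unanchored'" : "no match"), "\n";
print "anchored:   ", (defined $anchored   ? "\$1 = '$anchored'"   : "no match"), "\n";
```

If the two lines print different results, the two patterns are not interchangeable, which is the whole point.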

There is no "easy" way to fix this problem in the source of perl; you have to explicitly state the anchor yourself. The reason is that perl has no way of knowing whether or not you'll end up using what you captured as a backreference, so anchoring has an unknown effect. The problem is not only when the .* is captured, either; any capturing in the regex causes a problem.

(The case of "abc\ndef1" =~ /.*\d/ is already handled by the engine so as not to fail. It would fail if the regex were treated as /^.*\d/, but the engine makes it (?m:^) if necessary.)
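
The newline case can be checked the same way. This is only a sketch with my own variable names; it compares the plain pattern, an explicit ^, and an explicit (?m:^), and says nothing about how the engine chooses among them internally.

```perl
use strict;
use warnings;

my $text = "abc\ndef1";

# Plain pattern: "." does not cross the newline, so the match
# has to come from the second line.
my $plain = $text =~ /.*\d/ ? $& : undef;

# Explicit ^ without /m: only offset 0 is tried, and "abc" has no digit.
my $anchored = $text =~ /^.*\d/ ? $& : undef;

# (?m:^) may also anchor just after the newline, so "def1" is found.
my $multiline = $text =~ /(?m:^).*\d/ ? $& : undef;

for my $result ([plain => $plain], [anchored => $anchored], [multiline => $multiline]) {
    my ($label, $matched) = @$result;
    print "$label: ", (defined $matched ? "'$matched'" : "no match"), "\n";
}
```

The plain and (?m:^) forms should agree on "def1", while the explicitly anchored form should fail.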

_____________________________________________________
Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

Re^2: The Deceiver
by perldeveloper (Scribe) on Aug 13, 2004 at 14:04 UTC
    Thank you for your quick answer, and thanks again for taking the time to explain. Although I agree that the two regular expressions have different meanings, the real question here is why Perl 5.6.1 is 500-1,000 times faster than Perl 5.8.0 on the same regular expression -- this is my real query. Am I to assume that Perl 5.6.1 did not properly parse certain regular expressions and Perl 5.8.0 now does? I just tried your regular expressions and they yielded the same results under both versions. How unstable is my previous code, if new versions can make it obsolete in performance? It is as if I were being encouraged not to upgrade.
      #reg.pl
      $s = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyRRRRyyyy\n" x 500;
      $n = 0;
      $n++ while ($s =~ /(.*?)RRRR\1/sg);
      print "$n matches\n";
      
      time ~/bin/perl5.8.0 reg.pl 
      500 matches
      
      real    0m4.836s
      user    0m4.800s
      sys     0m0.010s
      
      time ~/bin/perl5.6.1 reg.pl 
      0 matches
      
      real    0m0.020s
      user    0m0.020s
      sys     0m0.000s
      
      So, in fact, you are complaining that a bug got fixed. The problem is that these are extremely inefficient regular expressions because they involve a lot of backtracking. I recommend reading Mastering Regular Expressions for a detailed explanation.
        That's very good to hear. What's not good is that code that relied on nothing like backreferencing regexps got squashed in the upgrade process. Like japhy guessed, the regexp failed to match, but wouldn't it make sense even for a /(.*)TEXT\1/ regexp to first look for /TEXT/ and then worry about getting the appropriate group match (be it greedy or reluctant)? This slowing down is a terrible shock some people might get (including me) when moving old code to new code. But on another note, I do agree I'm a long way from mastering regular expressions.
      I'm not entirely sure why the regexes were so much slower, unless they just never could actually match. In that circumstance, /.*FAIL/ would be a lot slower than /^.*FAIL/.

        And since perldeveloper removes the "whatever"s from the string, the regexp will fail in the last iteration. So I would not be surprised if most of the wasted time was in the last iteration :-)

        Jenda
        Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
           -- Rick Osborne