Ah, thanks, that makes a big difference (though it's still slower than using index/rindex). Do you have any explanation for why? My understanding was that ^ meant pretty much the same thing as \n?.

dragonchild's code is even faster, in some cases by a factor of 3.
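For readers following along, here is a rough sketch of what an index/rindex approach to "print every line containing a literal string" might look like. The original benchmark code is not reproduced in this part of the thread, so the function name lines_containing and its details are assumptions for illustration only, and it assumes the search string contains no newline.

```perl
use strict;
use warnings;

# Hypothetical helper: return every line of $text that contains the
# literal string $pat (assumed non-empty and free of newlines).
sub lines_containing {
    my ($text, $pat) = @_;
    return if $pat eq '';    # guard: an empty pattern would never advance

    my @lines;
    my $pos = 0;
    while ((my $hit = index($text, $pat, $pos)) >= 0) {
        # rindex finds the newline before the hit (-1 on the first line),
        # so adding 1 gives the start of the enclosing line.
        my $start = rindex($text, "\n", $hit) + 1;

        # The line ends at the next newline, or at the end of the string.
        my $end = index($text, "\n", $hit);
        $end = length $text if $end < 0;

        push @lines, substr($text, $start, $end - $start);
        $pos = $end + 1;     # resume the search on the following line
    }
    return @lines;
}

print "$_\n" for lines_containing("needle one\nalpha\nbeta needle\n", 'needle');
```

No regex engine is involved here, which is one plausible reason this style stays ahead of even the faster patterns discussed below.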

Re^3: Surprisingly poor regex performance
by dragonchild (Archbishop) on Dec 13, 2004 at 17:59 UTC
    Well, ^ doesn't mean the same as \n? at all. Like not even close.

    According to Programming Perl, 3rd edition, pages 150 and 159, ^ used with the /m modifier matches at the beginning of the string or just after an embedded newline.

    \n? means "optionally match a newline": it can match one, or it can match nothing at all. That doesn't give the engine any information about where to start matching. Had you said something like /\n(.*$pat.*\n)/, you would have been better off (except for not matching the first line). Of course, with /m, you should be able to do /^(.*$pat.*)$/omsg and it should be about as efficient as my regex.
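The difference between these forms can be seen in a small sketch. The sample text and variable names are made up for illustration. The patterns here also omit /s, which is an assumption on my part so that . stops at each newline and every match stays on one line.

```perl
use strict;
use warnings;

# Hypothetical sample data: the first and last lines contain the pattern.
my $text = "needle one\nalpha\nbeta needle\n";
my $pat  = 'needle';

# ^ under /m is zero-width: it matches at the start of the string or
# just after a newline, and consumes nothing.
my @anchored = $text =~ /^(.*\Q$pat\E.*)$/mg;

# (?:\A|\n) covers the same positions, but consumes the newline when
# there is one.
my @explicit = $text =~ /(?:\A|\n)(.*\Q$pat\E.*)/g;

# Requiring a literal \n on both sides cannot match the first line,
# because no newline precedes it.
my @newline_only = $text =~ /\n(.*\Q$pat\E.*)\n/g;

# \n? may match the empty string, so it is allowed at every position
# and tells the engine nothing about where a line begins.
my @optional = $text =~ /\n?(.*\Q$pat\E.*)/g;

print "anchored:     @anchored\n";      # needle one beta needle
print "explicit:     @explicit\n";      # needle one beta needle
print "newline_only: @newline_only\n";  # beta needle
print "optional:     @optional\n";      # needle one beta needle
```

The \n? form ends up with the same captures on this input, but only after the engine has tried a match at every character offset, which is where the time goes.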

    With regexes, it's always better to be as explicit as possible. This will allow the engine to make a number of optimizations. Some of those optimizations, as you have found out, can mean the difference between 2500 seconds and 23 seconds.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      You're right, of course; I shouldn't have said "meant the same as"; I meant "would have the same effect as".

      Still, I think it's accurate to say that ^ means very nearly the same as your example: (?:\A|\n), so it's still quite surprising to me that yours is so much faster.

        I'm actually really curious about that, as well. I /msg'ed japhy asking if he'd pop in and help us out. My benchmarking shows a 15x speedup using (?:\A|\n) over using ^ with the /m modifier.

        Oh - taking away the /m modifier when using (?:\A|\n) results in potentially something like a 1% speedup. I guess randomly adding modifiers is bad. :-)
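A comparison along these lines can be sketched with the core Benchmark module. The test data, sizes, and sub names below are assumptions, not the code that produced the numbers above, and the ratio will likely differ (or vanish) depending on the Perl version and the input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: 20_000 short lines, every 100th one
# containing the pattern.
my $pat  = 'needle';
my $text = join '',
    map { $_ % 100 ? "line $_ filler\n" : "line $_ $pat\n" } 1 .. 20_000;

# Each sub collects the matching lines and returns how many it found.
sub with_caret    { my @m = $text =~ /^(.*\Q$pat\E.*)$/mg;       scalar @m }
sub with_explicit { my @m = $text =~ /(?:\A|\n)(.*\Q$pat\E.*)/g; scalar @m }

# Sanity check before timing: both patterns must find the same lines.
die "patterns disagree" unless with_caret() == with_explicit();

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, {
    caret_m  => \&with_caret,
    explicit => \&with_explicit,
});
```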

Re^3: Surprisingly poor regex performance
by Joost (Canon) on Dec 13, 2004 at 17:22 UTC
    Good question. I guess using ^ to anchor at the start of each line is probably just not very well optimized in the engine.

    The usual trick to make regexes faster is to be as explicit as you can be (like spelling out the newlines), and it helped here. I don't know enough about Perl's regex internals to really explain why it works as well as it does in this case.