chuckd has asked for the wisdom of the Perl Monks concerning the following question:

Why does:
$string =~ s/\^s//g;
$string =~ s/\s$//g;
$string =~ s/\s+$//g;
run faster than:
$string =~ s/\^s|\s$|\s+$/g;
on long strings?
You would think the second one would be faster: it uses a short-circuit operator, so it only has to read through the string once instead of three times. But it is not.

Replies are listed 'Best First'.
Re: question about reg exp engine
by GrandFather (Saint) on Aug 03, 2008 at 23:38 UTC

    One issue is that the result is clouded by an extra bogus match. Only one of the 'tail' matches is required. Adding the second one is really bad news for the single regex variant. Choosing to 'nibble' (\s) instead of swallowing whole (\s+) does the single regex a huge disservice too. Consider the following benchmark:

                     Rate sExNibble singleNibble mExNibble multi multiNibble single
    sExNibble      1795/s        --         -74%      -99%  -99%        -99%   -99%
    singleNibble   6881/s      283%           --      -95%  -95%        -97%   -98%
    mExNibble    134665/s     7403%        1857%        --   -5%        -49%   -60%
    multi        142315/s     7830%        1968%        6%    --        -46%   -58%
    multiNibble  263146/s    14562%        3724%       95%   85%          --   -22%
    single       338640/s    18769%        4822%      151%  138%         29%     --

    Perl reduces RSI - it saves typing
Re: question about reg exp engine
by broomduster (Priest) on Aug 03, 2008 at 21:58 UTC
    Alternation ('|') in a regex is not a short-circuit operator. It's not "substitute for the first alternative found", rather it's "substitute for any alternatives found."

    Some other points: I think you want ^\s not \^s, and your alternation form does not terminate the replacement part of the substitution (i.e., you're missing the last '/').
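
    To illustrate the alternation point, here is a minimal sketch (not from the original post; the helper names trim_all and trim_first are made up for the example). With /g the substitution is applied wherever either alternative matches; without /g it stops after the leftmost match.

```perl
use strict;
use warnings;

# Hypothetical helper names, chosen for illustration only.
sub trim_all {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;    # /g: keep going after the first match
    return $s;
}

sub trim_first {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//;     # no /g: stop after the leftmost match
    return $s;
}

print '[', trim_all("  hello  "),   "]\n";    # prints [hello]
print '[', trim_first("  hello  "), "]\n";    # prints [hello  ]
```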

      Yes, I have a typo in the post. It should be ^\s, but that still doesn't give me an answer. Why does it run slower than running all three substitutions on different lines?
        Why does it run slower than running all three substitutions on different lines?
        Because the first three are all optimisable; they are all explicitly anchored to the beginning or end of the string, and the regex engine is smart enough to try the match only at the beginning or end of the string, respectively.

        The combined pattern is too complex to be optimised, so the engine naively tries matching at every position in the (long) string.

        Dave.
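
        One way to try this on your own machine is sketched below. It is an assumption-laden example, not anyone's original benchmark: the test string (about 10 KB with whitespace between every word) and the sub names are invented, and the relative numbers will depend on your Perl version and hardware.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Assumed test data: about 10 KB of text with whitespace at both ends
# and between every word.
my $text = '  ' . ( 'word ' x 2000 ) . '  ';

# Two separately anchored substitutions.
sub trim_separate {
    my ($s) = @_;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

# One combined pattern using alternation.
sub trim_combined {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

# A fixed iteration count keeps the run short; raise it for steadier numbers.
cmpthese( 2000, {
    separate => sub { trim_separate($text) },
    combined => sub { trim_combined($text) },
} );
```

        Both subs return the same trimmed string, so any difference in the reported rates comes from how the engine handles the two forms.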

        Yes, I have a typo in the post. It should be ^\s, but that still doesn't give me an answer.
        If you fix that typo and then run the benchmarks, I think you will see that they are about the same speed. I see speed differences of 0-3% with the typo fixed, and 20-25% with the typo in place, probably because some optimization is possible when the regex says "beginning of string" rather than a literal '^' at an arbitrary place in the string.
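
        For anyone unsure what the typo changes, a small sketch (helper names are hypothetical): \^s matches a literal caret followed by the letter s anywhere in the string, while ^\s matches one whitespace character at the start.

```perl
use strict;
use warnings;

# Hypothetical helper names, for illustration only.
sub strip_typo {
    my ($s) = @_;
    $s =~ s/\^s//g;    # as posted: removes every literal "^s"
    return $s;
}

sub strip_fixed {
    my ($s) = @_;
    $s =~ s/^\s//;     # as intended: removes one leading whitespace character
    return $s;
}

print '[', strip_typo(" a^sb"),  "]\n";    # prints [ ab]
print '[', strip_fixed(" a^sb"), "]\n";    # prints [a^sb]
```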
Re: question about reg exp engine
by eosbuddy (Scribe) on Aug 03, 2008 at 23:19 UTC
    Not sure about my argument, but a plausible explanation (after reading Mastering Regular Expressions on the mechanics of expression processing) is that your second choice, with the '|' alternation, needs to backtrack: the regex tries the first alternative and looks for a match, backtracks, takes the second alternative and looks for a match, and so on. Since that backtracking is absent when the choices are on three separate lines, I would think that explains the speed... my 2 cents :-)

    Update: I am wrong after all. Nonetheless, testing your expression against your first two choices might show the same speed; the last one, with the + sign, might be the cause of the latency in the chain. If I could see the string, I might be able to experiment with it to shed more light.
Re: question about reg exp engine
by linuxer (Curate) on Aug 04, 2008 at 08:48 UTC

    Your first three regexes do not need the /g modifier. They are anchored, so /g is useless here. If you leave it out, the work will be done faster...

    Your second regex (one whitespace at the end of line/string) is redundant, because it is already included in the third regex (one or more whitespace(s) at the end of line/string).
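
    A quick way to convince yourself of both points is sketched here (the sub names and sample strings are made up for the example): the original three substitutions with the typo fixed, and a two-step version without /g and without the separate s/\s$//, should give the same result on each sample.

```perl
use strict;
use warnings;

# The original three substitutions, with the ^\s typo fixed.
sub three_step {
    my ($s) = @_;
    $s =~ s/^\s//g;
    $s =~ s/\s$//g;
    $s =~ s/\s+$//g;
    return $s;
}

# Same job without /g and without the separate s/\s$//.
sub two_step {
    my ($s) = @_;
    $s =~ s/^\s//;
    $s =~ s/\s+$//;
    return $s;
}

for my $in ( "  padded   ", "none", " x", "tab\t \t" ) {
    my ( $old, $new ) = ( three_step($in), two_step($in) );
    print $old eq $new ? "same" : "DIFFERENT", ": [$new]\n";
}
```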

    Do you really want to remove all whitespace from the end of the line, but only one character from the beginning?

    $string =~ s/^\s+//;          # assumed to remove *all* whitespace from the beginning
    $string =~ s/\s+$//;          # assumed to remove *all* whitespace from the end

    $string =~ s/^\s+|\s+$//g;    # /g needed here

    Benchmark:

    #!/usr/bin/perl
    # vi:ts=4 sw=4 et:
    use strict;
    use warnings;

    use Benchmark qw( cmpthese );

    # Each test trims a string of 1000 spaces, an 'x', and 1000 more spaces.
    my $results = cmpthese( -3, {
        # two anchored substitutions, each with a needless /g
        'r2/g' => sub {
            my $string = ( ' ' x 1000 ) . 'x' . ( ' ' x 1000 );
            $string =~ s/^\s+//g;
            $string =~ s/\s+$//g;
        },
        # two anchored substitutions without /g
        'r2' => sub {
            my $string = ( ' ' x 1000 ) . 'x' . ( ' ' x 1000 );
            $string =~ s/^\s+//;
            $string =~ s/\s+$//;
        },
        # one combined substitution using alternation
        'r1' => sub {
            my $string = ( ' ' x 1000 ) . 'x' . ( ' ' x 1000 );
            $string =~ s/^\s+|\s+$//g;
        },
    } );

    Result:

             Rate   r1 r2/g   r2
    r1   135605/s   -- -19% -25%
    r2/g 168081/s  24%   --  -7%
    r2   181494/s  34%   8%   --

    If your text in $string is multilined, you should check perlre for the /m modifier.
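
    As a rough sketch of what that could look like (my own example; trim_lines is a hypothetical name): with /m, ^ and $ also match at every line boundary, so one substitution can trim each line. The character class [ \t] is my substitution for \s here, because \s also matches newlines and could join lines or swallow the final newline.

```perl
use strict;
use warnings;

# Hypothetical helper: trim every line of a multi-line string.
# [ \t] is used instead of \s so that newlines are left alone.
sub trim_lines {
    my ($s) = @_;
    $s =~ s/^[ \t]+|[ \t]+$//gm;
    return $s;
}

my $block = "  first  \n\n\tsecond\t\n";
print trim_lines($block);    # prints "first", an empty line, then "second"
```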

    update: fixed typo.