Well, in an effort to better understand Perl's regex code, I've started writing my own. It'll be much simpler, I know, but I hope to get an idea as to how such things work -- specfically, ways to optimize some regexes. (I've been telling myself that Perl's engine is so hideous, this exercise will end up being futile. ;) But I'm doing it anyway.)

So today, I came across a scary fact. I was telling someone to use two regexes for removing leading/trailing whitespace, because one regex is a slow evil thing. Here were attempts made:

s/^\s+(.*?)\s+$/$1/; # fails sometimes s/^\s*(.*?)\s*$/$1/; # succeeds, but slow s/^\s*(.*\S)\s*$/$1/; # fails sometimes s/^\s*//, s/\s*$//; # succeeds, but WHY use * ? s/^\s+//, s/\s+$//; # succeeds, but is it good?
That last example, I said, was great, because (I thought) Perl optimizes things like /\s+$/. I was (shockingly) wrong. It turns out that /\s$/ is optimized, but not the + version.

Eww...

For interested parties, this is how Perl executes s/\s+$// on a string like "a b  c   d    ".

$_ = "a b c d "; # 1, 2, 3, 4 spaces s/\s+$//; =pod A = \s+ B = $ X = fail "a b c d " AX AAX AAAX AAAA
See? It tests at all the whitespace. How ugly. Simple regexes like that should be optimized.

So I tried the sexeger approach:

($_ = reverse) =~ s/^\s+//; $_ = reverse;
That ran faster. Then I thought about the optimization of /\s$/, and tried:
1 while s/\s$//;
That was reasonably fast too -- much faster than the s/\s+$// approach, at least.

Note that these findings are important, insofaras embedded whitespace in the string -- that's the root of the problem for the s/\s+$// approach.

_____________________________________________________
Jeff japhy Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Replies are listed 'Best First'.
Re: japhy blabs about regexes (again)
by runrig (Abbot) on Jul 16, 2001 at 22:49 UTC
    I'm using perl 5.6.1, and I don't get those results. Here's the script I'm using. I was just curious about leading spaces by itself, so I threw that in, but the WHILE_RE version is definitely slower than either the SEXEGER or the SINGLE_RE version. And on the leading & trailing white space stripping, the LEADTRAIL is version is quicker than the LTSAVE version.
    #!/usr/local/bin/perl -w use strict; use Benchmark; my $str = " a b c d "; timethese(-5, { LEADING=>\&leading, LEADTRAIL=>\&lead_trail, LTSAVE=>\&lt_save, SEXEGER=>\&sexeger, SINGLE_RE=>\&single_re, WHILE_RE=>\&while_sub, }); sub leading { local $_ = $str; s/^\s+//; $_; } sub lead_trail { local $_ = $str; s/^\s+|\s+$//g; $_; } sub lt_save { local $_ = $str; s/^\s*(.*?)\s*$/$1/; $_; } sub sexeger { local $_ = reverse $str; s/^\s+//; reverse $str; } sub single_re { local $_ = $str; s/\s+$//; $_; } sub while_sub { local $_ = $str; 1 while s/\s$//; $_; } ~/tst >./tst3 Benchmark: running LEADING, LEADTRAIL, LTSAVE, SEXEGER, SINGLE_RE, WHI +LE_RE, eac h for at least 5 CPU seconds... LEADING: 5 wallclock secs ( 5.00 usr + 0.00 sys = 5.00 CPU) @ 77 +596.80/s ( n=387984) LEADTRAIL: 5 wallclock secs ( 5.22 usr + 0.00 sys = 5.22 CPU) @ 26 +504.21/s ( n=138352) LTSAVE: 5 wallclock secs ( 5.18 usr + 0.00 sys = 5.18 CPU) @ 18 +690.54/s ( n=96817) SEXEGER: 4 wallclock secs ( 5.35 usr + 0.00 sys = 5.35 CPU) @ 57 +972.52/s ( n=310153) SINGLE_RE: 5 wallclock secs ( 5.23 usr + 0.00 sys = 5.23 CPU) @ 56 +434.42/s ( n=295152) WHILE_RE: 6 wallclock secs ( 5.00 usr + 0.00 sys = 5.00 CPU) @ 36 +090.00/s ( n=180450)
    Update: I cut 'n pasted the code and output, then fixed LTSAVE, then forgot to repaste. I guess no one looked closely enough to notice that LT_SAVE was actually beating LEADTRAIL. Its all fixed now though :)
      Here's my output (from bleadperl):
      Benchmark: running F_plus, F_sexeger, F_while, P_plus, P_sexeger, P_wh +ile, each for at least 5 CPU seconds... F_plus: 6 wallclock secs ( 5.38 usr + 0.02 sys = 5.40 CPU) @ 38 +010.19/s (n=205255) F_sexeger: 6 wallclock secs ( 5.19 usr + 0.00 sys = 5.19 CPU) @ 82 +085.16/s (n=426022) F_while: 5 wallclock secs ( 5.23 usr + 0.00 sys = 5.23 CPU) @ 99 +934.23/s (n=522656) P_plus: 6 wallclock secs ( 5.22 usr + 0.00 sys = 5.22 CPU) @ 34 +659.58/s (n=180923) P_sexeger: 7 wallclock secs ( 5.11 usr + 0.00 sys = 5.11 CPU) @ 54 +039.53/s (n=276142) P_while: 6 wallclock secs ( 5.14 usr + 0.00 sys = 5.14 CPU) @ 58 +260.31/s (n=299458) Rate P_plus F_plus P_sexeger P_while F_sexeger F_whi +le P_plus 34660/s -- -9% -36% -41% -58% -6 +5% F_plus 38010/s 10% -- -30% -35% -54% -6 +2% P_sexeger 54040/s 56% 42% -- -7% -34% -4 +6% P_while 58260/s 68% 53% 8% -- -29% -4 +2% F_sexeger 82085/s 137% 116% 52% 41% -- -1 +8% F_while 99934/s 188% 163% 85% 72% 22% +--
      The F stands for "fail", and the P stands for "pass". For me, the while-approach fails AND succeeds faster than the sexeger- and plus-approaches, and sexeger fails AND succeeds faster than the plus-approach.

      And here's the code I ran.

      #!/usr/bin/perl use Benchmark 'cmpthese'; my $X = "a b c d e f g h i j k l "; my $Y = "a b c d e f g h i j k l"; cmpthese(-5, { P_while => sub { my $x = $X; 1 while $x =~ s/\s$//; }, P_plus => sub { my $x = $X; $x =~ s/\s+$//; }, P_sexeger => sub { my $x = reverse $X; $x =~ s/^\s+//; $x = reverse $x; }, F_while => sub { my $x = $Y; 1 while $x =~ s/\s$//; }, F_plus => sub { my $x = $Y; $x =~ s/\s+$//; }, F_sexeger => sub { my $x = reverse $Y; $x =~ s/^\s+//; $x = reverse $x; }, });

      _____________________________________________________
      Jeff japhy Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        That explains it. If you add a few spaces to the end of your 'passing' string, then P_while will come in last. I suppose that's because of the cost in executing the regex a few more times. So if you expect its likely that there's a few spaces to truncate, better not to use the while :)

        But its still probably a good case for optimizing regexes anchored at the end of a string.

Re: japhy blabs about regexes (again)
by srawls (Friar) on Jul 16, 2001 at 22:06 UTC
    Hmm... I find the 1 while s/\s$//; very interesting, but shocking as well. I guess I just blindly asumed that s/\s*$// was optimized.

    Anyway, I, too, am working on a regex engine; mine is coded in C. It's a slow process for me, but I'm learning a lot--I haven't tried any optimizations yet, though, I'll have to finish before I start that : )

    The 15 year old, freshman programmer,
    Stephen Rawls

Re: japhy blabs about regexes (again)
by DrZaius (Monk) on Jul 16, 2001 at 22:22 UTC
    I haven't tried any benchmarks and I am likely not to, but this little ditty comes to mind:

    s!\s*\s$!!

    Although, I can also see how it would be slower :)

      Oh, I was hoping someone would bite. ;)

      That regex is worse. You'd think by breaking \s+ down into \s*\s, you'd get the best of both worlds -- the "1 or more" meaning, and the "trailing character" optimization. No such luck. In fact, the regex does MORE work, because of backtracking. :(

      You see, Perl doesn't even see if the string has ANY trailing whitespace if you do /\s+$/. Meaning it takes a while to fail for a string like "a b c d e f g h i j k l m n o p q r s t u v w x y z". How sad.

      # try this, and be sad :( use re 'debug'; "abc def ghi" =~ /\s+$/; "abc def ghi" =~ /\s*\s$/;
      The output (if you understand it) will upset you. Well, it upset me.

      _____________________________________________________
      Jeff japhy Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: japhy blabs about regexes (again)(ot)(boo)
by boo_radley (Parson) on Jul 16, 2001 at 23:37 UTC
Re: japhy blabs about regexes (again)
by Wookie (Beadle) on Jul 17, 2001 at 16:47 UTC
    Surely all this depends on how much white space you're typically expecting. Could the following not also be more efficent depending on the data:
    if ($blah=~m/^\s/) { $blah=~s/^\s+//; }
    or indeed (thanks to MZSanford for this one):
    if (index($blah," ")) == 0 { $blah=~s/^\s+//; }
    and for trailing - use rindex :).


    UPDATE: Oops - The index solution only works for *actual* spaces (thanks Hofmator).

    That possibly gets you a little closer to the best of both worlds?
    game(Wookie,opponent) eq 'Wookie' ? undef $problem : remove_limbs(arms,opponent);
      The problem isn't leading spaces. Leading spaces are terribly easy: ^\s+ says "find the beginning of the string and one or more spaces". It's trailing spaces that are hideous. Instead of \s+$ meaning "find the end of the string, and one or more spaces before it", it means "find one or more spaces, and then try for the end of the string". That's sad of it.

      _____________________________________________________
      Jeff japhy Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: japhy blabs about regexes (again)
by MZSanford (Curate) on Jul 17, 2001 at 20:43 UTC
    Being curious as Perl people will be, i tired an rindex based solution. It unfortunatly would only work with true spaces (" "), and did not fare too well in testing, but, we all learn when things like this happen ... here is my code/output :

    Code Addition :
    P_rindex => sub { my $x = $X; while (1) { my $i = length($x); (rindex($x," ") == $i ? chop : las +t) }; }, F_rindex => sub { my $x = $Y; while (1) { my $i = length($x); (rindex($x," ") == $i ? chop : las +t) }; },


    Output :
    Benchmark: running F_plus, F_rindex, F_sexeger, F_while, P_plus, P_rin +dex,P_sexeger, P_while, each for at least 5 CPU seconds... F_plus: 6 wallclock secs ( 5.15 usr + 0.00 sys = 5.15 CPU) @ 37 +600.97/s(n=193645) F_rindex: 6 wallclock secs ( 5.45 usr + 0.00 sys = 5.45 CPU) @ 59 +183.30/s(n=322549) F_sexeger: 7 wallclock secs ( 5.24 usr + 0.00 sys = 5.24 CPU) @ 59 +187.60/s(n=310143) F_while: 7 wallclock secs ( 5.29 usr + 0.00 sys = 5.29 CPU) @ 69 +741.02/s(n=368930) P_plus: 5 wallclock secs ( 5.46 usr + 0.00 sys = 5.46 CPU) @ 33 +136.08/s(n=180923) P_rindex: 6 wallclock secs ( 5.31 usr + 0.00 sys = 5.31 CPU) @ 60 +743.69/s(n=322549) P_sexeger: 5 wallclock secs ( 5.36 usr + 0.00 sys = 5.36 CPU) @ 49 +557.65/s(n=265629) P_while: 6 wallclock secs ( 5.09 usr + 0.00 sys = 5.09 CPU) @ 50 +179.17/s(n=255412) Rate P_plus F_plus P_sexeger P_while F_rindex F_sexeger P_ +rindex F_while P_plus 33136/s -- -12% -33% -34% -44% -44% + -45%-52% F_plus 37601/s 13% -- -24% -25% -36% -36% + -38%-46% P_sexeger 49558/s 50% 32% -- -1% -16% -16% + -18%-29% P_while 50179/s 51% 33% 1% -- -15% -15% + -17%-28% F_rindex 59183/s 79% 57% 19% 18% -- -0% + -3%-15% F_sexeger 59188/s 79% 57% 19% 18% 0% -- + -3%-15% P_rindex 60744/s 83% 62% 23% 21% 3% 3% + ---13% F_while 69741/s 110% 85% 41% 39% 18% 18% + 15%--

    As usual, nothing fantastic, but worth the learning experiance.
    OH, a sarcasm detector, that’s really useful
      I wrote four XS functions, and one of them (used here) is called rnchrpos(STR, CHARS [, POS]). It works just like rindex(), except it searches for a character that is not one of a set of characters. It's similar to doing:
      sub rnchrpos { my ($str, $chars, $pos) = @_; $pos ||= $pos; $pos += length $str if $pos < 0; return ( substr($str, 0, $pos) =~ /[\Q$chars\E]*\z/ ? $-[0] - 1 : -1 ); }
      Except that mine is faster, since it's written in C. The results vary greatly. It comes down to knowing your string. If your string has lots of chunks of whitespace, the regular regex approach is bad. If your string has a long string of trailing whitespace, the rnchrpos() approach is bad. If there's no trailing whitespace, or very little, the while approach and the rnchrpos() are good. The sexeger approach is pretty good if the string isn't very long. If your string has a few chunks of whitespace the regular regex approach is good.

      We live in a very variable world.

      _____________________________________________________
      Jeff japhy Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Why not write your own?
by John M. Dlugosz (Monsignor) on Jul 17, 2001 at 02:12 UTC
    I found, a while back, a regex library for C++ that was mostly Perl compatible. As you point out, the code actually in Perl is rather difficult to fathem and simply grabbing that part and incorporating it into a C program is not possible. But having that kind of power in the "other" language is so lustful...

      I am writing my own engine. Hopefully, I'll gain some insights into Perl's.

      _____________________________________________________
      Jeff japhy Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: japhy blabs about regexes (again)
by coreolyn (Parson) on Jul 17, 2001 at 19:40 UTC

    I've been using s/[ ]+//g; which seems to work effectively on all the varied spacing it's run into. Sticking it into runrigs Bench test as CURIOUS it yields:

    Benchmark: running CURIOUS, LEADING, LEADTRAIL, LTSAVE, SEXEGER, SINGL +E_RE, WHILE_RE, each for at least 5 CPU seconds... CURIOUS: 6 wallclock secs ( 5.42 usr + 0.00 sys = 5.42 CPU) @ 39 +780.44/s (n=215610) LEADING: 6 wallclock secs ( 5.29 usr + 0.00 sys = 5.29 CPU) @ 68 +400.00/s (n=361836) LEADTRAIL: 6 wallclock secs ( 5.21 usr + 0.00 sys = 5.21 CPU) @ 28 +325.53/s (n=147576) LTSAVE: 6 wallclock secs ( 5.32 usr + 0.00 sys = 5.32 CPU) @ 30 +798.68/s (n=163849) SEXEGER: 5 wallclock secs ( 5.28 usr + 0.00 sys = 5.28 CPU) @ 50 +426.89/s (n=266254) SINGLE_RE: 6 wallclock secs ( 5.31 usr + 0.00 sys = 5.31 CPU) @ 54 +235.40/s (n=287990) WHILE_RE: 7 wallclock secs ( 5.47 usr + 0.00 sys = 5.47 CPU) @ 34 +396.53/s (n=188149)
    coreolyn

      Mmmhh, you are removing all the spaces from the string, this was not the original question. It was about removing whitespace (which includes \t, \r, \n, \f and your space) from the start and end of a string.

      btw, you are using a character class in your regex where it is not necessary, so just write: s/ +//g; and for this task a regex is overkill, use a transliteration (tr) instead: tr/ //d;

      -- Hofmator

        I figured I wasn't quite getting it, but stuck my nose out there anyway. What's that saying about curiousity? :)

        coreolyn