japhy blabs about regexes (again)

Replies are listed 'Best First'.
Re: japhy blabs about regexes (again) by runrig (Abbot) on Jul 16, 2001 at 22:49 UTC
I'm using perl 5.6.1, and I don't get those results. Here's the script I'm using. I was just curious about leading spaces by itself, so I threw that in, but the WHILE_RE version is definitely slower than either the SEXEGER or the SINGLE_RE version. And on the leading & trailing white space stripping, the LEADTRAIL is version is quicker than the LTSAVE version. #!/usr/local/bin/perl -w use strict; use Benchmark; my $str = " a b c d "; timethese(-5, { LEADING=>\&leading, LEADTRAIL=>\&lead_trail, LTSAVE=>\&lt_save, SEXEGER=>\&sexeger, SINGLE_RE=>\&single_re, WHILE_RE=>\&while_sub, }); sub leading { local $_ = $str; s/^\s+//; $_; } sub lead_trail { local $_ = $str; s/^\s+\|\s+$//g; $_; } sub lt_save { local $_ = $str; s/^\s(.?)\s$/$1/; $_; } sub sexeger { local $_ = reverse $str; s/^\s+//; reverse $str; } sub single_re { local $_ = $str; s/\s+$//; $_; } sub while_sub { local $_ = $str; 1 while s/\s$//; $_; } ~/tst >./tst3 Benchmark: running LEADING, LEADTRAIL, LTSAVE, SEXEGER, SINGLE_RE, WHI +LE_RE, eac h for at least 5 CPU seconds... LEADING: 5 wallclock secs ( 5.00 usr + 0.00 sys = 5.00 CPU) @ 77 +596.80/s ( n=387984) LEADTRAIL: 5 wallclock secs ( 5.22 usr + 0.00 sys = 5.22 CPU) @ 26 +504.21/s ( n=138352) LTSAVE: 5 wallclock secs ( 5.18 usr + 0.00 sys = 5.18 CPU) @ 18 +690.54/s ( n=96817) SEXEGER: 4 wallclock secs ( 5.35 usr + 0.00 sys = 5.35 CPU) @ 57 +972.52/s ( n=310153) SINGLE_RE: 5 wallclock secs ( 5.23 usr + 0.00 sys = 5.23 CPU) @ 56 +434.42/s ( n=295152) WHILE_RE: 6 wallclock secs ( 5.00 usr + 0.00 sys = 5.00 CPU) @ 36 +090.00/s ( n=180450) [download] Update:* I cut 'n pasted the code and output, then fixed LTSAVE, then forgot to repaste. I guess no one looked closely enough to notice that LT_SAVE was actually beating LEADTRAIL. Its all fixed now though :)	[reply] [d/l]
Re: Re: japhy blabs about regexes (again) by japhy (Canon) on Jul 16, 2001 at 23:07 UTC
Here's my output (from bleadperl): Benchmark: running F_plus, F_sexeger, F_while, P_plus, P_sexeger, P_wh +ile, each for at least 5 CPU seconds... F_plus: 6 wallclock secs ( 5.38 usr + 0.02 sys = 5.40 CPU) @ 38 +010.19/s (n=205255) F_sexeger: 6 wallclock secs ( 5.19 usr + 0.00 sys = 5.19 CPU) @ 82 +085.16/s (n=426022) F_while: 5 wallclock secs ( 5.23 usr + 0.00 sys = 5.23 CPU) @ 99 +934.23/s (n=522656) P_plus: 6 wallclock secs ( 5.22 usr + 0.00 sys = 5.22 CPU) @ 34 +659.58/s (n=180923) P_sexeger: 7 wallclock secs ( 5.11 usr + 0.00 sys = 5.11 CPU) @ 54 +039.53/s (n=276142) P_while: 6 wallclock secs ( 5.14 usr + 0.00 sys = 5.14 CPU) @ 58 +260.31/s (n=299458) Rate P_plus F_plus P_sexeger P_while F_sexeger F_whi +le P_plus 34660/s -- -9% -36% -41% -58% -6 +5% F_plus 38010/s 10% -- -30% -35% -54% -6 +2% P_sexeger 54040/s 56% 42% -- -7% -34% -4 +6% P_while 58260/s 68% 53% 8% -- -29% -4 +2% F_sexeger 82085/s 137% 116% 52% 41% -- -1 +8% F_while 99934/s 188% 163% 85% 72% 22% +-- [download] The F stands for "fail", and the P stands for "pass". For me, the while-approach fails AND succeeds faster than the sexeger- and plus-approaches, and sexeger fails AND succeeds faster than the plus-approach. And here's the code I ran. `#!/usr/bin/perl use Benchmark 'cmpthese'; my $X = "a b c d e f g h i j k l "; my $Y = "a b c d e f g h i j k l"; cmpthese(-5, { P_while => sub { my $x = $X; 1 while $x =~ s/\s$//; }, P_plus => sub { my $x = $X; $x =~ s/\s+$//; }, P_sexeger => sub { my $x = reverse $X; $x =~ s/^\s+//; $x = reverse $x; }, F_while => sub { my $x = $Y; 1 while $x =~ s/\s$//; }, F_plus => sub { my $x = $Y; $x =~ s/\s+$//; }, F_sexeger => sub { my $x = reverse $Y; $x =~ s/^\s+//; $x = reverse $x; }, });` [download] _____________________________________________________ Jeff `japhy` Pinyan: Perl, regex, and perl hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l] [select]
Re: Re: Re: japhy blabs about regexes (again) by runrig (Abbot) on Jul 16, 2001 at 23:28 UTC
That explains it. If you add a few spaces to the end of your 'passing' string, then P_while will come in last. I suppose that's because of the cost in executing the regex a few more times. So if you expect its likely that there's a few spaces to truncate, better not to use the while :) But its still probably a good case for optimizing regexes anchored at the end of a string.	[reply]
Re: japhy blabs about regexes (again) by srawls (Friar) on Jul 16, 2001 at 22:06 UTC
Hmm... I find the `1 while s/\s$//;` very interesting, but shocking as well. I guess I just blindly asumed that `s/\s*$//` was optimized. Anyway, I, too, am working on a regex engine; mine is coded in C. It's a slow process for me, but I'm learning a lot--I haven't tried any optimizations yet, though, I'll have to finish before I start that : ) The 15 year old, freshman programmer, Stephen Rawls	[reply] [d/l] [select]
Re: japhy blabs about regexes (again) by DrZaius (Monk) on Jul 16, 2001 at 22:22 UTC
I haven't tried any benchmarks and I am likely not to, but this little ditty comes to mind: `s!\s*\s$!!` Although, I can also see how it would be slower :)	[reply] [d/l]
Re: Re: japhy blabs about regexes (again) by japhy (Canon) on Jul 16, 2001 at 22:43 UTC
Oh, I was hoping someone would bite. `;)` That regex is worse. You'd think by breaking `\s+` down into `\s\s`, you'd get the best of both worlds -- the "1 or more" meaning, and the "trailing character" optimization. No such luck.* In fact, the regex does MORE work, because of backtracking. `:(` You see, Perl doesn't even see if the string has ANY trailing whitespace if you do `/\s+$/`. Meaning it takes a while to fail for a string like "a b c d e f g h i j k l m n o p q r s t u v w x y z". How sad. `# try this, and be sad :( use re 'debug'; "abc def ghi" =~ /\s+$/; "abc def ghi" =~ /\s\s$/;` [download] The output (if you understand it) will upset you. Well, it upset me. _____________________________________________________ Jeff `japhy` Pinyan: Perl, regex, and perl* hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l]
Re: japhy blabs about regexes (again)(ot)(boo) by boo_radley (Parson) on Jul 16, 2001 at 23:37 UTC
9	[reply]
Re: japhy blabs about regexes (again) by Wookie (Beadle) on Jul 17, 2001 at 16:47 UTC
Surely all this depends on how much white space you're typically expecting. Could the following not also be more efficent depending on the data: `if ($blah=~m/^\s/) { $blah=~s/^\s+//; }` [download] or indeed (thanks to MZSanford for this one): `if (index($blah," ")) == 0 { $blah=~s/^\s+//; }` [download] and for trailing - use rindex :). UPDATE: Oops - The index solution only works for actual spaces (thanks Hofmator). That possibly gets you a little closer to the best of both worlds? game(Wookie,opponent) eq 'Wookie' ? undef $problem : remove_limbs(arms,opponent);	[reply] [d/l] [select]
Re: Re: japhy blabs about regexes (again) by japhy (Canon) on Jul 17, 2001 at 18:45 UTC
The problem isn't leading spaces. Leading spaces are terribly easy: `^\s+` says "find the beginning of the string and one or more spaces". It's trailing spaces that are hideous. Instead of `\s+$` meaning "find the end of the string, and one or more spaces before it", it means "find one or more spaces, and then try for the end of the string". That's sad of it. _____________________________________________________ Jeff `japhy` Pinyan: Perl, regex, and perl hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply]
Re: japhy blabs about regexes (again) by MZSanford (Curate) on Jul 17, 2001 at 20:43 UTC
Being curious as Perl people will be, i tired an rindex based solution. It unfortunatly would only work with true spaces (" "), and did not fare too well in testing, but, we all learn when things like this happen ... here is my code/output : Code Addition : `P_rindex => sub { my $x = $X; while (1) { my $i = length($x); (rindex($x," ") == $i ? chop : las +t) }; }, F_rindex => sub { my $x = $Y; while (1) { my $i = length($x); (rindex($x," ") == $i ? chop : las +t) }; },` [download] Output : Benchmark: running F_plus, F_rindex, F_sexeger, F_while, P_plus, P_rin +dex,P_sexeger, P_while, each for at least 5 CPU seconds... F_plus: 6 wallclock secs ( 5.15 usr + 0.00 sys = 5.15 CPU) @ 37 +600.97/s(n=193645) F_rindex: 6 wallclock secs ( 5.45 usr + 0.00 sys = 5.45 CPU) @ 59 +183.30/s(n=322549) F_sexeger: 7 wallclock secs ( 5.24 usr + 0.00 sys = 5.24 CPU) @ 59 +187.60/s(n=310143) F_while: 7 wallclock secs ( 5.29 usr + 0.00 sys = 5.29 CPU) @ 69 +741.02/s(n=368930) P_plus: 5 wallclock secs ( 5.46 usr + 0.00 sys = 5.46 CPU) @ 33 +136.08/s(n=180923) P_rindex: 6 wallclock secs ( 5.31 usr + 0.00 sys = 5.31 CPU) @ 60 +743.69/s(n=322549) P_sexeger: 5 wallclock secs ( 5.36 usr + 0.00 sys = 5.36 CPU) @ 49 +557.65/s(n=265629) P_while: 6 wallclock secs ( 5.09 usr + 0.00 sys = 5.09 CPU) @ 50 +179.17/s(n=255412) Rate P_plus F_plus P_sexeger P_while F_rindex F_sexeger P_ +rindex F_while P_plus 33136/s -- -12% -33% -34% -44% -44% + -45%-52% F_plus 37601/s 13% -- -24% -25% -36% -36% + -38%-46% P_sexeger 49558/s 50% 32% -- -1% -16% -16% + -18%-29% P_while 50179/s 51% 33% 1% -- -15% -15% + -17%-28% F_rindex 59183/s 79% 57% 19% 18% -- -0% + -3%-15% F_sexeger 59188/s 79% 57% 19% 18% 0% -- + -3%-15% P_rindex 60744/s 83% 62% 23% 21% 3% 3% + ---13% F_while 69741/s 110% 85% 41% 39% 18% 18% + 15%-- [download] As usual, nothing fantastic, but worth the learning experiance. OH, a sarcasm detector, that�s really useful	[reply] [d/l] [select]
Re: Re: japhy blabs about regexes (again) by japhy (Canon) on Jul 18, 2001 at 01:06 UTC
I wrote four XS functions, and one of them (used here) is called `rnchrpos(STR, CHARS [, POS])`. It works just like `rindex()`, except it searches for a character that is not one of a set of characters. It's similar to doing: `sub rnchrpos { my ($str, $chars, $pos) = @_; $pos \|\|= $pos; $pos += length $str if $pos < 0; return ( substr($str, 0, $pos) =~ /[\Q$chars\E]\z/ ? $-[0] - 1 : -1 ); }` [download] Except that mine is faster, since it's written in C. The results vary greatly. It comes down to knowing your string. If your string has lots of chunks of whitespace, the regular regex approach is bad. If your string has a long string of trailing whitespace, the `rnchrpos()` approach is bad. If there's no trailing whitespace, or very little, the while approach and the `rnchrpos()` are good. The sexeger approach is pretty good if the string isn't very long. If your string has a few chunks of whitespace the regular regex approach is good. We live in a very variable world. _____________________________________________________ Jeff `japhy` Pinyan: Perl, regex, and perl* hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l]
Why not write your own? by John M. Dlugosz (Monsignor) on Jul 17, 2001 at 02:12 UTC
I found, a while back, a regex library for C++ that was mostly Perl compatible. As you point out, the code actually in Perl is rather difficult to fathem and simply grabbing that part and incorporating it into a C program is not possible. But having that kind of power in the "other" language is so lustful...	[reply]
Re: Why not write your own? by japhy (Canon) on Jul 17, 2001 at 02:15 UTC
I am writing my own engine. Hopefully, I'll gain some insights into Perl's. _____________________________________________________ Jeff `japhy` Pinyan: Perl, regex, and perl hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply]
Re: japhy blabs about regexes (again) by coreolyn (Parson) on Jul 17, 2001 at 19:40 UTC
I've been using `s/[ ]+//g;` which seems to work effectively on all the varied spacing it's run into. Sticking it into runrigs Bench test as CURIOUS it yields: Benchmark: running CURIOUS, LEADING, LEADTRAIL, LTSAVE, SEXEGER, SINGL +E_RE, WHILE_RE, each for at least 5 CPU seconds... CURIOUS: 6 wallclock secs ( 5.42 usr + 0.00 sys = 5.42 CPU) @ 39 +780.44/s (n=215610) LEADING: 6 wallclock secs ( 5.29 usr + 0.00 sys = 5.29 CPU) @ 68 +400.00/s (n=361836) LEADTRAIL: 6 wallclock secs ( 5.21 usr + 0.00 sys = 5.21 CPU) @ 28 +325.53/s (n=147576) LTSAVE: 6 wallclock secs ( 5.32 usr + 0.00 sys = 5.32 CPU) @ 30 +798.68/s (n=163849) SEXEGER: 5 wallclock secs ( 5.28 usr + 0.00 sys = 5.28 CPU) @ 50 +426.89/s (n=266254) SINGLE_RE: 6 wallclock secs ( 5.31 usr + 0.00 sys = 5.31 CPU) @ 54 +235.40/s (n=287990) WHILE_RE: 7 wallclock secs ( 5.47 usr + 0.00 sys = 5.47 CPU) @ 34 +396.53/s (n=188149) [download] coreolyn	[reply] [d/l] [select]
Re: Re: japhy blabs about regexes (again) by Hofmator (Curate) on Jul 17, 2001 at 19:54 UTC
Mmmhh, you are removing all the spaces from the string, this was not the original question. It was about removing whitespace (which includes \t, \r, \n, \f and your space) from the start and end of a string. btw, you are using a character class in your regex where it is not necessary, so just write: `s/ +//g;` and for this task a regex is overkill, use a transliteration (tr) instead: `tr/ //d;` -- Hofmator	[reply] [d/l] [select]
Re: Re: Re: japhy blabs about regexes (again) by coreolyn (Parson) on Jul 17, 2001 at 20:09 UTC
I figured I wasn't quite getting it, but stuck my nose out there anyway. What's that saying about curiousity? :) coreolyn	[reply]