question about reg exp engine

chuckd has asked for the wisdom of the Perl Monks concerning the following question:

Re: question about reg exp engine by GrandFather (Saint) on Aug 03, 2008 at 23:38 UTC
One issue is that the result is clouded by an extra bogus match. Only one of the 'tail' matches is required; adding the second one is really bad news for the single regex variant. Choosing to 'nibble' (`\s`) instead of swallowing whole (`\s+`) does the single regex a huge disservice too. Consider the following benchmark:

```
                 Rate sExNibble singleNibble mExNibble multi multiNibble single
sExNibble      1795/s        --         -74%      -99%  -99%        -99%   -99%
singleNibble   6881/s      283%           --      -95%  -95%        -97%   -98%
mExNibble    134665/s     7403%        1857%        --   -5%        -49%   -60%
multi        142315/s     7830%        1968%        6%    --        -46%   -58%
multiNibble  263146/s    14562%        3724%       95%   85%          --   -22%
single       338640/s    18769%        4822%      151%  138%         29%     --
```
Re: question about reg exp engine by broomduster (Priest) on Aug 03, 2008 at 21:58 UTC
Alternation ('|') in a regex is not a short-circuit operator. It's not "substitute for the first alternative found"; rather, it's "substitute for any alternatives found." Some other points: I think you want `^\s`, not `\^s`, and your alternation form does not terminate the replacement part of the substitution (i.e., you're missing the last '/').
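A minimal sketch of the point about alternation, assuming the combined pattern in question is `s/^\s+|\s+$//` with and without `/g`. The sub names `trim_once` and `trim_all` and the sample string are made up for illustration and do not come from the original post:

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only.

# Without /g the substitution stops after the first successful match,
# so only the leading whitespace is removed.
sub trim_once {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//;
    return $s;
}

# With /g the engine keeps scanning after the first match, so the
# second alternative also gets to match at the end of the string.
sub trim_all {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

my $sample = "  hello world  ";
printf "once: [%s]\n", trim_once($sample);    # once: [hello world  ]
printf "all:  [%s]\n", trim_all($sample);     # all:  [hello world]
```

The single space between the two words survives in both cases because neither alternative can match there: `^` fails away from the start, and `\s+$` needs the whitespace to run up to the end of the string.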
Re^2: question about reg exp engine by chuckd (Scribe) on Aug 03, 2008 at 22:37 UTC
Yes, I have a typo in the post. It should be `^\s`, but that still doesn't give me an answer. Why does it run slower than running all three substitutions on different lines?
Re^3: question about reg exp engine by dave_the_m (Monsignor) on Aug 03, 2008 at 23:17 UTC
"Why does it run slower than running all three substitutions on different lines?"

Because the first three are all optimisable; they are all explicitly anchored to the beginning or end of the string, and the regex engine is smart enough to try the match only at the beginning or end of the string, respectively. The combined pattern is too complex to be optimised, so the engine naively tries matching at every position in the (long) string.

Dave.
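One way to check this explanation on a given perl is to time both forms on a long string. The following is only a sketch under stated assumptions: the test string (10,000 non-space characters with a little whitespace at each end) and the sub names `trim_separate` and `trim_combined` are invented here, and the relative timings depend on the perl version and on the data:

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Assumed test data: a long line with whitespace only at the two ends.
my $text = ( ' ' x 10 ) . ( 'x' x 10_000 ) . ( ' ' x 10 );

# Two anchored substitutions, one per end of the string.
sub trim_separate {
    my ($s) = @_;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

# One substitution with both anchored alternatives and /g.
sub trim_combined {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

# Sanity check: both variants must produce the same trimmed string.
die "variants disagree" unless trim_separate($text) eq trim_combined($text);

# A fixed iteration count keeps the run short; a negative value such
# as -3 (minimum CPU seconds per variant) gives steadier timings.
cmpthese( 5_000, {
    separate => sub { trim_separate($text) },
    combined => sub { trim_combined($text) },
} );
```

Each sub works on its own copy of the string, so every iteration starts from untrimmed input; otherwise later iterations would measure a string that has already been trimmed.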
Re^3: question about reg exp engine by broomduster (Priest) on Aug 03, 2008 at 23:55 UTC
"Yes, I have a typo in the post. It should be `^\s`, but that still doesn't give me an answer."

If you fix that typo and then run benchmarks, I think you will see that they are about the same speed. I see speed differences of 0-3% with the typo fixed, and 20-25% with the typo in place, probably because some optimization is possible when the regex says "beginning of string" and not '^' in an arbitrary place in the string.
Re^4: question about reg exp engine by GrandFather (Saint) on Aug 04, 2008 at 00:04 UTC
Re^5: question about reg exp engine by broomduster (Priest) on Aug 04, 2008 at 00:11 UTC
Some notes below your chosen depth have not been shown here
Re: question about reg exp engine by eosbuddy (Scribe) on Aug 03, 2008 at 23:19 UTC
Not sure about my argument, but a plausible explanation (after reading Mastering RegEx on the mechanics of ex proc ..) is that your second choice, with the `|` (OR) part, actually needs to do backtracking: the regex engine tries the first alternative, looks for a match, backtracks, takes the second alternative and looks for a match, and so on. Since that backtracking is absent when the choices are on three separate lines, I would think that explains the speed difference... my 2 cents :-)

Update: I am wrong after all. Nonetheless, testing your expression against your first two choices might give the same speed; the last one, with the + sign, might be the cause of the latency in the chain. If I could see the string, I might be able to experiment with it to shed more light.
Re: question about reg exp engine by linuxer (Curate) on Aug 04, 2008 at 08:48 UTC
Your first three regexes do not need the /g modifier. You are using anchors, so /g is useless here. If you leave it out, the work will be done faster.

Your second regex (one whitespace character at the end of the line/string) is redundant, because it is already covered by the third regex (one or more whitespace characters at the end of the line/string).

Do you really want to remove all whitespace from the end of the line, but only one character from the beginning?

```
$string =~ s/^\s+//;         # assumed to remove all whitespace from the beginning
$string =~ s/\s+$//;         # assumed to remove all whitespace from the end
$string =~ s/^\s+|\s+$//g;   # /g needed here
```

Benchmark:

```
#!/usr/bin/perl
# vi:ts=4 sw=4 et:
use strict;
use warnings;
use Benchmark qw( cmpthese );

# NOTE: as posted, ( ' ' . 1000 ) concatenates and yields ' 1000', so each
# test string ends in ' 1000' rather than in 1000 spaces; ( ' ' x 1000 )
# was probably intended.
my $results = cmpthese( -3, {
    'r2/g' => sub {
        my $string = ( ' ' x 1000 ) . 'x' . ( ' ' . 1000 );
        $string =~ s/^\s+//g;
        $string =~ s/\s+$//g;
    },
    'r2' => sub {
        my $string = ( ' ' x 1000 ) . 'x' . ( ' ' . 1000 );
        $string =~ s/^\s+//;
        $string =~ s/\s+$//;
    },
    'r1' => sub {
        my $string = ( ' ' x 1000 ) . 'x' . ( ' ' . 1000 );
        $string =~ s/^\s+|\s+$//g;
    }
} );
```

Result:

```
         Rate   r1 r2/g   r2
r1   135605/s   -- -19% -25%
r2/g 168081/s  24%   --  -7%
r2   181494/s  34%   8%   --
```

If the text in $string is multi-line, you should check perlre for the /m modifier.

update: fixed typo.
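For the multi-line case mentioned above, a minimal sketch of what /m changes. The sub name `trim_lines` and the sample data are hypothetical. It uses `[ \t]` rather than `\s`, on the assumption that newlines should be kept, since `\s` also matches "\n" and would swallow blank lines:

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: trim every line of a
# multi-line string. With /m, ^ and $ match at each line boundary;
# here /g is needed so the substitution is repeated for every line.
sub trim_lines {
    my ($s) = @_;
    $s =~ s/^[ \t]+//mg;    # leading blanks of each line
    $s =~ s/[ \t]+$//mg;    # trailing blanks of each line
    return $s;
}

my $block = "  first  \n\tsecond\t\n third \n";
print trim_lines($block);
```

Without /m, the same two substitutions would only touch the start of the first line and the end of the last one.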