Regex failure interpretation

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regex failure interpretation by kvale (Monsignor) on Mar 19, 2004 at 22:32 UTC
I'll take a shot. What this does is capture a single character to $1, again and again. The last capture is the one you see in $1. So the regexp is actually capturing the last element of your binary string. To capture the first element, try `$_ =~ m/^([01])[01]*$/ and print "$_:'$1'\n" for qw[ 0 1 00 11 10 01 012];` -Mark	[reply] [d/l]
Re: Regex failure interpretation by eXile (Priest) on Mar 19, 2004 at 22:48 UTC
I'll go with Mark's explanation. YAPE::Regex::Explain says: `use YAPE::Regex::Explain; print YAPE::Regex::Explain->new(qr/^([01])+$/)->explain();` [download] output: The regular expression: (?-imsx:^([01])+$) matches as follows: NODE EXPLANATION ---------------------------------------------------------------------- (?-imsx: group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally): ---------------------------------------------------------------------- ^ the beginning of the string ---------------------------------------------------------------------- ( group and capture to \1 (1 or more times (matching the most amount possible)): ---------------------------------------------------------------------- [01] any character of: '0', '1' ---------------------------------------------------------------------- )+ end of \1 (NOTE: because you're using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1) ---------------------------------------------------------------------- $ before an optional \n, and the end of the string ---------------------------------------------------------------------- ) end of grouping ---------------------------------------------------------------------- [download] So the ()+ match you do puts the last match in $1.	[reply] [d/l] [select]
Re: Regex failure interpretation by Aristotle (Chancellor) on Mar 20, 2004 at 09:08 UTC
If you want to assert that a string only contains a certain class of characters it's better to make sure it doesn't match the opposite; otherwise you're making the regex engine unnecessarily jump through hoops. And if you want the first character, well, just pick it out. `print substr($_, 0, 1) unless /[^01]/;` [download] Cleaner and a lot more efficient. Bench the solutions with a few very long strings if you're curious. Makeshifts last the longest.	[reply] [d/l]
Re: Re: Regex failure interpretation by BrowserUk (Patriarch) on Mar 20, 2004 at 11:36 UTC
I'm probably doing something dumb, but I can't reproduce that? `use Benchmark qw[ cmpthese ]; @good = map{ join'',map{ rand() < .5 ? 1 : 0 } 1..1000 } 1..100; @bad = map{ substr( $s=$_, rand( 1000 ), 1 ) = 2 } @good; cmpthese( -3, { '+ve' => q[ $n1 = grep /^[01]+$/, @good, @bad ], '-ve' => q[ $n2 = grep !/[^01]/, @good, @bad ] }); Rate -ve +ve -ve 467/s -- -80% +ve 2363/s 406% -- print $n1, $n2; 100 100` [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply] [d/l]
Re^3: Regex failure interpretation by Aristotle (Chancellor) on Mar 20, 2004 at 12:13 UTC
D'oh. Figures that the complex expression would trigger an optimization: $ perl -Mre=debug -e'/[^01]/, /^[01]+$/' Freeing REx: `","' Compiling REx `[^01]' size 12 Got 100 bytes for offset annotations. first at 1 1: ANYOF[\0-/2-\377{unicode_all}](12) 12: END(0) stclass `ANYOF[\0-/2-\377{unicode_all}]' minlen 1 Offsets: [12] 1[5] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 6[0] Compiling REx `^[01]+$' size 15 Got 124 bytes for offset annotations. first at 2 synthetic stclass `ANYOF[01]'. 1: BOL(2) 2: PLUS(14) 3: ANYOF[01](0) 14: EOL(15) 15: END(0) floating `'$ at 1..2147483647 (checking floating) stclass `ANYOF[01]' +anchored(BOL) minlen 1 Offsets: [15] 1[1] 6[4] 2[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[ +0] 7[1] 8[0] Freeing REx: `"[^01]"' Freeing REx: `"^[01]+$"' [download] On the other hand when using one of the builtins like `\w`, it still manages to beat the complex expression by a small margin: #!/usr/bin/perl use strict; use warnings; use Benchmark qw( cmpthese ); my $len = shift; my @abc = 'a' .. 'z'; my @good = ('') x 100; for(@good) { $_ .= $abc[ rand @abc ] for 1 .. $len } my @bad = @good; substr $_, rand( length ), 1, ' ' for @bad; my %bench = ( excl => sub { grep /\A\w+\z/, @good, @bad }, incl => sub { grep !/\W/, @good, @bad }, ); print map "$_ matches: ".$bench{$_}->()."\n", keys %bench; cmpthese -3 => \%bench; __END__ $ perl t.pl 100 excl matches: 100 incl matches: 100 Benchmark: running excl, incl for at least 3 CPU seconds... excl: 3 wallclock secs ( 3.20 usr + 0.00 sys = 3.20 CPU) @ 68 +26.88/s (n=21846) incl: 3 wallclock secs ( 3.20 usr + 0.00 sys = 3.20 CPU) @ 68 +72.50/s (n=21992) Rate excl incl excl 6827/s -- -1% incl 6872/s 1% -- $ perl t.pl 1000 excl matches: 100 incl matches: 100 Benchmark: running excl, incl for at least 3 CPU seconds... excl: 3 wallclock secs ( 3.16 usr + 0.00 sys = 3.16 CPU) @ 14 +31.65/s (n=4524) incl: 3 wallclock secs ( 3.16 usr + 0.00 sys = 3.16 CPU) @ 12 +55.38/s (n=3967) Rate incl excl incl 1255/s -- -12% excl 1432/s 14% -- $ perl t.pl 10000 excl matches: 100 incl matches: 100 Benchmark: running excl, incl for at least 3 CPU seconds... excl: 4 wallclock secs ( 3.21 usr + 0.00 sys = 3.21 CPU) @ 15 +8.26/s (n=508) incl: 4 wallclock secs ( 3.16 usr + 0.00 sys = 3.16 CPU) @ 13 +5.76/s (n=429) Rate incl excl incl 136/s -- -14% excl 158/s 17% -- $ perl -Mre=debug -e'/\W/, /^\w+$/' Freeing REx: `","' Compiling REx `\W' size 2 Got 20 bytes for offset annotations. first at 1 1: NALNUM(2) 2: END(0) stclass `NALNUM' minlen 1 Offsets: [2] 1[2] 3[0] Compiling REx `^\w+$' size 5 Got 44 bytes for offset annotations. first at 2 synthetic stclass `ANYOF[0-9A-Z_a-z{unicode_all}]'. 1: BOL(2) 2: PLUS(4) 3: ALNUM(0) 4: EOL(5) 5: END(0) floating `'$ at 1..2147483647 (checking floating) stclass `ANYOF[0-9A- +Z_a-z{unicode_all}]' anchored(BOL) minlen 1 Offsets: [5] 1[1] 4[2] 2[2] 5[1] 6[0] Freeing REx: `"\\W"' Freeing REx: `"^\\w+$"' [download] Makeshifts last the longest.	[reply] [d/l] [select]
Re: Regex failure interpretation by Abigail-II (Bishop) on Mar 20, 2004 at 01:18 UTC
I find it quite logical $1 contains the last capture. After all, you'd expect $v to be 3 in `$v = $_ for 1, 2, 3;` [download] as well (I hope). Here's a way to capute the first character: `$ perl -wle 'm[^(?{undef $var})(?:([01])(?{$var //= $^N}))+$] and print "$_: '"'"'$var'"'"'" for qw[ 0 1 00 11 10 01 012]' 0: '0' 1: '1' 00: '0' 11: '1' 10: '1' 01: '0'` [download] Abigail	[reply] [d/l] [select]
Re: Re: Regex failure interpretation by BrowserUk (Patriarch) on Mar 20, 2004 at 02:07 UTC
Now I've seen the explanations, I agree, it is logical. That said, I still wish that captures were pushed to a global array (say @^N) rather than to $1 $2 etc. If @^N contained the captures pushed in the order they were captured it would alleviate many of the games one has to play when capturing variable numbers of things. (BTW. There are easier ways of correcting the regex:) Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply]
Re: Regex failure interpretation by tachyon (Chancellor) on Mar 20, 2004 at 00:23 UTC
Syntax error. You ask for one char, you get one char. The + needs to be within the capture and then it works as advertised. `$_ =~ m[^([01]+)$] and print "$_:'$1'\n" for qw[ 0 1 00 11 10 01 012]; __DATA__ 0:'0' 1:'1' 00:'00' 11:'11' 10:'10' 01:'01'` [download] cheers tachyon	[reply] [d/l]
Re: Re: Regex failure interpretation by BrowserUk (Patriarch) on Mar 20, 2004 at 00:40 UTC
Not quite. As I mentioned above, I only want to capture the first character, not the whole string. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply] [d/l]
Re: Re: Re: Regex failure interpretation by tachyon (Chancellor) on Mar 20, 2004 at 00:45 UTC
`$_ =~ m/^([01])[01]*$/ and print "$_:'$1'\n" for qw[ 0 1 00 11 10 01 0 +12];` [download] cheers tachyon	[reply] [d/l]
Re: Re: Re: Re: Regex failure interpretation by BrowserUk (Patriarch) on Mar 20, 2004 at 00:52 UTC
Re: Regex failure interpretation by bart (Canon) on Mar 20, 2004 at 18:44 UTC
First there's the solution that kvale gave, perhaps the ~~best~~ simplest choice: `/^([01])[01]$/` [download] but there are some other possibilities using lookahead: `/^(?=(.))[01]+$/` [download] `/^(?=[01]+$)(.)/` [download] I feel a benchmark coming up... `use Benchmark 'cmpthese'; cmpthese(-3, { plain => sub { my $x; ($x) = /^([01])[01]$/ foreach qw(0 1 00 11 1 +0 01 000x00) }, capture_in_lookahead => sub { my $x; ($x) = /^(?=(.))[01]+$/ foreac +h qw(0 1 00 11 10 01 000x00) }, lookahead_then_capture => sub { my $x; ($x) = /^(?=[01]+$)(.)/ fore +ach qw(0 1 00 11 10 01 000x00) }, });` [download] Result: Benchmark: running capture_in_lookahead, lookahead_then_capture, plain +, each for at least 3 CPU seconds... capture_in_lookahead: 3 wallclock secs ( 3.41 usr + 0.00 sys = 3.41 + CPU) @ 24781.52/s (n=84505) lookahead_then_capture: 4 wallclock secs ( 3.13 usr + 0.00 sys = 3. +13 CPU) @ 24592.33/s (n=76974) plain: 3 wallclock secs ( 3.08 usr + 0.00 sys = 3.08 CPU) @ 24 +988.64/s (n=76965) Rate lookahead_then_capture capture_in_looka +head plain lookahead_then_capture 24592/s -- + -1% -2% capture_in_lookahead 24782/s 1% + -- -1% plain 24989/s 2% + 1% -- [download] Well... "plain" appears to be slightly faster, but I wouldn't bother worrying about the difference. Whjat's a few percent, anyway?	[reply] [d/l] [select]
Re: Re: Regex failure interpretation by BrowserUk (Patriarch) on Mar 21, 2004 at 01:03 UTC
...I wouldn't bother worrying about the difference.... Funnily enough, I wasn't worrying about efficiency (this time:). Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply] [d/l]
Re^2: Regex failure interpretation (noise) by tye (Sage) on Mar 21, 2004 at 07:12 UTC
In my experience, differences of 2% when benchmarking are more likely to be noise than anything else. I notice that the speed matches the order that the tests were run (first slowest, last fastest). Rename the cases (so that they sort in a different order and so get run in a different order) and you might find a different 'winner'. For such tiny differences, just running it again could easily give you a different winner. Certainly, running on different platforms is unlikely to always produce the same winner even when the difference is more like 5% or even 15%. In future, you might want to have the benchmark run each test twice so you can get a feel for how much of one type of noise you have. For example, for each test case, foo, you give Benchmark a_foo and b_foo that point to the same code. Also, your test strings are even shorter than the expected inputs. That right there is often enough to make a benchmark find nothing. And you have a loop in your test code. Processing the loop could be taking more time than running regexes against your tiny strings. - tye	[reply]