Re: Regex failure interpretation
by kvale (Monsignor) on Mar 19, 2004 at 22:32 UTC
|
I'll take a shot. What this does is capture a single character to $1, again and again. The last capture is the one you see in $1. So the regexp is actually capturing the last element of your binary string.
To capture the first element, try
$_ =~ m/^([01])[01]*$/ and print "$_:'$1'\n" for qw[ 0 1 00 11 10 01 012];
| [reply] [d/l] |
Re: Regex failure interpretation
by eXile (Priest) on Mar 19, 2004 at 22:48 UTC
|
I'll go with Mark's explanation. YAPE::Regex::Explain says:
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/^([01])+$/)->explain();
output:
The regular expression:
(?-imsx:^([01])+$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
( group and capture to \1 (1 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[01] any character of: '0', '1'
----------------------------------------------------------------------
)+ end of \1 (NOTE: because you're using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
So the ()+ match you do puts the last match in $1. | [reply] [d/l] [select] |
Re: Regex failure interpretation
by Aristotle (Chancellor) on Mar 20, 2004 at 09:08 UTC
|
If you want to assert that a string only contains a certain class of characters it's better to make sure it doesn't match the opposite; otherwise you're making the regex engine unnecessarily jump through hoops. And if you want the first character, well, just pick it out.
print substr($_, 0, 1) unless /[^01]/;
Cleaner and a lot more efficient. Bench the solutions with a few very long strings if you're curious.
Makeshifts last the longest.
| [reply] [d/l] |
|
|
use Benchmark qw[ cmpthese ];
@good = map{ join'',map{ rand() < .5 ? 1 : 0 } 1..1000 } 1..100;
@bad = map{ substr( $s=$_, rand( 1000 ), 1 ) = 2 } @good;
cmpthese( -3, {
'+ve' => q[ $n1 = grep /^[01]+$/, @good, @bad ],
'-ve' => q[ $n2 = grep !/[^01]/, @good, @bad ]
});
Rate -ve +ve
-ve 467/s -- -80%
+ve 2363/s 406% --
print $n1, $n2;
100 100
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
| [reply] [d/l] |
|
|
D'oh. Figures that the complex expression would trigger an optimization:
$ perl -Mre=debug -e'/[^01]/, /^[01]+$/'
Freeing REx: `","'
Compiling REx `[^01]'
size 12 Got 100 bytes for offset annotations.
first at 1
1: ANYOF[\0-/2-\377{unicode_all}](12)
12: END(0)
stclass `ANYOF[\0-/2-\377{unicode_all}]' minlen 1
Offsets: [12]
1[5] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 6[0]
Compiling REx `^[01]+$'
size 15 Got 124 bytes for offset annotations.
first at 2
synthetic stclass `ANYOF[01]'.
1: BOL(2)
2: PLUS(14)
3: ANYOF[01](0)
14: EOL(15)
15: END(0)
floating `'$ at 1..2147483647 (checking floating) stclass `ANYOF[01]'
+anchored(BOL) minlen 1
Offsets: [15]
1[1] 6[4] 2[4] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[
+0] 7[1] 8[0]
Freeing REx: `"[^01]"'
Freeing REx: `"^[01]+$"'
On the other hand when using one of the builtins like \w, it still manages to beat the complex expression by a small margin:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw( cmpthese );
my $len = shift;
my @abc = 'a' .. 'z';
my @good = ('') x 100;
for(@good) { $_ .= $abc[ rand @abc ] for 1 .. $len }
my @bad = @good;
substr $_, rand( length ), 1, ' ' for @bad;
my %bench = (
excl => sub { grep /\A\w+\z/, @good, @bad },
incl => sub { grep !/\W/, @good, @bad },
);
print map "$_ matches: ".$bench{$_}->()."\n", keys %bench;
cmpthese -3 => \%bench;
__END__
$ perl t.pl 100
excl matches: 100
incl matches: 100
Benchmark: running excl, incl for at least 3 CPU seconds...
excl: 3 wallclock secs ( 3.20 usr + 0.00 sys = 3.20 CPU) @ 68
+26.88/s (n=21846)
incl: 3 wallclock secs ( 3.20 usr + 0.00 sys = 3.20 CPU) @ 68
+72.50/s (n=21992)
Rate excl incl
excl 6827/s -- -1%
incl 6872/s 1% --
$ perl t.pl 1000
excl matches: 100
incl matches: 100
Benchmark: running excl, incl for at least 3 CPU seconds...
excl: 3 wallclock secs ( 3.16 usr + 0.00 sys = 3.16 CPU) @ 14
+31.65/s (n=4524)
incl: 3 wallclock secs ( 3.16 usr + 0.00 sys = 3.16 CPU) @ 12
+55.38/s (n=3967)
Rate incl excl
incl 1255/s -- -12%
excl 1432/s 14% --
$ perl t.pl 10000
excl matches: 100
incl matches: 100
Benchmark: running excl, incl for at least 3 CPU seconds...
excl: 4 wallclock secs ( 3.21 usr + 0.00 sys = 3.21 CPU) @ 15
+8.26/s (n=508)
incl: 4 wallclock secs ( 3.16 usr + 0.00 sys = 3.16 CPU) @ 13
+5.76/s (n=429)
Rate incl excl
incl 136/s -- -14%
excl 158/s 17% --
$ perl -Mre=debug -e'/\W/, /^\w+$/'
Freeing REx: `","'
Compiling REx `\W'
size 2 Got 20 bytes for offset annotations.
first at 1
1: NALNUM(2)
2: END(0)
stclass `NALNUM' minlen 1
Offsets: [2]
1[2] 3[0]
Compiling REx `^\w+$'
size 5 Got 44 bytes for offset annotations.
first at 2
synthetic stclass `ANYOF[0-9A-Z_a-z{unicode_all}]'.
1: BOL(2)
2: PLUS(4)
3: ALNUM(0)
4: EOL(5)
5: END(0)
floating `'$ at 1..2147483647 (checking floating) stclass `ANYOF[0-9A-
+Z_a-z{unicode_all}]' anchored(BOL) minlen 1
Offsets: [5]
1[1] 4[2] 2[2] 5[1] 6[0]
Freeing REx: `"\\W"'
Freeing REx: `"^\\w+$"'
Makeshifts last the longest.
| [reply] [d/l] [select] |
Re: Regex failure interpretation
by Abigail-II (Bishop) on Mar 20, 2004 at 01:18 UTC
|
I find it quite logical $1 contains the last capture.
After all, you'd expect $v to be 3 in
$v = $_ for 1, 2, 3;
as well (I hope).
Here's a way to capute the first character:
$ perl -wle 'm[^(?{undef $var})(?:([01])(?{$var //= $^N}))+$]
and print "$_: '"'"'$var'"'"'" for qw[ 0 1 00 11 10 01 012]'
0: '0'
1: '1'
00: '0'
11: '1'
10: '1'
01: '0'
Abigail | [reply] [d/l] [select] |
|
|
Now I've seen the explanations, I agree, it is logical. That said, I still wish that captures were pushed to a global array (say @^N) rather than to $1 $2 etc.
If @^N contained the captures pushed in the order they were captured it would alleviate many of the games one has to play when capturing variable numbers of things.
(BTW. There are easier ways of correcting the regex:)
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
| [reply] |
Re: Regex failure interpretation
by tachyon (Chancellor) on Mar 20, 2004 at 00:23 UTC
|
Syntax error. You ask for one char, you get one char. The + needs to be within the capture and then it works as advertised.
$_ =~ m[^([01]+)$] and print "$_:'$1'\n" for qw[ 0 1 00 11 10 01 012];
__DATA__
0:'0'
1:'1'
00:'00'
11:'11'
10:'10'
01:'01'
| [reply] [d/l] |
|
|
| [reply] [d/l] |
|
|
$_ =~ m/^([01])[01]*$/ and print "$_:'$1'\n" for qw[ 0 1 00 11 10 01 0
+12];
| [reply] [d/l] |
|
|
Re: Regex failure interpretation
by bart (Canon) on Mar 20, 2004 at 18:44 UTC
|
First there's the solution that kvale gave, perhaps the best simplest choice:
/^([01])[01]*$/
but there are some other possibilities using lookahead:
/^(?=(.))[01]+$/
/^(?=[01]+$)(.)/
I feel a benchmark coming up...
use Benchmark 'cmpthese';
cmpthese(-3, {
plain => sub { my $x; ($x) = /^([01])[01]*$/ foreach qw(0 1 00 11 1
+0 01 000x00) },
capture_in_lookahead => sub { my $x; ($x) = /^(?=(.))[01]+$/ foreac
+h qw(0 1 00 11 10 01 000x00) },
lookahead_then_capture => sub { my $x; ($x) = /^(?=[01]+$)(.)/ fore
+ach qw(0 1 00 11 10 01 000x00) },
});
Result:
Benchmark: running capture_in_lookahead, lookahead_then_capture, plain
+, each for at least 3 CPU seconds...
capture_in_lookahead: 3 wallclock secs ( 3.41 usr + 0.00 sys = 3.41
+ CPU) @ 24781.52/s (n=84505)
lookahead_then_capture: 4 wallclock secs ( 3.13 usr + 0.00 sys = 3.
+13 CPU) @ 24592.33/s (n=76974)
plain: 3 wallclock secs ( 3.08 usr + 0.00 sys = 3.08 CPU) @ 24
+988.64/s (n=76965)
Rate lookahead_then_capture capture_in_looka
+head plain
lookahead_then_capture 24592/s --
+ -1% -2%
capture_in_lookahead 24782/s 1%
+ -- -1%
plain 24989/s 2%
+ 1% --
Well... "plain" appears to be slightly faster, but I wouldn't bother worrying about the difference. Whjat's a few percent, anyway?
| [reply] [d/l] [select] |
|
|
| [reply] [d/l] |
|
|
In my experience, differences of 2% when benchmarking are more likely to be noise than anything else.
I notice that the speed matches the order that the tests were run (first slowest, last fastest). Rename the cases (so that they sort in a different order and so get run in a different order) and you might find a different 'winner'.
For such tiny differences, just running it again could easily give you a different winner.
Certainly, running on different platforms is unlikely to always produce the same winner even when the difference is more like 5% or even 15%.
In future, you might want to have the benchmark run each test twice so you can get a feel for how much of one type of noise you have. For example, for each test case, foo, you give Benchmark a_foo and b_foo that point to the same code.
Also, your test strings are even shorter than the expected inputs. That right there is often enough to make a benchmark find nothing. And you have a loop in your test code. Processing the loop could be taking more time than running regexes against your tiny strings.
| [reply] |