Re: Puzzled by regex
by davido (Cardinal) on Apr 10, 2013 at 06:16 UTC
There is a difference, though it's not necessarily in what gets matched; rather, it's in how the match is made. Example:
use strict;
use warnings;

# Test strings: double underscores with varying content between them.
my @strings = (
    "____\n",  "__ __\n",  "__X __\n",
    "__Z__\n", "__\n__\n", "__^__\n",
    "_____\n", "________\n",
);

foreach my $string ( @strings ) {
    print "<<$string>>\n";

    # Non-greedy: \S+? takes as few non-whitespace characters as it can.
    if( $string =~ m/__(\S+?)__/ ) {
        print "\tNon-Greedy -- Match: (($1)).\n";
    }
    else {
        print "\tNon-Greedy -- No Match.\n";
    }

    # Greedy: \S+ takes as many non-whitespace characters as it can.
    if( $string =~ m/__(\S+)__/ ) {
        print "\tGreedy -- Match: [[$1]].\n";
    }
    else {
        print "\tGreedy -- No Match.\n";
    }
}
Most of that is going to be pretty boring, until you get to the last item in the list, where you'll get the following output:
<<________
>>
	Non-Greedy -- Match: ((_)).
	Greedy -- Match: [[____]].
I have no idea whether non-greedy matching is going to have any practical effect in the type of strings you're matching with the regex though.
Yes, but you forgot to put the trailing \n into the two regexes :-) If I put it in, the last one captures the same thing in both cases:
	Non-Greedy -- Match: ((____)).
	Greedy -- Match: [[____]].
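The two results can be reproduced side by side. This is only a minimal sketch using the last test string from the list above; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $string = "________\n";    # eight underscores, the last item in the list

# Without the trailing \n the two quantifiers capture different text.
my ($lazy_open)   = $string =~ m/__(\S+?)__/;
my ($greedy_open) = $string =~ m/__(\S+)__/;

# With the trailing \n in the pattern, both captures come out the same.
my ($lazy_anchored)   = $string =~ m/__(\S+?)__\n/;
my ($greedy_anchored) = $string =~ m/__(\S+)__\n/;

print "without \\n: lazy=($lazy_open) greedy=($greedy_open)\n";
print "with \\n:    lazy=($lazy_anchored) greedy=($greedy_anchored)\n";
```

The trailing \n pins down where the closing __ has to sit, so the non-greedy capture is forced to grow until it reaches the same place the greedy one backs off to.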
Thanks for the replies, guys. I'm about to mess with that code, but I was loath to do that while I couldn't see why the ? had been included in the regex. I still don't see why it's there - but at least now I'm starting to feel a little confident that it serves no purpose. (I'll still probably leave it there ... because I'm feeling even more confident that it doesn't do any harm :-)
Cheers, Rob
"but at least now I'm starting to feel a little confident that it serves no purpose. (I'll still probably leave it there ... because I'm feeling even more confident that it doesn't do any harm :-)"
It's probably a reflex :) I know that when I write regexes I make more mistakes from greediness than from non-greediness, so I tend to write +? and *? to be on the safe side.
I know I'm not alone in getting bitten by it; greediness is a frequent cause of newbies' regex problems, and going non-greedy is a frequent fix.
Re: Puzzled by regex
by Athanasius (Archbishop) on Apr 10, 2013 at 04:26 UTC
When I saw \S+? in the regex, my first thought was that it is equivalent to \S*. But it isn’t, as a little experimentation shows.
Consulting the Camel Book (4th Edition, page 214), I found that + means “1 or more times maximally” and +? means “1 or more times minimally.”
So, the difference between the two forms is not whether they match: if one matches, both must match. The difference lies only in what is matched, and this is relevant only if this part is captured (or, just possibly, if efficiency is an issue).
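Both points can be checked with a short experiment. This is a sketch with made-up sample strings, not anything taken from the code in question.

```perl
use strict;
use warnings;

# \S+? still needs at least one character between the markers; \S* does not.
my $four = "____";
my $plus_lazy_matches = ($four =~ /__\S+?__/) ? 1 : 0;
my $star_matches      = ($four =~ /__\S*__/)  ? 1 : 0;
print "\\S+? matches: $plus_lazy_matches, \\S* matches: $star_matches\n";

# Where both + and +? match, only the captured text differs.
my $sample = "__abc__def__";
my ($minimal) = $sample =~ /__(\S+?)__/;
my ($maximal) = $sample =~ /__(\S+)__/;
print "minimal: ($minimal), maximal: ($maximal)\n";
```

With these strings, \S+? fails on "____" where \S* succeeds, and the minimal capture is abc while the maximal one is abc__def.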
Hope that helps,
"The difference lies only in what is matched, and this is relevant only if this part is captured"
Well ... the regex does capture that part, but afaics, when both regexes match, they both match the same thing. Do you have an example that demonstrates this difference?
Just to be clear - I can see that /\S+?/ and /\S+/ could conceivably match differently, but I don't see how /__\S+?__\n/ and /__\S+__\n/ can match differently.
(And it's important to me that I do understand how they match differently if, indeed, they can.)
In case I'm guilty of not presenting the full picture, the regex (it's a split) as it appears in Inline.pm is actually:
@{$DATA{$pkg}} = split /(?m)(__\S+?__\n)/, $data;
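A quick way to test this is to run the split both ways on sample input. The sketch below uses a made-up $data string (the marker names are hypothetical) and includes one marker with __ in the middle, which is the case most likely to expose a difference.

```perl
use strict;
use warnings;

# Hypothetical input: marker lines of the form __Name__ separating bodies.
my $data = "preamble\n__Foo__\nfoo body\n__Bar__Baz__\nbar body\n";

my @lazy   = split /(?m)(__\S+?__\n)/, $data;
my @greedy = split /(?m)(__\S+__\n)/,  $data;

# Show each piece on one line, with newlines made visible.
for my $piece (@lazy) {
    (my $shown = $piece) =~ s/\n/\\n/g;
    print "[$shown]\n";
}
print "same pieces: ", ("@lazy" eq "@greedy" ? "yes" : "no"), "\n";
```

The reason the two agree seems to be that \S cannot match \n: from a given starting __, the run of non-whitespace characters ends at the first whitespace, so there is only one place where __\n can follow it, and both quantifiers end up there.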
Cheers, Rob
Re: Puzzled by regex
by Don Coyote (Hermit) on Apr 10, 2013 at 08:18 UTC
Random thoughts on match operating efficiency
The difference between the maximally matching quantifier (.+), greedy, and the minimally matching quantifier (.+?), non-greedy, in the case of the + (1 or more) quantifier, lies in what is matched but, more importantly, in how, or from where, it is matched.
In the maximal case the match position begins from eol and backtracks a position at a time, checking for a match each time, repeating until success or until the starting match position is reached.
In the non-greedy case the match position starts from the starting match position and forward-tracks a character at a time until success or eol.
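The two search directions described above can be sketched with a toy model. This is not Perl's real regex engine, which is optimised and may do far less work; model_match is a hypothetical helper that only tries a match at the start of the string and counts how many candidate lengths it tests for the \S+ part.

```perl
use strict;
use warnings;

# Toy model of a backtracking match for /\A__(\S+)__\n/ or /\A__(\S+?)__\n/.
# Returns (captured text, number of candidate lengths tried), or an empty
# list when nothing fits.
sub model_match {
    my ($string, $mode) = @_;    # $mode: 'greedy' or 'lazy'
    my ($prefix, $suffix) = ('__', "__\n");
    return unless index($string, $prefix) == 0;

    # Longest run of non-whitespace characters following the prefix.
    my ($run) = substr($string, length $prefix) =~ /\A(\S*)/;
    my $max = length $run;

    # Greedy starts from the longest candidate and backs off;
    # lazy starts from the shortest candidate and extends.
    my @lengths = $mode eq 'greedy' ? reverse(1 .. $max) : (1 .. $max);

    my $tried = 0;
    for my $len (@lengths) {
        $tried++;
        my $rest = substr($string, length($prefix) + $len);
        return (substr($run, 0, $len), $tried) if index($rest, $suffix) == 0;
    }
    return;
}

my $string = "__abcdefgh__\n";
my ($greedy_text, $greedy_tries) = model_match($string, 'greedy');
my ($lazy_text,   $lazy_tries)   = model_match($string, 'lazy');

print "greedy: ($greedy_text) after $greedy_tries tries\n";
print "lazy:   ($lazy_text) after $lazy_tries tries\n";
```

In this model both modes capture the same text, but on this string the greedy search needs fewer tries than the lazy one. On a string with a short name between the markers it would likely be the other way round, so which direction is cheaper depends on the input.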
application of + quantifier behaviour to ? quantifier behaviour:
Applying this to the ? (0 or 1) quantifier, I would expect the starting positions to differ: the greedy match starts one position ahead, and the non-greedy match starts at the starting match position.
Random summation:
The difference is not in what is matched, but how, or from where, the matching starts. This effectively increases the nongreedy match efficiency by the reduction of one jump ahead operation per usage.
Just Random:
I would imagine this will have been internally optimised, unless (or even especially if) there is perhaps a security benefit to a look-forward match as opposed to a look-behind match.
update later the same day
Crumbs, +(0 or 1) quantifier, well, that is incorrect. The '+' is the (1 or more) quantifier.
OK, so to fix the above example I have replaced the '*' quantifiers with '+' quantifiers, and I have replaced the '+' quantifiers with '?' quantifiers, so at least what I wrote makes sense. Which it does, despite the syntax errors now rectified.
After attempting to provide some examples where differences would be found between the default greedy behaviour and the non-greedy behaviour indicated by a secondary '?' quantifier, I realised that you are right: there are no differences in what is matched when the '\n' is included. That agrees with davido's response and my own: the difference is in how the match is carried out.
Re: Puzzled by regex
by Loops (Curate) on Apr 10, 2013 at 04:20 UTC
In the first regular expression you're using the '?' operator which says the previous character or group is optional. But you're applying it against the '+' operator which makes no sense, since it means one-or-more.
If instead you group the \S and + together and then apply the '?', you'll see different results:
use warnings;
my @str = ("____\n", "__ __\n", "__X __\n", "__Z__\n", "__\n__\n", "__^__\n");
for (@str) {
    if ($_ =~ /__(\S+)?__\n/) { print "1 " }
    else                      { print "0 " }
    if ($_ =~ /__\S+__\n/)    { print "1\n" }
    else                      { print "0\n" }
}
__END__
1 0
0 0
0 0
1 1
0 0
1 1
So the answer is, Inline.pm just has a bug.
David,
Thanks for correcting me. I was just dead wrong.
It's explained here: Matching Repetitions. A bit further down it explains, "minimal match or non-greedy quantifiers ??, *?, +?, and {}?"
Learn something new every day.
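For anyone else who trips over the same thing, here is a small sketch of the quantifiers named in that quote, next to the optional-group form tried earlier in this post. The test strings are made up for illustration.

```perl
use strict;
use warnings;

my $text = "aaa";

# A trailing ? on a quantifier makes it minimal instead of maximal.
my %captured;
for my $quant ('?', '??', '*', '*?', '+', '+?', '{2,3}', '{2,3}?') {
    my ($got) = $text =~ /\A(a${quant})/;
    $captured{$quant} = $got;
    printf "%-8s captures '%s'\n", "a$quant", $got;
}

# A ? after a group is a different thing: it makes the whole group optional.
my $empty_marker = "____\n";
my $optional_group = ($empty_marker =~ /__(\S+)?__\n/) ? 1 : 0;   # group may be skipped
my $lazy_plus      = ($empty_marker =~ /__\S+?__\n/)   ? 1 : 0;   # still needs one character
print "(\\S+)? matches: $optional_group, \\S+? matches: $lazy_plus\n";
```

So +? changes how much the + takes, while (\S+)? changes whether the group has to be there at all, which is why the two gave different results on "____\n".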