First, I changed your grouping so I could see what was really
happening:
% perl -de 0
DB<1> x "ab" =~ /(a*)((?:ab)*|b*)/
0 'a'
1 ''
So Perl finds a match quite easily and gives up; fair enough.
Regexen are supposed to be greedy but that doesn't include
trying a second alternative (something after a veritcal bar)
to see if it can match a longer string that way. So lets
force Perl to be less lazy by forcing a match to the end
of the string:
DB<2> x "ab" =~ /(a*)((?:ab)*|b*)$/
0 'a'
1 'b'
Does that answer the question well enough for you?
-
tye
(but my friends call me "Tye") | [reply] [d/l] [select] |
You've found an issue that bites many programmers (including me). At first glance, the star, being greedy, slurps up everything. However, when you have an alternation, the regex will take the first successful match. Thus, the (ab)* will successfully match nothing and the regex is satisfied. If you reverse the (ab)* and (b)*, the star, being greedy, will match the "b". Try the following code:
$text = 'ab';
if ($text =~ /(a*)((?:ab)*|b*)/) {
print "'$1', '$2' \n";
}
if ($text =~ /(a*)(b*|(?:ab)*)/) {
print "'$1', '$2' \n";
}
The output is as follows:
'a', ''
'a', 'b'
Incidentally, Perl uses a traditional NFA engine for regex matching. If it used the POSIX-NFA engine or a DFA engine, your regex would work as you expect because those engines try to find the longest match that satisfies the regex. If you have experience with those engines, Perl may cause you some confusion.
Cheers,
Ovid
P.S. I'm glad to see you have a sense of humor about the flack you took :)
Update: Oops. dchetlin is right. I was typing too fast and didn't consider the DFA issue.
Join the Perlmonks Setiathome Group or just go the the link and check out our stats. | [reply] [d/l] [select] |
- Incidentally, Perl uses a traditional NFA engine for regex matching. If it used the POSIX-NFA engine or a DFA engine, your regex
would work as you expect because those engines try to find the longest match that satisfies the regex.
Weeeeell, kinda. A DFA engine in general can't do backreferences, so he wouldn't have been able to test the $1 thing to begin with...
You're right about the POSIX NFA, though. Of course, no one would use Perl RExen if they were POSIX NFAs, as they'd be too slow. But you knew that :-).
-dlc
| [reply] |
Just an addendum to what tye had said, in case someone who hasn't read Death to Dot Star! wanders in here.
The first alternative doesn't fail. It matches zero or more occurrences of 'ab' -- and, in fact, after the first 'a' in $text, there are exactly zero occurrences of 'ab'. So it matches.
Technically, it does look for as many matches as possible (it's greedy), but finding none, it's satisfied to match, successfully, none.
Tricky, eh? | [reply] |
$text = 'I like donkeys';
if ($text =~ /a*((ab)*|b*)/) { print "Success ($1)\n"; }
What you've written there is:
$text =~ / a* # zero or more of 'a'
( # followed by
(ab)* # zero or more of 'ab'
| # or
b* # zero or more of 'b'
)/x
So please, less of this 'perl is dead'. Go and read through perlre. | [reply] [d/l] [select] |