Regular Expression matching question

princepawn has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
(tye)Re: Regular Expression matching question by tye (Sage) on Oct 24, 2000 at 22:33 UTC
First, I changed your grouping so I could see what was really happening: `% perl -de 0 DB<1> x "ab" =~ /(a)((?:ab)\|b)/ 0 'a' 1 ''` [download] So Perl finds a match quite easily and gives up; fair enough. Regexen are supposed to be greedy but that doesn't include trying a second alternative (something after a veritcal bar) to see if it can match a longer string that way. So lets force Perl to be less lazy by forcing a match to the end of the string: `DB<2> x "ab" =~ /(a)((?:ab)\|b)$/ 0 'a' 1 'b'` [download] Does that answer the question well enough for you? - tye (but my friends call me "Tye")	[reply] [d/l] [select]
(Ovid) Re: Regular Expression matching question by Ovid (Cardinal) on Oct 24, 2000 at 22:46 UTC
You've found an issue that bites many programmers (including me). At first glance, the star, being greedy, slurps up everything. However, when you have an alternation, the regex will take the first successful match. Thus, the (ab)* will successfully match nothing and the regex is satisfied. If you reverse the (ab)* and (b), the star, being greedy, will match the "b". Try the following code: `$text = 'ab'; if ($text =~ /(a)((?:ab)\|b)/) { print "'$1', '$2' \n"; } if ($text =~ /(a)(b\|(?:ab))/) { print "'$1', '$2' \n"; }` [download] The output is as follows: `'a', '' 'a', 'b'` [download] Incidentally, Perl uses a traditional NFA engine for regex matching. If it used the POSIX-NFA engine or a DFA engine, your regex would work as you expect because those engines try to find the longest match that satisfies the regex. If you have experience with those engines, Perl may cause you some confusion. Cheers, Ovid P.S.* I'm glad to see you have a sense of humor about the flack you took :) Update: Oops. dchetlin is right. I was typing too fast and didn't consider the DFA issue. Join the Perlmonks Setiathome Group or just go the the link and check out our stats.	[reply] [d/l] [select]
(dchetlin: DFA) RE: (Ovid) Re: Regular Expression matching question by dchetlin (Friar) on Oct 25, 2000 at 01:20 UTC
Incidentally, Perl uses a traditional NFA engine for regex matching. If it used the POSIX-NFA engine or a DFA engine, your regex would work as you expect because those engines try to find the longest match that satisfies the regex. Weeeeell, kinda. A DFA engine in general can't do backreferences, so he wouldn't have been able to test the $1 thing to begin with... You're right about the POSIX NFA, though. Of course, no one would use Perl RExen if they were POSIX NFAs, as they'd be too slow. But you knew that :-). -dlc	[reply]
RE: Regular Expression matching question by chromatic (Archbishop) on Oct 24, 2000 at 22:39 UTC
Just an addendum to what tye had said, in case someone who hasn't read Death to Dot Star! wanders in here. The first alternative doesn't fail. It matches zero or more occurrences of 'ab' -- and, in fact, after the first 'a' in $text, there are exactly zero occurrences of 'ab'. So it matches. Technically, it does look for as many matches as possible (it's greedy), but finding none, it's satisfied to match, successfully, none. Tricky, eh?	[reply]
Re: Regular Expression matching question by quidity (Pilgrim) on Oct 24, 2000 at 22:43 UTC
The regular expression you have entered will match anything. `$text = 'I like donkeys'; if ($text =~ /a((ab)\|b)/) { print "Success ($1)\n"; }` [download] What you've written there is: `$text =~ / a # zero or more of 'a' ( # followed by (ab)* # zero or more of 'ab' \| # or b* # zero or more of 'b' )/x` [download] So please, less of this 'perl is dead'. Go and read through perlre.	[reply] [d/l] [select]