princepawn has asked for the wisdom of the Perl Monks concerning the following question:

Given this code,
$text = 'ab'; if ($text =~ /a*((ab)*|b*)/) { print $1,$/; }

I did not see anything printed as $1.

I had reasoned that a* would match the 'a' in the input text and that the first alternative /(ab*)/ would fail but the second alternative /b*/ would match the 'b' and thus $1 would have 'b', but my print statement did not print any value.

Any input guys, gals and/or other?, Alternative message title: "Alert! The Perl Regular Expression Engine is Broken! Fix Immediately!"

Replies are listed 'Best First'.
(tye)Re: Regular Expression matching question
by tye (Sage) on Oct 24, 2000 at 22:33 UTC

    First, I changed your grouping so I could see what was really happening:

    % perl -de 0 DB<1> x "ab" =~ /(a*)((?:ab)*|b*)/ 0 'a' 1 ''
    So Perl finds a match quite easily and gives up; fair enough. Regexen are supposed to be greedy but that doesn't include trying a second alternative (something after a veritcal bar) to see if it can match a longer string that way. So lets force Perl to be less lazy by forcing a match to the end of the string:
    DB<2> x "ab" =~ /(a*)((?:ab)*|b*)$/ 0 'a' 1 'b'
    Does that answer the question well enough for you?

            - tye (but my friends call me "Tye")
(Ovid) Re: Regular Expression matching question
by Ovid (Cardinal) on Oct 24, 2000 at 22:46 UTC
    You've found an issue that bites many programmers (including me). At first glance, the star, being greedy, slurps up everything. However, when you have an alternation, the regex will take the first successful match. Thus, the (ab)* will successfully match nothing and the regex is satisfied. If you reverse the (ab)* and (b)*, the star, being greedy, will match the "b". Try the following code:
    $text = 'ab'; if ($text =~ /(a*)((?:ab)*|b*)/) { print "'$1', '$2' \n"; } if ($text =~ /(a*)(b*|(?:ab)*)/) { print "'$1', '$2' \n"; }
    The output is as follows:
    'a', '' 'a', 'b'
    Incidentally, Perl uses a traditional NFA engine for regex matching. If it used the POSIX-NFA engine or a DFA engine, your regex would work as you expect because those engines try to find the longest match that satisfies the regex. If you have experience with those engines, Perl may cause you some confusion.

    Cheers,
    Ovid

    P.S. I'm glad to see you have a sense of humor about the flack you took :)

    Update: Oops. dchetlin is right. I was typing too fast and didn't consider the DFA issue.

    Join the Perlmonks Setiathome Group or just go the the link and check out our stats.

      • Incidentally, Perl uses a traditional NFA engine for regex matching. If it used the POSIX-NFA engine or a DFA engine, your regex would work as you expect because those engines try to find the longest match that satisfies the regex.

      Weeeeell, kinda. A DFA engine in general can't do backreferences, so he wouldn't have been able to test the $1 thing to begin with...

      You're right about the POSIX NFA, though. Of course, no one would use Perl RExen if they were POSIX NFAs, as they'd be too slow. But you knew that :-).

      -dlc

RE: Regular Expression matching question
by chromatic (Archbishop) on Oct 24, 2000 at 22:39 UTC
    Just an addendum to what tye had said, in case someone who hasn't read Death to Dot Star! wanders in here.

    The first alternative doesn't fail. It matches zero or more occurrences of 'ab' -- and, in fact, after the first 'a' in $text, there are exactly zero occurrences of 'ab'. So it matches.

    Technically, it does look for as many matches as possible (it's greedy), but finding none, it's satisfied to match, successfully, none.

    Tricky, eh?

Re: Regular Expression matching question
by quidity (Pilgrim) on Oct 24, 2000 at 22:43 UTC

    The regular expression you have entered will match anything.

    $text = 'I like donkeys'; if ($text =~ /a*((ab)*|b*)/) { print "Success ($1)\n"; }

    What you've written there is:

    $text =~ / a* # zero or more of 'a' ( # followed by (ab)* # zero or more of 'ab' | # or b* # zero or more of 'b' )/x

    So please, less of this 'perl is dead'. Go and read through perlre.