stratkid has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I'm a pathetic newbie trying to slog through the "Camel", with reasonable success so far. Anyway, I cannot figure out one of the examples. From the book: "Then we match any number of characters with .* - but we place another code subpattern in between the . and the * so we can count how many times . matches." Here is the example:

#!/usr/bin/perl
$_ = "lothlorien";
m/ (?{ $i = 0 }) (. (?{ $i++ }) )* lori /x;
$i=10?

So, the increment of $i is right in between the . and the quantifier, * - but $i=10? If it increments whenever the . matches, then I thought $i should be 7. The '.*' would match the whole string, then back up one at a time till the rest of the pattern could match the 'lori' part.

Can someone enlighten me?
PS. The "code sub-pattern" is zero-width; is this why it can sit in a pair of parentheses that is being quantified (by the *)? This seems very weird to me. ~stratkid

Replies are listed 'Best First'.
Re: code within a regular expression
by samtregar (Abbot) on May 02, 2002 at 04:11 UTC
    Why would you expect $i to be 7? You didn't say. I would expect $i to be 10. The reason is that .* will match every character it comes across. The * is a "greedy" quantifier. This means that .* matches all the way to the end of the string, incrementing $i ten times. After that the engine starts backtracking to complete the match, eventually taking 6 characters away from .* so that lori can match. Backtracking never runs the code block again, and it does not undo the increments already made, so $i stays at 10.
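
To see this for yourself, here is a minimal sketch that runs the Camel example and also records pos() each time the code block fires. The @positions array and the $matched variable are names made up for illustration; they are not part of the book's example.

```perl
use strict;
use warnings;

our $i;            # package variable, as in the Camel example
our @positions;    # hypothetical helper: pos() seen at each increment

$_ = "lothlorien";
my $matched = m/
    (?{ $i = 0; @positions = () })            # reset at the start of a match attempt
    ( . (?{ $i++; push @positions, pos() }) )*  # fires once per character consumed
    lori                                       # forces the engine to backtrack
/x;

print "i = $i\n";
print "positions: @positions\n";
```

Run as written, this should report i = 10, with the recorded positions climbing from 1 to 10. The block fires once per character on the way to the end of the string and never again while the engine backs up to make room for lori.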

    Of course, this is a vast simplification. If you're interested in truly understanding regular expressions, I suggest you pick up a copy of "Mastering Regular Expressions" from O'Reilly Press. The Camel Book, for all its value, is not nearly enough to allow you to fully grok the regular expression in all its glory.

    -sam

    Side note: I wonder if the code above will always set $i to 10. Couldn't some future regex optimization intuit the limit on the .* section and end up setting $i to 4?
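
For what it's worth, perlre documents a way to get a count that does follow backtracking: localize the counter inside the code block, so each increment is undone when the engine backs up past it. A sketch along those lines, with $result as a hypothetical variable used to copy the value out before the localized values are unwound at the end of the match:

```perl
use strict;
use warnings;

our ($i, $result);   # local() needs package variables, not lexicals

$_ = "lothlorien";
my $matched = m/
    (?{ $i = 0 })
    ( . (?{ local $i = $i + 1 }) )*   # localized increment, undone on backtracking
    lori
    (?{ $result = $i })               # save the count once lori has matched
/x;

print "result = $result\n";
```

This should print result = 4, the number of characters .* still holds when lori finally matches.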

      but your explanation was succinct, so now I understand...

      My rationale was that I thought the first greedy match for .* (the whole string) would be one match. Then it'd be forced to back up a letter at a time all the way to the "h" (so 'lori' could match too), and each time would also be another match, thus equaling 7.

      Now if I can ever figure out what the hell dynamic/static scoping is. :-)
      ~stratkid

(crazyinsomniac) Re: code within a regular expression
by crazyinsomniac (Prior) on May 02, 2002 at 03:04 UTC