GrandFather has asked for the wisdom of the Perl Monks concerning the following question:

I wanted a regex to match milestone version numbers where the version number could have two or three numbers and an optional milestone suffix. The following test code partially works:

for my $entry (qw(1_0 1_0_1 1_0_beta 1_0_1_beta)) {
    print "$entry\n" if $entry =~ /^\d+ _\d+ (?> (?:_\d+)? \w+) $/x;
}

Prints:

1_0_1
1_0_beta
1_0_1_beta

The 1_0_1 is unexpected (and unwanted). I would have thought the no-backtracking section would prevent matching the trailing _1, because (?:_\d+) would match and the backtracking suppression would then cause the \w+ part of the match to fail. What have I misunderstood?

True laziness is hard work

Re: Unexpected regular expression match
by ikegami (Patriarch) on Jan 26, 2012 at 00:24 UTC

    You seem to think backtracking cannot occur within (?>PAT). That's not the case at all.

    The purpose of (?>PAT) is to prevent the regex engine from trying to get PAT to match something else once it has already matched something. In short, backtracking through (?>PAT) causes it to fail.
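
    As a minimal sketch of what that means for the pattern in the question: the capture groups below are my addition, purely to show which part of the atomic group ended up matching what. Inside the group, (?:_\d+)? first takes "_1", \w+ then fails, and the engine backtracks within the group, so the optional part gives "_1" back to \w+.

    ```perl
    use strict;
    use warnings;

    # Pattern from the question, with two capture groups added for illustration.
    # The engine is still free to backtrack inside (?>...) until the group
    # as a whole has matched.
    my ($optional, $suffix) =
        '1_0_1' =~ /^\d+ _\d+ (?> ((?:_\d+)?) (\w+) ) $/x;

    print "optional part: '$optional'\n";   # empty: the optional group gave up "_1"
    print "suffix part:   '$suffix'\n";     # "_1": taken by \w+ instead
    ```

    The atomic group only matters once the engine has left it: at that point the choices made inside it are frozen.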

    You want

    /^\d+ _\d+ (?> (?:_\d+)? ) \w+ $/x

    which can also be written as

    /^\d+ _\d+ (?:_\d+)?+ \w+ $/x

    For "1_0_1" =~ /^\d+ _\d+ (?> (?:_\d+)? ) \w+ $/x, everything is straightforward until /\w+/ fails to match. At that point, the regex engine starts to backtrack.

    1. Backtracking through (?>...) causes it to fail (as always).
    2. Backtracking through \d+ causes it to fail (since it previously matched only one digit).
    3. Backtracking through _ causes it to fail (as always).
    4. Backtracking through \d+ causes it to fail (since it previously matched only one digit).
    5. Backtracking through ^ causes it to fail (as always).
    6. The match fails.
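
    The outcome of the steps above can be checked with a quick sketch that runs the test entries from the question through both spellings of the pattern. The variable names are mine, and the possessive form assumes Perl 5.10 or later.

    ```perl
    use strict;
    use warnings;

    my @entries = qw(1_0 1_0_1 1_0_beta 1_0_1_beta);

    # \w+ sits outside the atomic group, so the optional "_N" cannot be given back.
    my $atomic     = qr/^\d+ _\d+ (?> (?:_\d+)? ) \w+ $/x;
    # Same thing written with a possessive quantifier (Perl 5.10+).
    my $possessive = qr/^\d+ _\d+ (?:_\d+)?+ \w+ $/x;

    my @atomic_hits     = grep { $_ =~ $atomic } @entries;
    my @possessive_hits = grep { $_ =~ $possessive } @entries;

    print "atomic:     @atomic_hits\n";
    print "possessive: @possessive_hits\n";
    ```

    Both lists should contain only 1_0_beta and 1_0_1_beta; 1_0_1 is rejected because nothing earlier in the pattern has anything to give back.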

      Once one backtracks through (?>PAT), the regex engine is free to try to match PAT at a different location (or maybe even at the same location) if backtracking ended successfully earlier in the pattern.

      This causes my proposed solution to fail for "2_34_5".

      In theory, this can be fixed by preventing backtracking through the early \d+.

      /^\d++ _ \d++ (?:_ \d+)?+ \w+ $/x
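
      A small sketch of the difference, using "2_34_5" as the problem case (variable names are mine; possessive quantifiers assume Perl 5.10 or later). With a plain \d+ the engine can shrink "34" to "3", let the optional group match nothing, and hand "4_5" to \w+. With \d++ that retreat is not available.

      ```perl
      use strict;
      use warnings;

      # Only the optional group is possessive: the second \d+ can still give back a digit.
      my $loose = qr/^\d+  _ \d+  (?:_ \d+)?+ \w+ $/x;
      # The leading numbers are possessive too, so nothing can be given back.
      my $tight = qr/^\d++ _ \d++ (?:_ \d+)?+ \w+ $/x;

      my $loose_matches = '2_34_5' =~ $loose ? 1 : 0;
      my $tight_matches = '2_34_5' =~ $tight ? 1 : 0;

      my @tight_hits = grep { $_ =~ $tight }
                       qw(1_0 1_0_1 1_0_beta 1_0_1_beta 2_34_5);

      print "loose matches 2_34_5: $loose_matches\n";
      print "tight matches 2_34_5: $tight_matches\n";
      print "tight accepts: @tight_hits\n";
      ```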

      One can also solve this without any (?>...) at all by being more precise with the definitions.

      /^\d+ _ \d+ (?:_ \d+)? (?!_?\d) \w+ $/x

      Thank you, that makes sense. I did indeed think backtracking could not occur within (?>PAT). This may be the first time I've tried to use (?>PAT) and I can't say that reading the documentation really helped understand where backtracking was being suppressed.

      The quantifier+ (possessive quantifier) syntax is new to me. Any idea when it was introduced (5.10 maybe)? The phrase "give nothing back" in the documentation makes the possessive quantifier (and by implication (?>PAT) ) much easier to understand in my view.

      True laziness is hard work

        I'll make sure to use "give nothing back" in the future.

        5.10.1 did have it, so yes, it was surely introduced in 5.10.0.

Re: Unexpected regular expression match
by i5513 (Pilgrim) on Jan 25, 2012 at 23:37 UTC
    Updated: Well, I am clearly missing something that I need to study. My original (wrong) answer:

    Why do you expect it not to match? '\w' includes '\d' and '_', and you have a '?' which makes the (?:_\d+) group optional.