in reply to regex backtracking question

Think of it this way:  \w{2,}? matches the minimum possible of two or more 'somethings', in this case  \w characters. On its first try it matches two characters (and prints them). The match is then intentionally 'failed', so  \w{2,}? gets another chance to match beginning at the point at which it previously attempted the match (not at that point + 1), this time matching three characters. And so on through four, five, ... Only when the regex engine runs out of  \w thingies to match is it forced to move the initial match point forward and try some more, starting again with two characters.

Contrast with the behavior of  "ABCDEF" =~ /(\w{2,})(?{print "$1\n"})(?!)/; (no  ? quantifier modifier: greedy matching).

In the original  "ABCDEF" =~ /([A-Z]{3})(?{print "$1\n"})(?!)/; case,  [A-Z]{3} can only match exactly three characters, so the regex engine must advance the trial match point upon failure; no backtracking (or forward-tracking, if that's even a correct term, in the case of lazy matching) is possible: exactly three characters and some other stuff were required to match at a given point; the match failed; move on — nothing to see here, folks.

Update: Sorry: many small additions, wording changes, etc., which should not have altered essential meaning.

Replies are listed 'Best First'.
Re^2: regex backtracking question
by aeqr (Novice) on Apr 24, 2014 at 01:11 UTC
    Thanks for your answer, the thing I don't understand is how it makes the difference when the engine fails. In both cases, the match occurred. In the first case,how is it possible that it doesn't try to match ABC again? After all, it matched once, failed and thus gets another chance to match exactly 3 characters at the position it failed.

    Is it different because \w{2,}? has a sort of counter that remembers how long was the last attempt? And so it extends the match when it tries to run again because matching a longer chain is seen as a "new match that didn't occur here before"?

      The regex engine (RE) always tries to make the first overall match it can at the leftmost point in the string at which a match is possible. In the case of  \w{3} the leftmost point is the first character in the string, 'A' in this case. After taking exactly three characters, the RE tries to satisfy  (?!) (discounting the  (?{ code }) bit, which doesn't figure in to the match) to complete an overall match, which is impossible. The only choice the RE has is to advance the leftmost match point, in this case to 'B', and then try, try again.

      In the case of  \w{2,} (either lazy or greedy), the RE has other options: just take one more/give up one  \w character (if possible) and try again for an overall match, which fails, ... The "sort of counter that remembers" is the Non-deterministic Automaton backtracking mechanism of the NFA RE.

      Gotta go...

        Ok, I got it, thanks for the explanation.