aeqr has asked for the wisdom of the Perl Monks concerning the following question:

I have learned today that you could force the regex engine to fail and backtrack, for example with this:
"ABCDEF" =~ /([A-Z]{3})(?{print "$1\n"})(?!)/;
trizen explained to me how this regex worked and this is what I understood:

First it matches ABC, then prints it, then fail, then backtracks to the position where the last match occured + 1.

Then when it runs again, it will match BCD this time. Thus this loop prints:

ABC BCD CDE DEF

However based on this reasoning, I struggle to understand the next example:

"ABCDEF" =~ /(\w{2,}?)(?{print "$1\n"})(?!)/;
Here it matches AB first then prints it, then fails then backtracks. But here, instead of matching BC as I would expect it, it matches ABC. So after failing, it does not backtrack to pos+1 like in the previous example. And the loop prints:
AB ABC ABCD ABCDE ABCDEF BC BCD BCDE BCDEF CD CDE CDEF DE DEF EF
Can someone explain me what is wrong in my assumption?

Replies are listed 'Best First'.
Re: regex backtracking question
by AnomalousMonk (Archbishop) on Apr 24, 2014 at 00:49 UTC

    Think of it this way:  \w{2,}? matches the minimum possible of two or more 'somethings', in this case  \w characters. On its first try it matches two characters (and prints them). The match is then intentionally 'failed', so  \w{2,}? gets another chance to match beginning at the point at which it previously attempted the match (not at that point + 1), this time matching three characters. And so on through four, five, ... Only when the regex engine runs out of  \w thingies to match is it forced to move the initial match point forward and try some more, starting again with two characters.

    Contrast with the behavior of  "ABCDEF" =~ /(\w{2,})(?{print "$1\n"})(?!)/; (no  ? quantifier modifier: greedy matching).

    In the original  "ABCDEF" =~ /([A-Z]{3})(?{print "$1\n"})(?!)/; case,  [A-Z]{3} can only match exactly three characters, so the regex engine must advance the trial match point upon failure; no backtracking (or forward-tracking, if that's even a correct term, in the case of lazy matching) is possible: exactly three characters and some other stuff were required to match at a given point; the match failed; move on — nothing to see here, folks.

    Update: Sorry: many small additions, wording changes, etc., which should not have altered essential meaning.

      Thanks for your answer, the thing I don't understand is how it makes the difference when the engine fails. In both cases, the match occurred. In the first case,how is it possible that it doesn't try to match ABC again? After all, it matched once, failed and thus gets another chance to match exactly 3 characters at the position it failed.

      Is it different because \w{2,}? has a sort of counter that remembers how long was the last attempt? And so it extends the match when it tries to run again because matching a longer chain is seen as a "new match that didn't occur here before"?

        The regex engine (RE) always tries to make the first overall match it can at the leftmost point in the string at which a match is possible. In the case of  \w{3} the leftmost point is the first character in the string, 'A' in this case. After taking exactly three characters, the RE tries to satisfy  (?!) (discounting the  (?{ code }) bit, which doesn't figure in to the match) to complete an overall match, which is impossible. The only choice the RE has is to advance the leftmost match point, in this case to 'B', and then try, try again.

        In the case of  \w{2,} (either lazy or greedy), the RE has other options: just take one more/give up one  \w character (if possible) and try again for an overall match, which fails, ... The "sort of counter that remembers" is the Non-deterministic Automaton backtracking mechanism of the NFA RE.

        Gotta go...

Re: regex backtracking question
by AnomalousMonk (Archbishop) on Apr 24, 2014 at 05:02 UTC

    BTW: If you like  (?!) and have access to Perl version 5.10+, check out Special Backtracking Control Verbs in perlre.  (?!) actually compiles to  (*FAIL) now (I think — or maybe it's the other way around).

Re: regex backtracking question
by trizen (Hermit) on Apr 24, 2014 at 10:59 UTC
    You can think of any regular expression in term of some nested for loops.

    Please, consider:
    for my $i (0 .. $eos) { for my $j ($min .. $max) { last if $j > $eos-$i; print substr($str, $i, $j),"\n"; } }
    The following variables specifies the meaning of "ABCDEF" =~ /(\w{2,}?)(?{print "$1\n"})(?!)/;
    my $min = 2; # minimum match length my $max = 32767; # maximum match length my $str = "ABCDEF"; my $eos = length($str);

    Using the same nested for loops, we have another specification for "ABCDEF" =~ /([A-Z]{3})(?{print "$1\n"})(?!)/;
    my $min = 3; # minimum match length my $max = 3; # maximum match length my $str = "ABCDEF"; my $eos = length($str);

    By executing the code, it gives us the same output as the corresponding regular expressions do. I hope this will give you a basic intuition on how backtracking works.
      Thanks for the additional info, I see why it is different now.