redbull2012 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have some doubts about RegEx engine in Perl. The thing is, i know that there are fundamentally 2 RegEx engine NFA and DFA. The source of my info is in below link, i hope its not obsolete yet: master regular expression pdf - Chapter 4. So according to that, Perl is based on NFA engine. But when i check below matching:

$_ = "The first recorded efforts to reach Everest's summit were made b +y British mountaineers "; /summit|Everest|mountain/; print $&; #Result is "Everest"

For NFA the engine will move the control over the RegEx, encounter with the alternation, the engine will check it in turn, so "summit" is checked first and matched. At this point overall match is achieved and the engine should stop but no. The result is most likely from the DFA engine as it will move the control over the target text. First the engine will look into the target string "The first ....", since it find the "Everest" first and the RegEx satisfies that -> Overall match achived! Even if the engine will choose the leftmost match. Then "summit" should be evaluated once? So I try to use a second approach:

$_ = "The first recorded efforts to reach Everest's summit were made b +y British mountaineers "; /summit(?{print "11"})|Everest(?{print "22"})|mountain(?{print "33"})/ +; print $&; #22Everest

The result shows that "summit" is not evaluated! Did i misunderstand something here about the "alteration" or about the RegEx engine? Or Perl is not NFA anymore but Hybrid NFA + DFA? Dear Monks, please enlighten me.

Thanks in advance!

Replies are listed 'Best First'.
Re: Is RegEx in Perl not NFA anymore?
by dasgar (Priest) on Oct 19, 2016 at 18:41 UTC

    That PDF link in your post looks like it could be for the full (or partial) copy of a published book. I'm not a lawyer, but I believe that posting that kind of link may not be compliant with copyright laws.

    Look at Chapter 4 that describes NFA and DFA engines. A short way into that chapter, there is a section called "Match Basics" that covers what is common with both engines. In particular, look at the first rule about earliest match wins. It describes that the regex engine looks at the beginning of the string to apply the regex there. If it doesn't match, it moves to the next character and tries the regex again. Both NFA and DFA will do this.

    If you install Regexp::Debugger, you can try your sample text and regex to walk through what the regex engine is doing. You'll see that it is applying that first match wins rule to your string.

    I could be wrong, but I believe that your sample text and regex really isn't a good example to use to determine if a regex engine is NFA or DFA.

      Thanks a lot for your reminder. I have updated it into a google search

        I have updated it into a google search

        For the PDF of the book. Why not just link to ISBN 9780596528126 or the O'reilly page?

        Principle of Least Astonishment: Any language that doesn’t occasionally surprise the novice will pay for it by continually surprising the expert

Re: Is RegEx in Perl not NFA anymore?
by Laurent_R (Canon) on Oct 19, 2016 at 20:20 UTC
    What you have in your example is a simple "left-most match wins" behavior, which is common to both NFA and DFA engines (and their variants): the match that begins earliest wins.

    In Perl 6, you have two types of alternations (with || and |): one in which the order of the regex alternates matter, and one in which the longest match wins, but, even then, that applies only to matches that start on the same atom; you still have with both alternation variants the overriding rule that it is the match that begins earliest which wins.

    Perl (and PCRE) has had an NFA engine for very long, and I don't think there has been any change to that.

    The mere fact that you can have lazy (or frugal or non-greedy) quantifiers is an almost certain proof that it is a traditional NFA engine, because they are just not possible with a DFA engine.

Re: Is RegEx in Perl not NFA anymore?
by perldigious (Priest) on Oct 19, 2016 at 19:23 UTC

    First a quick side note from perlvar regarding your use of $&.

    Traditionally in Perl, any use of any of the three variables $` , $& or $'... imposed a considerable performance penalty... so generally the use of these variables has been discouraged.

    So it's probably better to use capturing parenthesis and just get in to the habit of not using the three variables above. That said, I was perplexed by the behavior shown for your first match attempt as well. I tried the following out of curiosity.

    $_ = "The first recorded efforts to reach Everest's summit were made b +y British mountaineers "; /(summit|Everest|mountain)/; print "$1\n"; /(summit|mountain)/; print "$1\n"; /(Everest|mountain|summit)/; print "$1\n";

    Which gave the following output.

    Everest summit Everest

    This indicated to me that the order of the alternatives separated by | likely doesn't matter. Taking dasgar's other advice and trying Regexp::Debugger shows exactly what the regex engine is doing. Stepping through character by character, starting from the left, and looking right from there to see if any of the | separated alternatives have a complete match. It does check the leftmost alternative first, but it doesn't go all the way through the string with it before checking the next alternative. So in other words, "Everest" gets matched because it comes first (is leftmost) in the string.

    Unfortunately, I don't know if perl's regex engine is NFA or DFA (or some hybrid), or even what the difference is since this is the first time I've heard those terms used.

    UPDATE:

    Down a rabbit Google hole perldigious went. Back he came with more questions and no answers... and also a bit of a headache.

    Hacker News
    Russ Cox paper He is sort of scolding Perl, Python, and Ruby, but as far as I can tell he likes Go and spares it from the same analysis... hmm.
    ars technica
    StackOverflow

    I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
    I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious
      ... the order of the alternatives separated by | likely doesn't matter.

      Further to Laurent_R's ++post: The ordered alternation "first match wins" behavior of Perl 5 can be seen here:

      c:\@Work\Perl\monks>perl -wMstrict -le "my $s = 'xxxABCDEyyy'; ;; print qq{captured '$1'} if $s =~ m{ (ABC|ABCD|ABCDE) }xms; " captured 'ABC'


      Give a man a fish:  <%-{-{-{-<

        Yes, AnomalousMonk, that's exactly it.

        Just for the sake of completeness, these are the two variants of alternations in Perl 6, adapting your example:

        > my $s = 'xxxABCDEyyy'; xxxABCDEyyy > say "captured $/" if $s ~~ / ABC | ABCD | ABCDE /; captured ABCDE > say "captured $/" if $s ~~ / ABC || ABCD || ABCDE /; captured ABC
        With a single pipe, the longest match wins (which means, BTW, that the engine must try all possibilities to figure out which is the best); with a double pipe, the first match wins.

        But that holds only if all possible matches start on the same atom:

        > say "captured $/" if $s ~~ / xABC | ABCD | ABCDE /; captured xABC
        Here, even though "ADCDE" is a longer match, "xABC" wins because it starts earlier in the string.

        Indeed, thanks AnomalousMonk, that was poor wording on my part. I should have added an "in this case" or something similar at the end since I meant that statement to apply to the specific alternatives being checked for a match against the specific string given, not a general statement to say "the order of such alternatives never matters". I do go on to explain it checks the leftmost alternative first.

        For some reason I originally thought it would drag the first (leftmost) alternative through the entire string before trying the second alternative and so on. Probably a silly thing to have ever thought, especially when considering groups of alternatives contained within more complicated regular expressions.

        I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
        I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious
Re: Is RegEx in Perl not NFA anymore?
by Erez (Priest) on Oct 20, 2016 at 08:06 UTC

    The perl regex algorithm/engine is neither DFA nor NFA. The description in Friedl's book is incorrect, and he later have explained that in his website: On the Terms “NFA,” “DFA,” and “Regular Expression”.
    In a nutshell, he have used "NFA" to describe regex implementations that have backtracking or backreference, such as /(foo)\1/ matchin "foofoo".

    The Perl engine (or almost any other regex implementation out there) cannot be any type of "FA" as it uses backtracking, either recursively or iteratively.

    Principle of Least Astonishment: Any language that doesn’t occasionally surprise the novice will pay for it by continually surprising the expert

      The perl regex algorithm/engine is neither DFA nor NFA.
      Yeah, Erez, it's good to point this out (++). By strict standards, it is true that the Perl regex engine is neither NFA, nor DFA. In fact, the Perl regex engine is not even doing regular expressions (at least not in the sense of academic regular expressions, as they were defined some 70 years ago). Perl is using highly irregular expressions. And even more so with Perl 6 (which explicitly states it is using the "regex" term, rather than "regular expressions", because it is even further away from the original regular expressions, even if it broadly works according to the same principles).

      Yet, if we are to accept the broader sense of regexes as usually meant nowadays in common programming languages, and more broadly in practical computing, and coming back to the specific question in the original post, I think it can be said that the Perl regex engine is broadly much closer to NFA than to DFA. That's what I meant in my previous post, only that, I did not mean to go in the details of theoretical distinctions which are probably very important for research but of relatively little significance for practical computing.

      This is the description from his book so I would say he clarified the issue, not corrected–

      The true mathematical and computational meaning of “NFA” is different from what is commonly called an “NFA regex engine.” In theory, NFA and DFA engines should match exactly the same text and have exactly the same features. In practice, the desire for richer, more expressive regular expressions has caused their semantics to diverge.