Re: Is RegEx in Perl not NFA anymore?
by dasgar (Priest) on Oct 19, 2016 at 18:41 UTC
|
That PDF link in your post looks like it could be for the full (or partial) copy of a published book. I'm not a lawyer, but I believe that posting that kind of link may not be compliant with copyright laws.
Look at Chapter 4 that describes NFA and DFA engines. A short way into that chapter, there is a section called "Match Basics" that covers what is common with both engines. In particular, look at the first rule about earliest match wins. It describes that the regex engine looks at the beginning of the string to apply the regex there. If it doesn't match, it moves to the next character and tries the regex again. Both NFA and DFA will do this.
If you install Regexp::Debugger, you can try your sample text and regex to walk through what the regex engine is doing. You'll see that it is applying that first match wins rule to your string.
I could be wrong, but I believe that your sample text and regex really isn't a good example to use to determine if a regex engine is NFA or DFA.
| [reply] |
|
|
| [reply] |
|
|
| [reply] |
Re: Is RegEx in Perl not NFA anymore?
by Laurent_R (Canon) on Oct 19, 2016 at 20:20 UTC
|
What you have in your example is a simple "left-most match wins" behavior, which is common to both NFA and DFA engines (and their variants): the match that begins earliest wins.
In Perl 6, you have two types of alternations (with || and |): one in which the order of the regex alternates matter, and one in which the longest match wins, but, even then, that applies only to matches that start on the same atom; you still have with both alternation variants the overriding rule that it is the match that begins earliest which wins.
Perl (and PCRE) has had an NFA engine for very long, and I don't think there has been any change to that.
The mere fact that you can have lazy (or frugal or non-greedy) quantifiers is an almost certain proof that it is a traditional NFA engine, because they are just not possible with a DFA engine.
| [reply] |
Re: Is RegEx in Perl not NFA anymore?
by perldigious (Priest) on Oct 19, 2016 at 19:23 UTC
|
First a quick side note from perlvar regarding your use of $&.
Traditionally in Perl, any use of any of the three variables $` , $& or $'... imposed a considerable performance penalty... so generally the use of these variables has been discouraged.
So it's probably better to use capturing parenthesis and just get in to the habit of not using the three variables above. That said, I was perplexed by the behavior shown for your first match attempt as well. I tried the following out of curiosity.
$_ = "The first recorded efforts to reach Everest's summit were made b
+y British mountaineers ";
/(summit|Everest|mountain)/;
print "$1\n";
/(summit|mountain)/;
print "$1\n";
/(Everest|mountain|summit)/;
print "$1\n";
Which gave the following output.
Everest
summit
Everest
This indicated to me that the order of the alternatives separated by | likely doesn't matter. Taking dasgar's other advice and trying Regexp::Debugger shows exactly what the regex engine is doing. Stepping through character by character, starting from the left, and looking right from there to see if any of the | separated alternatives have a complete match. It does check the leftmost alternative first, but it doesn't go all the way through the string with it before checking the next alternative. So in other words, "Everest" gets matched because it comes first (is leftmost) in the string.
Unfortunately, I don't know if perl's regex engine is NFA or DFA (or some hybrid), or even what the difference is since this is the first time I've heard those terms used.
UPDATE:
Down a rabbit Google hole perldigious went. Back he came with more questions and no answers... and also a bit of a headache.
Hacker News
Russ Cox paper He is sort of scolding Perl, Python, and Ruby, but as far as I can tell he likes Go and spares it from the same analysis... hmm.
ars technica
StackOverflow
I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious
| [reply] [d/l] [select] |
|
|
c:\@Work\Perl\monks>perl -wMstrict -le
"my $s = 'xxxABCDEyyy';
;;
print qq{captured '$1'} if $s =~ m{ (ABC|ABCD|ABCDE) }xms;
"
captured 'ABC'
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
|
> my $s = 'xxxABCDEyyy';
xxxABCDEyyy
> say "captured $/" if $s ~~ / ABC | ABCD | ABCDE /;
captured ABCDE
> say "captured $/" if $s ~~ / ABC || ABCD || ABCDE /;
captured ABC
With a single pipe, the longest match wins (which means, BTW, that the engine must try all possibilities to figure out which is the best); with a double pipe, the first match wins.
But that holds only if all possible matches start on the same atom:
> say "captured $/" if $s ~~ / xABC | ABCD | ABCDE /;
captured xABC
Here, even though "ADCDE" is a longer match, "xABC" wins because it starts earlier in the string.
| [reply] [d/l] [select] |
|
|
Indeed, thanks AnomalousMonk, that was poor wording on my part. I should have added an "in this case" or something similar at the end since I meant that statement to apply to the specific alternatives being checked for a match against the specific string given, not a general statement to say "the order of such alternatives never matters". I do go on to explain it checks the leftmost alternative first.
For some reason I originally thought it would drag the first (leftmost) alternative through the entire string before trying the second alternative and so on. Probably a silly thing to have ever thought, especially when considering groups of alternatives contained within more complicated regular expressions.
I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious
| [reply] |
Re: Is RegEx in Perl not NFA anymore?
by Erez (Priest) on Oct 20, 2016 at 08:06 UTC
|
The perl regex algorithm/engine is neither DFA nor NFA. The description in Friedl's book is incorrect, and he later have explained that in his website: On the Terms “NFA,” “DFA,” and “Regular Expression”.
In a nutshell, he have used "NFA" to describe regex implementations that have backtracking or backreference, such as /(foo)\1/ matchin "foofoo". The Perl engine (or almost any other regex implementation out there) cannot be any type of "FA" as it uses backtracking, either recursively or iteratively.
Principle of Least Astonishment: Any language that doesn’t occasionally surprise the novice will pay for it by continually surprising the expert
| [reply] [d/l] |
|
|
The perl regex algorithm/engine is neither DFA nor NFA.
Yeah, Erez, it's good to point this out (++). By strict standards, it is true that the Perl regex engine is neither NFA, nor DFA. In fact, the Perl regex engine is not even doing regular expressions (at least not in the sense of academic regular expressions, as they were defined some 70 years ago). Perl is using highly irregular expressions. And even more so with Perl 6 (which explicitly states it is using the "regex" term, rather than "regular expressions", because it is even further away from the original regular expressions, even if it broadly works according to the same principles).
Yet, if we are to accept the broader sense of regexes as usually meant nowadays in common programming languages, and more broadly in practical computing, and coming back to the specific question in the original post, I think it can be said that the Perl regex engine is broadly much closer to NFA than to DFA. That's what I meant in my previous post, only that, I did not mean to go in the details of theoretical distinctions which are probably very important for research but of relatively little significance for practical computing.
| [reply] |
|
|
| [reply] |