Re: Checking whether a string is a prefix of any string matching a given regular expression

In terms of Perl regular expressions, I believe you're thinking about the matching process the wrong way. The way you're asking the question implies that the regular expression provides a road map for the engine to track along the input string, character for character, to determine a match. That's how you could "reach the end of the (input) string". However, the matching process is done the other way around: The regular expression engine tracks along the regular expression, using the input string as its "map".

I think that means the question you're asking about reaching the end of a prefix string can't really be answered, because the Perl regular expression engine never asks it. As an example, take the expression /^abc(?<=d)/, which can never match any string. If your input string starts with "abc", the Perl regex engine will track the expression successfully up to ^abc, moving all the way through the prefix "abc". Although the "end" of the prefix is reached, there is no way to realize that the full string cannot match without continuing to step through the regular expression.

The distinction of matching by moving along the input string vs. moving along the regular expression is the difference between a DFA regular expression engine and an NFA regular expression engine. My understanding is that the NFA engine is by far the most popular version, simply because the DFA cannot perform many of the more popular regular expression features, like backreferences or look-around.

Although some programs (like grep) have hybrid implementations that use both the DFA and the NFA technique, Perl I believe is exclusively NFA. Meaning that the engine does not, in fact, have the information necessary to answer your question, even for simpler regular expressions that omit backrefences or look-around.

The more interesting question might be whether Ruby can do this. I have no inkling what kind of engine it uses.

I highly recommend Mastering Regular Expressions by Jeffrey Friedl. Almost everybody comes away from it saying "Well, I thought I understood regular expressions, but I didn't." I also hope I got most of this right, and that the many wise monks out there can correct any mistake I might have made.

Comment on Re: Checking whether a string is a prefix of any string matching a given regular expression Download Code

Replies are listed 'Best First'.
Re^2: Checking whether a string is a prefix of any string matching a given regular expression by dbrockman (Initiate) on Jul 01, 2005 at 07:21 UTC
You say that the question of whether the end of the input string was reached cannot be answered because “the Perl regular expression engine never asks it.” I believe that it, in fact, does. It obviously tracks along both the pattern and the input string. How does it matter which is the “road map” and which is the “road”? Your example of trying to match the string `"abc"` against the pattern `/^abc(?<=d)/` is interesting. First of all, for the purpose of my original question, I don't consider this matching process to “reach the end of the input string.” I should have made this clear. For the engine to “reach the end” in the sense I intended, it would have to consume all input characters and then try to consume another one. That is, it has to care about a potential suffix. Maybe I should have said “bump into the end of the input” rather than the less evocative “reach the end of the input” (although I do believe I used the verb “hit” once). In your example, the engine would give up after consuming all three characters in the input string and seeing that the lookbehind fails. It would not try to consume another character and thus not “bump into” the end of the input string. So using the conjectured implication I described in my original post (i.e., string does not match and engine never bumped into the end ⇒ string is not a prefix of any matching string), I would predict that the input string is not the prefix of any matching string, and I would be correct—because, as you said, in this case there are no matching strings! I am not familiar with the DFA and NFA techniques and their relative merits, but frankly, I don’t see how those details are relevant. You say that Perl does not have the information necessary to answer my question, which makes me wonder what you think that information is. (Surely it must know at any given time which part of the input string it is working with, and whether it ever tried to read past the end of the input.) My reasoning is as follows: If the engine decides not to look at some suffix of the input, that can only mean that the match attempt has already failed and that the engine will then start backtracking. Hence, if there is some character in the input string that the engine never looks at, then the string certainly does not match (but not the other way around—when the engine looks at every character, the match can clearly either succeed or fail). From this it follows that if (a) the string does not match, and (b) the engine never tries to look past the end of the string, then adding characters to the end of the string cannot make it start matching (because the engine does not even look there). Does that make sense to you? Please let me know if you think I’m being unclear on some point.	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re^2: Checking whether a string is a prefix of any string matching a given regular expression
by dbrockman (Initiate) on Jul 01, 2005 at 07:21 UTC

You say that the question of whether the end of the input string was reached cannot be answered because “the Perl regular expression engine never asks it.” I believe that it, in fact, does. It obviously tracks along both the pattern and the input string. How does it matter which is the “road map” and which is the “road”?

Your example of trying to match the string "abc" against the pattern /^abc(?<=d)/ is interesting. First of all, for the purpose of my original question, I don't consider this matching process to “reach the end of the input string.” I should have made this clear.

For the engine to “reach the end” in the sense I intended, it would have to consume all input characters and then try to consume another one. That is, it has to care about a potential suffix. Maybe I should have said “bump into the end of the input” rather than the less evocative “reach the end of the input” (although I do believe I used the verb “hit” once).

In your example, the engine would give up after consuming all three characters in the input string and seeing that the lookbehind fails. It would not try to consume another character and thus not “bump into” the end of the input string. So using the conjectured implication I described in my original post (i.e., string does not match and engine never bumped into the end ⇒ string is not a prefix of any matching string), I would predict that the input string is not the prefix of any matching string, and I would be correct—because, as you said, in this case there are no matching strings!

I am not familiar with the DFA and NFA techniques and their relative merits, but frankly, I don’t see how those details are relevant. You say that Perl does not have the information necessary to answer my question, which makes me wonder what you think that information is. (Surely it must know at any given time which part of the input string it is working with, and whether it ever tried to read past the end of the input.)

My reasoning is as follows: If the engine decides not to look at some suffix of the input, that can only mean that the match attempt has already failed and that the engine will then start backtracking. Hence, if there is some character in the input string that the engine never looks at, then the string certainly does not match (but not the other way around—when the engine looks at every character, the match can clearly either succeed or fail). From this it follows that if (a) the string does not match, and (b) the engine never tries to look past the end of the string, then adding characters to the end of the string cannot make it start matching (because the engine does not even look there).

Does that make sense to you? Please let me know if you think I’m being unclear on some point.

[reply]
[d/l]
[select]