Re: Re (tilly) 3: Research ideas

by gryng (Hermit)
on Feb 20, 2001 at 05:50 UTC


in reply to Re (tilly) 3: Research ideas
in thread Research ideas

Ok, I've gotcha now, tilly.

I understand the difference between NFAs and DFAs. In fact, I think it would be nice if you could have a DFA-preference modifier in Perl for your regexes, so that if a regex is DFA-compatible, Perl would use a DFA instead of an NFA (or even, as egrep does, -pretend- it is DFA-compatible and then check the answer with an NFA).
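
Something like this, say (completely made-up syntax, just to sketch the wish -- the flag below does not exist):

    use strict;
    use warnings;
    # use re 'dfa';   # <-- the made-up part: a flag asking for a DFA where possible

    my $line = "foobarbaz";
    print "DFA-able\n" if $line =~ /^(?:foo|bar|baz)+$/;  # no backrefs: a DFA could run this
    print "NFA only\n" if $line =~ /(.+)\1/;              # backref: would fall back to the NFA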

I'm no wiz with regexes though, but what you propose seems very, very difficult! At least, it seems that if you told it to handle even something simple like /vr(o*)*m/ (which, if I'm thinking correctly, would cause a normal NFA to go ga-ga), you would be trying to get the regex engine to realize first that perhaps it should ignore the double *, or (as it would be in more complicated situations) reverse the processing order of the *'s, or even something more 'odd'.
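
To see the kind of blow-up I mean, here is a pattern with the same nested-quantifier shape that really does send a backtracking engine off the deep end (at least on the perls I have tried; each extra 'a' roughly doubles the time):

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $str = 'a' x 25 . 'b';      # cannot match: the trailing b kills it
    my $t0  = time;
    my $hit = $str =~ /^(a+)+$/;   # nested quantifiers: exponentially many ways to fail
    printf "matched: %s, took %.2f seconds\n", ( $hit ? 'yes' : 'no' ), time - $t0;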

And so my point is: even if you can get the regex engine to realize it can do something to avert catastrophe, couldn't many of the 'solutions' the regex engine decides on change which match the NFA chooses? That is, change it from what the original, badly written form asked for? The whole intent of using an NFA is so you can write optimized or tailored regexes that give you speed or choice in your matches (and sometimes both :) ). But wouldn't this sort of optimization make which match the NFA picks sort of non-obvious (in a more non-obvious than normal for regexes way)?

Again, I'm not an expert regexer, but it seems that using a DFA in these sorts of situations (even if a DFA can't be used directly) would be similar -- you would avoid the pitfall of killing performance, but you might suffer by not getting the match you expected.

Well, I'm probably brain-fried right now, so don't pay any attention to me :) .

Ciao,
Gryn

p.s. For those who are in the dark: the main difference between an NFA and a DFA is one of performance and control.

A DFA almost always takes the same amount of time to make a match, and it will always make the same match for equivalent regexes (the longest of the leftmost (er, or the other way around, sorry)).

An NFA is the opposite: it may (often) be much faster than a DFA, but it may also get stuck in a rut and finish sometime after the Sun burns out. Besides the speed improvement, you often get a control advantage. That is, for equivalent regexes (ones that would match the same thing, but don't look exactly the same because of reordering or what-not) you can control which match you want by rearranging your regex (which can't be done with a DFA).
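
For example, Perl's NFA is leftmost-first, so the order of an alternation decides which match you get:

    my $str = "vroom";
    print "$&\n" if $str =~ /vro|vroom/;   # prints "vro"   -- first alternative wins
    print "$&\n" if $str =~ /vroom|vro/;   # prints "vroom" -- reorder to get the long match

A POSIX-style DFA would report "vroom" both times (longest-leftmost), no matter how the alternation is written.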

Replies are listed 'Best First'.
Re (tilly) 5: Research ideas
by tilly (Archbishop) on Feb 20, 2001 at 10:12 UTC
    First of all, you are right: the regex /vr(o*)*m/ makes a naive NFA go gaga into an infinite loop. In fact, I believe that Perl used to break on that exact case.

    Contrary to a naive reading, this is not the same as the match /vr(o*)m/; in fact it gives the same result as /vro*()m/ (that is, $1 winds up matching an empty string just before the m).

    The solution currently used is to keep track of when you start following the same logic during a zero-length match, and to refuse to go down the same path a second time. So before the m the engine tries to match another o, cannot, then tries to go around the loop and match o*, succeeds (with an empty match), then tries to go around the loop again, detects the recursion, and fails back to trying to match the m, which succeeds. (Thereby finding your empty match.) A naive optimization will get this wrong.
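
    A quick check (this is the behaviour described above; $1 comes out empty):

        my $str = "vroooom";
        if ( $str =~ /vr(o*)*m/ ) {
            print "matched, \$1 = '$1'\n";   # matched, $1 = ''
        }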

    My proposed optimization has to be done carefully, but will get it right. What I am suggesting is that you first figure out all of the possible transitions depending on what the next character is, in the order an NFA would visit them, remove the duplicates, and then toss that into a pseudo-state. In figuring out the possible transitions you need to get the zero-length recursion right. However, if you do that correctly, then at no point does my optimization change what match the RE engine will find. It just allows you to "parallelize" the reasoning process and thereby eliminate a possible spot to backtrack to.

    Now for those who are horribly confused, grab Mastering Regular Expressions. In a nutshell the basic problem is this. We have a regular expression. We have a string.

    A DFA engine ("deterministic finite automaton") essentially walks the string once, keeping track at each character of all of the possible places it could be in the RE match. This gives good guaranteed performance, but it makes it hard to figure out what captured parentheses should be, or to implement backreferences (which need to keep track of more information than just where you are in the string and the RE).

    An NFA engine ("nondeterministic finite automaton") just does a recursive brute-force search of possible ways to match the string to the RE. This allows for lots of features (for instance, the difference between .* and .*? is whether you try to match another . first or second), but you can get into cases where there is an exponential explosion in how much work needs to be done to recognize that a string does not match. Perl, and most other scripting languages, use NFA engines. A few use DFAs.
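
    To make the DFA picture concrete, here is a toy sketch (not perl's engine) that runs /^vro*m$/ DFA-style. The whole trick is that %states is a set advanced once per character, so there is never anything to backtrack to:

        use strict;
        use warnings;

        # Hand-built state machine for /^vro*m$/:
        #   0 --v--> 1 --r--> 2 --o--> 2, and 2 --m--> 3 (accepting)
        my %delta = (
            0 => { v => [1] },
            1 => { r => [2] },
            2 => { o => [2], m => [3] },
        );
        my %accept = ( 3 => 1 );

        sub dfa_match {
            my ($str) = @_;
            my %states = ( 0 => 1 );             # the set of live states
            for my $ch ( split //, $str ) {
                my %next;
                for my $s ( keys %states ) {     # advance every live state at once
                    $next{$_} = 1 for @{ $delta{$s}{$ch} || [] };
                }
                %states = %next;
                return 0 unless %states;         # nothing live: fail after one pass
            }
            return 0 + grep { $accept{$_} } keys %states;
        }

        print dfa_match("vroom") ? "match\n" : "no match\n";   # match
        print dfa_match("vrm")   ? "match\n" : "no match\n";   # match (zero o's)
        print dfa_match("vrooz") ? "match\n" : "no match\n";   # no match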

      By twisting my brain into the shape of a donut, I think I have seen a glimpse of what you mean. Let me see if I have it right:

      You propose to turn a regular expression into an ordered set of states, one for each character position (or possibly a no-character position as well?). This is so that you can go through each atom of each set once (that is, with no backtracking, which is the bane of an NFA). This seems (-seems-, to someone who has never written an NFA or a DFA before) very similar to a DFA, except that a DFA would not have the order preference that your ordered sets have; it would just have sets.

      The consequence of that difference is that with a DFA you can go through each set in parallel, but with the proposed NFA change you would still have to recurse backwards and forwards. This is -similar- to backtracking, but seems better, because you are going through a finite number of (ordered) sets, and not a possibly geometrically expanding number of states.

      I may have muddled the nomenclature a bit, sorry :)

      Ciao,
      Gryn

        Sort of, only not so complicated. :-)

        An NFA is already compiled internally into a series of states with transitions between them. What I am proposing to do is take that representation and rewrite part of it using new states that are ordered subsets of the original.

        There are two tricks here.

        The more obvious one is summed up by the phrase, "Put off figuring out why you have done this work as long as possible." That is, if you are going to have to backtrack and then go forward with identical logic, why not make it so that you go forward following the common logic as long as possible? This is next-character discrimination; for instance, an RE to match all of the US state names would be faster because if you saw an A first, you would already have figured out that you only need to look at the states whose names start with A.
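
        Concretely, with the states example, this is the difference between:

            # naive alternation: the engine may try each branch in turn
            my $naive = qr/Alabama|Alaska|Arizona|Maine|Maryland|Montana/;

            # the same matches with the discrimination factored in by hand:
            # one look at the next character rules out half the branches
            my $factored = qr/A(?:labama|laska|rizona)|M(?:aine|aryland|ontana)/;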

        The more subtle and important one is that if you are aggressive about it, you can eliminate a lot of repeated following of the same reasoning. Basically, if you ever arrive at the same place in the string, with the same state information relevant to your match, at the same position in your RE, then the outcome is going to be the same. With a traditional NFA you might arrive at a state, go forward, fail, and backtrack. Then you follow a different line of reasoning forward, reach the same place, and of course do the same work to figure out that you fail again. This is all repeated, useless work; you already did this stuff.

        Alternatively, if it succeeds the first time, it will not backtrack, and again it is irrelevant that there is another way to the same state.

        Either way, the *second* way to the same combination is irrelevant. So, in the process of optimizing, if you see that you can reach the same state by multiple lines of reasoning, you can safely drop the second way of reaching it. By explicitly parallelizing the reasoning process that the NFA will use, you can make it realize that two chains of reasoning will come back to the same state, and so the ordered sets of states can omit duplicates. And that is why you can avoid the combinatorial backtracking disaster!
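
        Here is a toy sketch of that pruning rule (a hand-rolled matcher for patterns of literal and starred characters, nothing like perl's real engine; the %seen hash is the "drop the second way" idea):

            use strict;
            use warnings;

            # %seen records every (token index, string index) pair visited.
            # Reaching the same pair a second time must fail the same way it
            # failed before, so we prune instead of redoing the work.  (Safe
            # here because $ti or $si strictly grows on every call, so a pair
            # is only revisited after its first visit has fully failed.)
            sub toy_match {
                my ( $toks, $str, $ti, $si, $seen ) = @_;
                return 0 if $seen->{"$ti,$si"}++;            # been here: prune
                return $si == length $str if $ti == @$toks;  # pattern used up
                my ( $op, $ch ) = @{ $toks->[$ti] };
                if ( $op eq 'lit' ) {
                    return substr( $str, $si, 1 ) eq $ch
                        && toy_match( $toks, $str, $ti + 1, $si + 1, $seen );
                }
                # star: greedily try one more $ch, then try skipping the star
                return 1
                    if substr( $str, $si, 1 ) eq $ch
                    && toy_match( $toks, $str, $ti, $si + 1, $seen );
                return toy_match( $toks, $str, $ti + 1, $si, $seen );
            }

            # /^o*o*m$/ -- the two stars can split the o's up many ways,
            # but the cache collapses the duplicate (token, position) states
            my @toks = ( [ star => 'o' ], [ star => 'o' ], [ lit => 'm' ] );
            print toy_match( \@toks, 'oooom', 0, 0, {} ) ? "match\n" : "no match\n";
            print toy_match( \@toks, 'oooox', 0, 0, {} ) ? "match\n" : "no match\n";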

        OTOH there is no requirement to optimize the whole RE in this way. In fact, when you are capturing a backreference used elsewhere in the RE, state information relevant to the global match is being captured, and you explicitly cannot apply this optimization. You might still optimize bits and pieces, but you won't be able to find a simple and efficient way to match crazy things like /(.*).*\1/...

Re: Re: Re (tilly) 3: Research ideas
by demerphq (Chancellor) on Mar 22, 2002 at 16:37 UTC
    An NFA is the opposite: it may (often) be much faster than a DFA,

    I assume you mean this in the context of including init times? Excluding the DFA table-build phase, a DFA should be faster than or equivalent to an NFA.

    And I would absolutely welcome a DFA engine in addition to the existing NFA engine in perl. Especially if the DFA engine had a way of saving its state table. The possibilities are endless....

    Yves / DeMerphq
    ---
    Writing a good benchmark isn't as easy as it might look.
