Re: Evolution, v2.0

Replies are listed 'Best First'.
Re2: Evolution, v2.0 by mstone (Deacon) on Dec 04, 2002 at 02:13 UTC
For a complete answer, find a good book about compilers and look up subset construction. The Dragon Book (aka: Compilers: Principles, Techniques and Tools) by Aho, Sethi, & Ullman is the canonical reference. In less formal terms, you track the states of an NFA by combining all the possible transitions each step of the way. If input character 'a' can take you to states B, C, and D, you then check B, C, and D to see if they have transitions for the next input character. In effect, the set {B,C,D} of nondeterministic states is itself a state in a deterministic finite automaton (DFA) that recognizes the same string. Processing that ever-growing list of states that might have transitions is nontrivial work. The traditional solution uses two stacks: stack one contains all the states you want to check for transitions, and whenever you find a valid transition you push the result state onto stack two. When stack one is finally empty, you advance to the next input character, swap stack one with stack two, and start over again. The process becomes even more fun when your NFA contains what are called epsilon transitions.. 'automatic' transitions that link two states without requiring any input. Epsilon transitions make it easier to model things like Kleene closures (the splat operator ('') in a regular expression) in ways that are (comparatively) easy to read: `a DFA that recognizes the string 'a': (s0)- a ->(s1) an NFA with epsilon transitions that recognizes the string 'a': +-->(s1)- a ->(s2)- e -+ \| \| +-->(s0)- e -+----------------------+-->(s3)- e -+ \| \| +------------------------------------------------+` [download] To find the set of states that might possibly accept the next input character, you have to follow all the epsilon transitions to states you can reach without having to accept any input. Then you have to follow any e-transitions those states might have, etcetera. In the diagram above, for instance, you have to follow three e-transitions before you can recognize the second 'a': one from s2 to s3, a second from s3 to s0, and a third from s0 to s1. The result is known as the e-closure of the original state. You also have to watch out for e-transitions that will take you in a loop.. from s0 to s3, and back to s0, for instance. During subset construction, you have to calculate the e-closure of a whole set of NFA states. You can do that as soon as you find a valid transition (if the transition is valid, push the e-closure of the result state onto stack two), or you can do it before looking for transitions (pop the next NFA state off stack one, find the e-closure, and check all those states for transitions). Either way, it takes some fiddling around. Both the process of building an e-closure and the process of searching for transitions break down nicely for a divide-and-conquer strategy, though. If a state is smart enough to find its own e-transitions (fairly trivial), it can then ask all its e-neighbors to report their own e-transitions. The same is true for checking transitions. Therefore, it makes sense to turn the states into objects that are smart enough to handle their own chunk of the calculation.	[reply] [d/l]

Replies are listed 'Best First'.

Re2: Evolution, v2.0
by mstone (Deacon) on Dec 04, 2002 at 02:13 UTC

For a complete answer, find a good book about compilers and look up subset construction. The Dragon Book (aka: Compilers: Principles, Techniques and Tools) by Aho, Sethi, & Ullman is the canonical reference.

In less formal terms, you track the states of an NFA by combining all the possible transitions each step of the way. If input character 'a' can take you to states B, C, and D, you then check B, C, and D to see if they have transitions for the next input character. In effect, the set {B,C,D} of nondeterministic states is itself a state in a deterministic finite automaton (DFA) that recognizes the same string.

Processing that ever-growing list of states that might have transitions is nontrivial work. The traditional solution uses two stacks: stack one contains all the states you want to check for transitions, and whenever you find a valid transition you push the result state onto stack two. When stack one is finally empty, you advance to the next input character, swap stack one with stack two, and start over again.

The process becomes even more fun when your NFA contains what are called epsilon transitions.. 'automatic' transitions that link two states without requiring any input. Epsilon transitions make it easier to model things like Kleene closures (the splat operator ('*') in a regular expression) in ways that are (comparatively) easy to read:

    a DFA that recognizes the string 'a':

    (s0)- a ->(s1)      
    
    
    an NFA with epsilon transitions that recognizes
    the string 'a*':
    
                 +-->(s1)- a ->(s2)- e -+
                 |                      |
    +-->(s0)- e -+----------------------+-->(s3)- e -+
    |                                                |
    +------------------------------------------------+
[download]

To find the set of states that might possibly accept the next input character, you have to follow all the epsilon transitions to states you can reach without having to accept any input. Then you have to follow any e-transitions those states might have, etcetera. In the diagram above, for instance, you have to follow three e-transitions before you can recognize the second 'a': one from s2 to s3, a second from s3 to s0, and a third from s0 to s1. The result is known as the e-closure of the original state. You also have to watch out for e-transitions that will take you in a loop.. from s0 to s3, and back to s0, for instance.

During subset construction, you have to calculate the e-closure of a whole set of NFA states. You can do that as soon as you find a valid transition (if the transition is valid, push the e-closure of the result state onto stack two), or you can do it before looking for transitions (pop the next NFA state off stack one, find the e-closure, and check all those states for transitions).

Either way, it takes some fiddling around.

Both the process of building an e-closure and the process of searching for transitions break down nicely for a divide-and-conquer strategy, though. If a state is smart enough to find its own e-transitions (fairly trivial), it can then ask all its e-neighbors to report their own e-transitions. The same is true for checking transitions. Therefore, it makes sense to turn the states into objects that are smart enough to handle their own chunk of the calculation.

[reply]
[d/l]