in reply to Comparative satisfiability of regexps.

In "Mastering Regular Expressions" p. 156-157, there is a comparison of nondeterministic finite automaton (NFA) and determinisitic finite automaton (DFA):

"There could be a hundred ways to acheive the same result, but since the DFA keeps track of them all simultaneously (almost magically -- more on this later), it doesn't matter which form the regex takes. To a pure DFA, even expressions that appear as different as abc and [aa-a](b|b{1}|b)c are indistinguishable."

If you accept only regular expressions recognizable by a DFA (no captures to match later in the string, no embedded code) as your "mildly cripped subset" then your problem should be solveable.

Replies are listed 'Best First'.
Re^2: Comparative satisfiability of regexps.
by demerphq (Chancellor) on Jan 19, 2005 at 21:32 UTC

    Actually Friedl is making a point about how the engine would match, not that the state tables involved would be the same. IOW there is no guarantee that a DFA constructed from abc would have the same state transition tables as from [aa-a](b|b{1}|b)c.

    Instead of trying to build a DFA regex engine (a non trivial task) I'd suggest just writing some code that can normalize two regexes to the same representational format. If you can normalize two regexes to the same pattern then they are the same. Of course this doesnt help when it turns out that the receivers regex can accept a superset of the emitters pattern generator. Ie: the emitting regex could be /a/ and the receiving regex could be /a|b/ which means the receiving regex can handle anything that the emitting regex can produce but vice versa is not true. IMO if you figure that problem out you are well on your way to solving the halting problem.

    ---
    demerphq

      It is true that the two DFAs may not be exactly alike. However, any two DFAs can be proved equivalent (or not) in finite time, according to this article (in PowerPoint format).

      Of course, "finite" is not the same thing as "practical". The brute-force solution mentioned in that article is exponential (test all strings of length equal to the number of states). There is a second method that XORs the two DFAs that looks more promising.

      It looks like either method could be adapted to test whether one DFA accepts a subset of the other.

      Update: blokhead below gives the same solution in a more thoroughly-explained way.

      Any technology provably equivalent to a perpetual motion machine is junk. :-)

      Actually, what I'm trying to resolve is the "receivers regex can accept a superset of the emitters pattern generator" question. Specifically, whether that is equivalent to the halting problem or not.

      As I said in another note, I believe that DFAs require a memory strip to be Turing-complete, and thus analysis of regexps doesn't run into the halting problem. But I'm not certain of that, and my last Comp. Sci. course was very long ago.

      Resolving to canonical form is a good first step, but it's not entirely clear how to algorithmically determine that, for example, aaaab?aaaa is a subset of aaaaab*aaaaa. That one took a moment to prove and double-check in my head, and it's a *really* trivial case compared to some I'd like to verify.

      Brain-achingly,

      Mickey.

        Well if it isnt the the halting problem exactly i suspect its damn near as hard. How about these two:

        /aaaa(bc|bd|be|(b|b)(a(a(a(a)))|baba|aaaa)/ /aaaab?aaaa/

        I suspects its a lot easier to show that regex E can produce something that regex R wont accept, but saying for sure that R will accept all patterns producable by E isnt so easy. And IMO neither are approachable problems at all. Id be thinking of alternate solutions. One possibility is to produce something that generates random data from the E pattern and then hammer the R pattern with the products to see but obviously this wont be a proof, just a strong argument.

        /me wonders if he still has his data from regex code...

        ---
        demerphq

Re^2: Comparative satisfiability of regexps.
by Meowse (Beadle) on Jan 19, 2005 at 21:33 UTC
    *nod* Yours is the second pointer to "Mastering Regular Expressions" I've received, and theorbtwo said substantially the same thing (doable if the regexps are deterministic) in the Chatterbox earlier.

    *ponder* of course, any NFA can be mechanistically converted to a DFA (by creating a state for every reachable combination of NFA states), so really the answers should be equivalent. But since NFA/DFA's aren't actually Turing-complete (barring a Turing-machine-style memory tape), we don't run into the halting problem. Huh. Sounds like it should be solvable.

    But it's sounding like I may have to write the actual code myself; at least, nobody has spoken up and said "Oh, yes, that's been solved, and here's a link" yet. Which is, honestly, what I was hoping for. :-)

    Of course, even the fact that two deterministic regexps can be proven to be identical doesn't actually answer the original question. abc and ab[c|d] will have different DFA representations, but all strings that abc matches will also be matched by ab[c|d]. But it does suggest some approaches for resolving the issue.

    Anyone got more advice?

    Thanks 10e6,

    Mickey.