in reply to Perl regexp matching is slow??

The stuff in that article is anything but not new. For a comprehensive comparison of NFA's and DFA's check out Mastering Regular Expressions.


holli, /regexed monk/

Replies are listed 'Best First'.
Re^2: Perl regexp matching is slow??
by rsc (Acolyte) on Jan 31, 2007 at 01:59 UTC
    You won't find anything useful about NFAs and DFAs in Friedl's book. What's in Friedl's book about that topic is misleading at best. Friedl freely admits that he doesn't care for the underlying theory.

    That's why I wrote the article.

      Hi Russ,

      I think it's a bit unfortunate how you phrased that response ("nothing useful about NFAs and DFAs"), since it doesn't make clear that you're talking about the theoretical potential of the techniques (which, indeed, my book does not cover). As such, it sounds like an attack rather than the simple citing of a fact.

      You had it right in your (excellent) article:

      Friedl's book teaches programmers how best to use today's regular expression implementations, but not how best to implement them.

      I think a better response to Holli would be one noting that there's more potential offered by the underlying theory than is evident in the common implementations covered in the book.

      Whether the underlying theory is my cup of tea is not particularly relevant to my book, because the reality of today's common implementations is that they are what they are. Perhaps you wouldn't have so much angst about my book if it were titled Mastering (what goes for) Regular Expressions ? :-)

      In your article, I'm not sure what you mean when you say that I don't "respect" theory (?), but it's true that a lot of the theory behind what led to today's implementations is beyond my desire to stay awake. Here's an example of the type of thing that trying to understand makes my eyes glaze:

      "An equivalence relation R over Σ* is right invariant if and only if whenever xRy, then xzRyz for all z ∈ Σ*. An equivalence relation is of finite index if and only if there are only finitely many equivalence classes under R

      (from Robert Constable's paper "The Role of Finite Automata in the Development of Modern Coputing Theory" in The Kleene Symposium, North-Holland Publishing, 1980).

      You know what they say: "Those who can, do, and those who can't, teach". My book teaches how to use what's there now. It would please me to no end if I could throw out all the chapters on optimization and such, and perhaps your article is a first step toward that goal. Some won't appreciate being told that the emperor has fewer clothes than has been thought, but don't let that deter you.

      I do think that along the way you'll find some unexpected problems related to the semantics of capturing parentheses, for example, but heck, if you're half as good an engineer as you are a writer, I'll hold out hope, because the writing and presentation in your article is really top notch.

         Jeffrey
      ----------------------------
      Jeffrey Friedl    http://regex.info/blog/

        I wholeheartedly agree that the dry language of most finite automata papers would put anyone to sleep. It does serve an important purpose, namely making things precise enough to prove things mathematically. I certainly understand why you're falling asleep. Those papers are written for mathematicians, not programmers. ;-)

        I think the fundamental problem is the disconnect you pointed out in your blog post: when you (and, it turns out, most programmers) say DFA and NFA you mean something related to but different from what the mathematicians mean. It is true that saying things about the programming concepts, which are fuzzy and implementation-specific, is not too worthwhile, since they are always shifting, but it's not true of the mathematical ideas, which have strong mathematical proofs associated with them, telling what one can and cannot do as far as implementation strategies and efficiency.
        It would please me to no end
        Nit: the common phrase is "please me no end", that is, please infinitely. "please me to no end" would mean the pleasing would be meaningless.

      I hear that the engine has gotten relatively easy to replace in perl 5.10. It'd be wrong to use a DFA with code that was written for perl's default NFA but you could surely plug it in for code that knew it was going to use the DFA. Have you given any thought to trying this out for real?

      ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

      What's in Friedl's book about that topic is misleading at best. Friedl freely admits that he doesn't care for the underlying theory.
      I followed that same link when I first read your article and still have no idea what you are referring to. Would you please quote the statement where he expresses his feelings about the underlying theory?
        The quote that sums it up most obviously is:
        What does all this mean to you, as a user? Absolutely nothing. As a user, you don't care if it's regular, nonregular, unregular, irregular, or incontinent. So long as you know what you can expect from it (something this chapter will show you), you know all you need to care about.
        This isn't true at all. The choice of whether to use real regular languages or not has direct implications for the efficiency of the matching. The reality is that you can expect very little from backtracking regex implementations as far as performance guarantees go.

        But even more important is the comment about tweaking NFAs in ad hoc ways:
        As a programmer, if you have a true (mathematically speaking) NFA regex engine, it is a relatively small task to add support for backreferences. A DFA's engine's design precludes the adding of this support, but an NFA's common implementation makes it trivial. In doing so, you create a more powerful tool, but you also make it decidedly nonregular (mathematically speaking). What does this mean? At most, that you should probably stop calling it an NFA, and start using the phrase “nonregular expressions,” since that describes (mathematically speaking) the new situation.
        Tweaking the NFA implementation and extending the regex syntax means a lot more than that. For one thing it means that you've now created situations where some regexes can't be matched efficiently. It is a very big deal if you are treating regular expressions as describing the strings rather than describing the algorithms (see my much larger comment elsewhere in this thread). Also the comment about having a "true (mathematically speaking) NFA regex engine" is complete nonsense. He's talking about a backtracking implementation, which the word NFA has nothing to do with (mathematically speaking).