in reply to Perl regex in real life

I'm not sure that I fully understand your question, but would point out that the pcre (Perl Compatible Regular Expressions) C library is used as part of PHP for preg_match(), preg_replace(). (PHP aping it's betters again ;-))

pcre's are also used in applications such as the Postfix mail system. Part of my own anti-spam solution for my Postfix mail servers uses pcre's for message header and body checks. For reference, here are my current bodychecks and headerchecks files. (For anyone thinking that my regex syntax is weird, I've done them like this so that I can read them easily.)

Replies are listed 'Best First'.
Re^2: Perl regex in real life
by RezaRob (Initiate) on Oct 31, 2008 at 19:28 UTC
    Thanks for posting your filters. I guess what I mean is that apparently in real-life backtracking and related features aren't required. In fact, I find them a source of distraction and bugs in my regexes. For instance, if you want to parse C source code, you might say something like:
    /((?:const\s+)*)\s+(\w+)\s+(\w+);/
    Now suppose the C code has a bug, and it defines a variable like this:
    const int;

    Then your three regex captures look like:
    $1==''
    $2=='const'
    $3==int

    which clearly wasn't the intended result.

    This happens because of backtracking for instance.

    If you look at regex history, people had NFA/DFA machines from computer science theory, and they said a "language" is whatever passes through these machines. So that's what regexes matched. Nowadays, Perl has actually introduced things like non-backtracking constructs, namely '(?>'; essentially, acknowledging the "problem". However, I'm not quite sure why non-backtracking shouldn't be all you need in a "real-life" situation.

    Thanks a lot for everyone's comments.

    Reza.

      Neither backtracking nor captures are really a problem needing to be fixed. Features like (?>) were added to help extend the sorts of things regular expressions can do.

      To take your example, regular expressions are not the problem with why the wrong thing was matched. Your expression allowed that interpretation (or it would have with some minor changes). The problem is that you are using an expression that does not properly cover the case you claim to be looking for.

      To take another example that I think shows why these features are more useful, let's match a US telephone number. A telephone number in the US can take many forms:

      • 445-7890
      • 445 7890
      • 4457890
      • 713 445-7890
      • (713) 445-7890
      • 713-445-7890
      • 7134457890
      • 713 445 7890

      And that leaves out adding a 1 or 0 for long distance and extensions, which people often give as part of the number.

      Matching this set of expressions requires optional characters which (if you are doing captures) requires backtracking. (Not really, but the implementation gets hairier if we discuss that part.)

      So to match a phone number, we would need:

      m{ ( (?: \( \d\d\d \) \s* ) | \d\d\d (?: -? | \s* ) ) ? \d\d\d (?: - | \s* ) \d\d\d\d ) }x;

      Obviously, this appears somewhat complicated and there is quite a bit of possibility for confusion. In this case, however, the problem is not the regex, it's the fact that the phone number format is specified fairly sloppily.

      In fact, the times that I have often found the features you are questioning most useful are when I'm dealing with real world data. Because unlike the stuff (insert pompous tone) I generate, the real world is messy and inconsistent.

      One of the nastiest problems I ever tried to solve was to extract tables of information from text files generated by people at various companies. You have no idea how many weird variations that people can come up with that a person can interpret, but are almost unparseable by computer. Without many of these features, we would not have gotten as far as we did.

      G. Wade
        Neither backtracking nor captures are really a problem needing to be fixed.
        Notice that I never said captures where a "problem". They are an important feature that backtracking tries to address in a way that doesn't loose backwards compatibility with NFA machines.
        * (713) 445-7890 * 713-445-7890 * 7134457890 * 713 445 7890
        Matching this set of expressions requires optional characters which (if you are doing captures) requires backtracking. (Not really, but the implementation gets hairier if we discuss that part.) So to match a phone number, we would need:
        m{ ( (?: \( \d\d\d \) \s* ) | \d\d\d (?: -? | \s* ) ) ? \d\d\d (?: - | + \s* ) \d\d\d\d ) }x;
        I don't see any backtracking here. The "pointer" always only moves forward. The simplistic C-code example that I posted also has the star operator in it's regex, but that's not what causes it to backtrack(i.e. literally go back and erase a previous match.) A conditional like /\)*/ isn't really backtracking in it's own right, it either matches or doesn't match at all. However, to get the parenthesis in your example _precisely_ right, backreferences are actually needed, which is interesting.
        One of the nastiest problems I ever tried to solve was to extract tables of information from text files generated by people at various companies. You have no idea how many weird variations that people can come up with that a person can interpret, but are almost unparseable by computer. Without many of these features, we would not have gotten as far as we did.
        Fair enough, and you're right. I can see sometimes in the real world one just has to get a job done in this way and perhaps quickly. However, I should point out that the long term solution in these cases is a more intelligent system. Solving these with regex is just looking at the problem the wrong way. Think of what Google news does. Or Google translator for that matter. They don't, and can't, use regex for these technologies. You need some level of machine intelligence, that views the problem much more abstractly.

        Reza.
      You see, in the old days, a regex only returned a boolean: match or not-match. Later, people got serious about actually capturing the matched sections, and they introduced backtracking (versus NFA machines) to get that sort of thing done.

      But then, the actual _algorithm_ and internals of what's happening becomes crucial. You now care about more than just a theoretical boolean result: match or not-match.

      When internally it matters what the engine is doing, backtracking makes it very hard and unnatural to predict and to think about. That's not how "real" humans actually think about their mother tongue. They don't seem to seriously "backtrack" in their brains while reading a book.

      So, are there real-life examples of where that's needed?

      Reza.