in reply to Re: Perl regexp matching is slow??
in thread Perl regexp matching is slow??

In order to satisfy that "descriptive" you would have to build DFA if for no other reason then to make all |-s equal and to deduce non-greedy multipliers. Just to remind you that once you cross the Rubicon to "descriptive" non-greedy intent has to be deduced.
So, at the very least, you would incur perf hit all the time for the sake of a fringe case. Doesn't really matter if you would hide it behind the dynamic building. So if the DFA has the exponential explosion of states you would eventually reach the point of explosion. Sorry I can't find the one that Ullman used to throw at unsuspecting souls :-), but I'm sure you'll find enough of these on the net ...
You might get away with a stuff like "on the fly" DFA for something that is one-time run, but in such case the whole debate is artificial and academic.
For a real world run, when say I have a regex compiled all the way to the binary, say in .NET, and am paying the price of never being able to "unload" it for the lifetime of the binary, and the binary happens to be a server required to hit 99.9999 and is running that regexp on all incoming text ... I think you get the picture :-) and what's the purpose of non-greedy multipliers and all these assertions you dislike so much and why even a thought that someone might be building an DFA on the fly is not a realistic option. Even if it could handle back-references, or more importantly a recursive parse with implicit stack - which it can't.
See, we actually need the interpretative, ahem interpretation :-) of the regex and the one that is stable across languages and platforms - thereby the PCRE and the only real "price" we "pay" is that we use non-greedy multipliers and we are done. Most horrible thing, isn't it :-)
Now we could debate that we might prefer to have non-greedy as a default but that's a historical note :-) Also a matter of how precisely one wants to define the input he'll take in. Some people may insist on using negative classes instead, but they don't write regexps to match a in a :-).
As for the back-refs and recursive parse, that's going to be more in demand, not less. For example .NET implementation of Perl regex was designed to be able to parse XML without a perf hit and while I couldn't say if Perl would take a perf hit for something like that, it definitely can parse XML with regex much easier then .NET.
So, you can go cursing everyone, starting with Thompson himself :-) or you can kind of come to terms with the fact that regex engines in widespread use are practical tools and not academic toys. So, claiming that one engine is "slow" because it will have a problem with a ridiculous thing like a?{3}a{3} on "aaa" but will run a{0,30}?a{30} without a hiccup looks like the ultimate exercise in futility. If for some reason you really, honestly need to match something like that, well you can always write a regex, to transform "a?{n}a{n}" into "a{0,n}?a{n}" :-)) Might need back-refs to handle more convoluted cases though :-)))))
In the meantime I will happily cherish Perl/PCRE/.NET etc. regex in all it's universal presence and benefits it brings - for all people with real needs for fast string parsing.

Replies are listed 'Best First'.
Re^3: Perl regexp matching is slow??
by demerphq (Chancellor) on Feb 27, 2007 at 15:08 UTC

    As for the back-refs and recursive parse, that's going to be more in demand, not less. For example .NET implementation of Perl regex was designed to be able to parse XML without a perf hit and while I couldn't say if Perl would take a perf hit for something like that, it definitely can parse XML with regex much easier then .NET.

    Im not sure what performance hit you mean. Although ambrus had some neat ideas for something like the super-linear cache with regards to pattern level recursion that i have yet to follow up on.

    Also juerd has published a validation regex for xml that is derived from the EBNF that w3 has posted. The story that you cant parse XML or HTML with regexes is going to go up in a puff of smoke when Perl 5.10 comes out.

    ---
    $world=~s/war/peace/g

      The problem is that it's still tricky to get at the parse tree. All the parts are there -- named captures, relative backreferences, recursion -- but you still have to litter your regexp with hairy use of
      (?{ local $FOO{blah} = $^N })
      to get the data back.

      It should be possible to whip something up with re::engine and Regexp::Parser, but a brief attempt showed me that this is somewhat tricky to get right except in simple cases. It would be much more convenient to have this support built into the regex engine. Perhaps for 5.12...

        abigail mentioned something similar, but i havent gotten my head around what is really needed. If i had a better idea id be able to given a better estimate of when it could be done by. For instance there is a considerable amount of data on the stack after a match that might be useful for this purpose, i just dont really have a good grip on the problem to make any useful suggestions.

        ---
        $world=~s/war/peace/g