faibistes has asked for the wisdom of the Perl Monks concerning the following question:

Several regexps I've written are behaving terribly slow. I understand this stems from the algorithm (NFA) perl uses. I understand, too, that it's very likely that my regexps are poorly written. Is there a module that changes the regexp algorithm (losing backreferences, I know) so it's a NDA? Or is there a tool (maybe it's theoretically impossible and I'm asking nonsense) that transforms my regexp into a NFA-efficient equivalent? I don't care about $1,$2... for now.
FYI, here's one of the offending regexps (I removed non-greedy repetitions and some nesting to no avail):
(.*<\/td>){2}.*BAJA(.*<\/td>){6}.*FINALIZAD(.*<\/td>){9}
Could split() be my friend in this or similar cases (looking for something in the nth. element in a HTML table)?

Replies are listed 'Best First'.
Re: Backtracking hurts: slow regexp
by moritz (Cardinal) on Dec 19, 2008 at 11:54 UTC
    What hurts in your case are the nested quantifiers, ie (.*<\/td>){9}. You might use (?>.*?<\/td>){9} instead, it's probably much faster, and what it does is closer to what you think it does.

    That said, use a proper HTML parsing module from CPAN, and extract your desired information from the parse tree.

      Sir, you're the man. Obviously you've understood perfectly what I was trying to achieve: parsing column-delimited HTML data. But I don't understand why is is your expression equivalent to mine. I'm a newbie at perl, and reading the perlreref I guessed that the solution might have been in ?>, but to be honest, my head hurts when I try to make head or tails of it. Could you please try to explain, in the simpler possible way, how does ?> work?
        From the forementioned perlreref:
        (?>...) Grab what we can, prohibit backtracking
        that's it. it does not allow backtracking. so, the (?>.*?<\/td>){9}will get exactly 9 instances of (non-greedy) anything followed by </td>... it won't try to go till the end of the string chasing the longest .* (because it is not greedy) and if the last . of the sequence is not followed by </td>, it will fail without backtracking (working more or less as a deterministic automaton).
        []s, HTH, Massa (κς,πμ,πλ)
Re: Backtracking hurts: slow regexp
by JavaFan (Canon) on Dec 19, 2008 at 11:53 UTC
    Perl uses internally an engine that's based on NFA. It is possible to change this into a DFA, but that would require you to write a different regexp engine - all perl currently provides is a mechanism to hook in a different regexp engine.

    As for optimizing your regexp, I do not know if that's possible. I do not know what the regexp is supposed to do. Sure, I can read the regexp, and I know what it does, but if I change anything about it, it will do something else. Which may still do the task you want it to do, but since you don't tell us, I'm not going to guess.

    Tell us, what's the data you are applying it against, and what you want to match?