in reply to Backtracking hurts: slow regexp

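What hurts in your case are the nested quantifiers, i.e. (.*<\/td>){9}. Each greedy .* can swallow several </td> tags, so when the overall match fails the engine retries every possible way of splitting the line among the nine repetitions. You might use (?>.*?<\/td>){9} instead; it's probably much faster, and what it does is closer to what you think it does.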
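A minimal sketch of the suggested pattern, in case it helps to see it run. The sample rows are made up for illustration (they are not your data), and the ^ anchor is my addition so that a failed match is not retried from every later position:

```perl
use strict;
use warnings;

# Hypothetical sample rows, invented for this sketch: nine cells and eight cells.
my $row   = join '', map { "<td>cell$_</td>" } 1 .. 9;
my $short = join '', map { "<td>cell$_</td>" } 1 .. 8;

# Atomic, non-greedy version: each repetition stops at the first </td>
# and is then locked in, so earlier cells are never re-split.
my $matched       = $row   =~ /^(?>.*?<\/td>){9}/ ? 1 : 0;

# With only eight cells the ninth repetition finds no </td>, so the match fails.
my $short_matched = $short =~ /^(?>.*?<\/td>){9}/ ? 1 : 0;

print "nine cells: $matched, eight cells: $short_matched\n";
```

Note that . does not match a newline unless you add the /s modifier, so this assumes the nine cells sit on one line.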

That said, use a proper HTML parsing module from CPAN, and extract your desired information from the parse tree.

Re^2: Backtracking hurts: slow regexp
by faibistes (Novice) on Dec 19, 2008 at 12:57 UTC
    Sir, you're the man. Obviously you've understood perfectly what I was trying to achieve: parsing column-delimited HTML data. But I don't understand why your expression is equivalent to mine. I'm a newbie at Perl, and reading perlreref I guessed that the solution might lie in ?>, but to be honest, my head hurts when I try to make heads or tails of it. Could you please explain, in the simplest possible way, how ?> works?
      From the aforementioned perlreref:
      (?>...) Grab what we can, prohibit backtracking
      That's it: it does not allow backtracking into the group. So (?>.*?<\/td>){9} will match exactly 9 instances of (non-greedy) anything followed by </td>. It won't run to the end of the string chasing the longest .* (because it is not greedy), and once a repetition has matched up to its </td>, the engine will not go back into it to try a different split. If a later repetition cannot find its </td>, the match from that starting position fails without backtracking (working more or less like a deterministic automaton).
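      A small sketch of that difference, on a made-up two-cell string (the input and variable names are mine, just for illustration). It compares a plain greedy group, an atomic greedy group, and the atomic non-greedy group from above:

```perl
use strict;
use warnings;

# Hypothetical two-cell input, invented for this sketch.
my $html = '<td>a</td><td>b</td>';

# Plain greedy group: the first .* overshoots to the last </td>,
# then backtracks to the first one to make room for the second repetition.
my $plain_greedy  = $html =~ /^(?:.*<\/td>){2}/  ? 1 : 0;

# Atomic greedy group: the first repetition swallows both cells and is locked,
# so nothing is left for the second repetition and the match fails.
my $atomic_greedy = $html =~ /^(?>.*<\/td>){2}/  ? 1 : 0;

# Atomic non-greedy group: each repetition stops at the first </td> it sees.
my $atomic_lazy   = $html =~ /^(?>.*?<\/td>){2}/ ? 1 : 0;

print "plain greedy:  $plain_greedy\n";
print "atomic greedy: $atomic_greedy\n";
print "atomic lazy:   $atomic_lazy\n";
```

      The middle case is the one to look at: with (?>...) the engine may not reopen a repetition that has already matched, which is why the non-greedy .*? matters here.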
      []s, HTH, Massa (κς,πμ,πλ)