shmem has already pointed out that there are other probably better ways to solve this.

It's still interesting, though, to ponder why a simple change from * to + can cause the pattern to run for a very long time.

The problem is the last line:

<td class='list_cell' valign='middle' style="font-size:120%;"></td>
It makes the pattern fail. But the regex engine doesn't give up so easily, and so it tries to backtrack each quantifier in the pattern before that. Your biggest problem here is probably that .+? can match too much. It can pass a <td></td> tag, and then try to match again. For instance, instead of matching the second last, it can skip that and try to match the very last, but it will definately fail that too. Without an ^ anchor the engine can also try to start matching at the second <td></td> tag, and so on. It sums up to an awful lot of backtracking.

The easiest way to solve this is to use the (?>) assertion. It's the "anti-backtrack" assertion. Once the pattern inside the assertion has matched, it won't backtrack. The match is freezed. (The engine may of course backtrack behind the assertion, and then it'll try to match again at the new place.) So a line in your pattern may look like

(?><td[^>]*>\s*(.+?)\s*</td>\s*)
depending on what you really need. I hope I managed to be somewhat clear and understandable.

If an HTML parser isn't doing the trick for you, you might want to consider using split or a m//g to extract relevant parts, and then handle those separately. That way you easily avoid a lot of backtracking, if the patterns become more sophisticated.

lodin


In reply to Re: Why this simple regex freeze my computer? (backtracking) by lodin
in thread Why this simple regex freeze my computer? by Gangabass

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.