Gangabass has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Monks.

I can't realize why this regex freeze my computer. Here is code:

#!/usr/bin/perl use strict; use warnings; my $bid_str = <<END; <td class='list_cell'> <a name='15'></a> <span style="font-size: 120%;">Dec 15.07</span><br> <span style="font-size: 120%;">[--:--]</span> </td> <td class='list_cell'> <span style="font-size: 120%;" target='_top'>9998</span><b +r> Okayama </td> <td class='list_cell' style="font-size: 110%;">1988</td> <td class='list_cell' align="left"> <a href='e_shitami.asp?kaijo=OK&ss=9000&se=9999&m=&s=SyupN +o&o=ASC&ap=15&pgcd=E-113&lst=E-111&p=1&listpos=15&tkaijo=OK&taa=561&t +syup=9998' target='_top'> </a><br> </td> <td class='list_cell' style="font-size: 120%;"> FA<br> </td> <td class='list_cell' align="right" style="font-size: 120%;"> 0 <br> </td> <td class='list_cell'>white<br></td> <td class='list_cell' align="left"> &nbsp;&nbsp; <br> AC </td> <td class='list_cell' align='left'>&nbsp;<img src='../../image +s/clear.gif' width='12' height='12' style='display:inline'>&nbsp;non +Auction</td> <td class='list_cell' align='right'> ----<br> 88,888 </td> <td class='list_cell' valign='middle' style="font-size:120%;"> +</td> END my ( $date_shedule, $bid_num_site, $year_raw, $model_grade, $transmiss +ion_displacement, $mileage_inspection, $color_raw, $type_equipment, $ +result_raw, $final_start, $scores_raw ) = $bid_str =~ m{<td[^>]*>\s*( +.+?)\s*</td>\s*<td[^>]*>\s*(.+?)\s*</td>\s*<td[^>]*>\s*(.+?)\s*</td>\ +s*<td[^>]*>\s*(.+?)\s*</td>\s*<td[^>]*>\s*(.+?)\s*</td>\s*<td[^>]*>\s +*(.+?)\s*</td>\s*<td[^>]*>\s*(.+?)\s*</td>\s*<td[^>]*>\s*(.+?)\s*</td +>\s*<td[^>]*>\s*(.+?)\s*</td>\s*<td[^>]*>\s*(.+?)\s*</td>\s*<td[^>]*> +\s*(.+?)\s*</td>\s*}is;

If i change regex to (i show only last part)(i replace + with *) :

...<td[^>]*>\s*(.*?)\s*</td>\s*}is;

it works fine. I think it must simple return no match but it freeze!

Replies are listed 'Best First'.
Re: Why this simple regex freeze my computer? (backtracking)
by lodin (Hermit) on Dec 15, 2007 at 12:06 UTC

    shmem has already pointed out that there are other probably better ways to solve this.

    It's still interesting, though, to ponder why a simple change from * to + can cause the pattern to run for a very long time.

    The problem is the last line:

    <td class='list_cell' valign='middle' style="font-size:120%;"></td>
    It makes the pattern fail. But the regex engine doesn't give up so easily, and so it tries to backtrack each quantifier in the pattern before that. Your biggest problem here is probably that .+? can match too much. It can pass a <td></td> tag, and then try to match again. For instance, instead of matching the second last, it can skip that and try to match the very last, but it will definately fail that too. Without an ^ anchor the engine can also try to start matching at the second <td></td> tag, and so on. It sums up to an awful lot of backtracking.

    The easiest way to solve this is to use the (?>) assertion. It's the "anti-backtrack" assertion. Once the pattern inside the assertion has matched, it won't backtrack. The match is freezed. (The engine may of course backtrack behind the assertion, and then it'll try to match again at the new place.) So a line in your pattern may look like

    (?><td[^>]*>\s*(.+?)\s*</td>\s*)
    depending on what you really need. I hope I managed to be somewhat clear and understandable.

    If an HTML parser isn't doing the trick for you, you might want to consider using split or a m//g to extract relevant parts, and then handle those separately. That way you easily avoid a lot of backtracking, if the patterns become more sophisticated.

    lodin

      ++Nice explanation, and another reason why you shouldn't use regexes to parse HTML. One single error (or one single change in the html) can freeze your box and spoil your carefully crafted regex. As matija very succinctly said,
      By the time you've resolved all those problems, you've written the better part of a HTML parser.

      Of course you learn a lot trying ;-)

      --shmem

      _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                    /\_¯/(q    /
      ----------------------------  \__(m.====·.(_("always off the crowd"))."·
      ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Why this simple regex freeze my computer?
by shmem (Chancellor) on Dec 15, 2007 at 11:51 UTC
Re: Why this simple regex freeze my computer?
by mrpeabody (Friar) on Dec 17, 2007 at 05:09 UTC
    m{<td[^>]*>\s*(.+?)\s*</td>\s*<td[^>]*>\s*(.+?)\s*</td>\s* <td[^>]*>\s*(.+?)\s*</td>\s*<td[^>]*>\s*(.+?)\s*</td>\s* <td[^>]*>\s*(.+?)\s*</td>\s*<td[^>]*>\s*(.+?)\s*</td>\s* <td[^>]*>\s*(.+?)\s*</td>\s*<td[^>]*>\s*(.+?)\s*</td>\s* <td[^>]*>\s*(.+?)\s*</td>\s*<td[^>]*>\s*(.+?)\s*</td>\s* <td[^>]*>\s*(.+?)\s*</td>\s*}is;

    is not my idea of a simple regex.

    It could also be written more clearly as (untested):

    my $text = "..."; my $cell = qr{ <td[^>]*> \s* (.+?) \s* </td> \s* }isx; my @result = $text =~ m/ $cell $cell $cell $cell $cell $cell $cell $cell $cell $cell $cell /x;