in reply to Re: Regex infinite loop?
in thread Regex infinite loop?

I can't give you a technical answer to this, but I can tell you that I have been running this code for many weeks now. It runs once a day at the same time. When I run the program manually, it makes each match fairly quickly (a few seconds, at most).

One thing, though. The code has been running fine for weeks, but whoever runs the website made some minor changes to the HTML. This forced me to have to go back and make a couple of (extremely) minor changes to the matching code. That said, I would note that the code does match, and match quickly, for probably 90% of the pages that I am pulling. It only hangs on a couple. The problem appears page-specific because when I change the order in which I pull the pages, it hangs on the same pages (regardless of where they are in my list of pages).

Replies are listed 'Best First'.
Re^3: Regex infinite loop?
by CountZero (Bishop) on Oct 16, 2008 at 20:57 UTC
    Is both the regex and the html of the page that hangs your code a secret? I'm quite sure someone would already have found the solution if you posted the regex and the page code (or the url to that page).

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Okay, here is the relevant (I think) piece of the code

      if ($tempType eq "Bid") { while ($content =~m%<td class="marketWatch" style="width: 20p +x;"><p class="pngfix" id="market-watch-(.+?)">&#160;</p>.*?ugid=(.+?) +&marketId=(.+?)&action=add';};.+?<td style="width: 22\%;">\s+<a href= +"/marketplace/sports/nfl/super-bowl-xliii.*?>(.+?)</a>\s+</td>\s+<td +class="qty">.+?(?:"selected"\s+?>|<input type="hidden" id="qty_select +_\d+" value=")(.+?)(?:</option>|"/>).+?<span class="price">\$(.+?)</s +pan>%gs) { print(STDERR "[", pos($content), "]") ; $tempCount = $tempCount+1 ; print "\ntempCount = $tempCount\n" ; $tempMarketWatchID = $1 ; $tempugID = $2 ; $tempmarketID = $3 ; $tempLocation = $4 ; $tempBidQuantity = $5 ; $tempBid = $6 ; $tempLocation =~ s/^\s+// ; $tempLocation =~ s/\s+$// ; print "$date\t$time\t$tempType\t$teamName\t$tempMarketWatchID\t +$tempugID\t$tempmarketID\t$tempLocation\t$tempBidQuantity\t$tempBid\n +" ; } #end while print "Leaving the bid while\n" ; } #end if

      The pages that I am searching are:

      http://www.firstdibz.com/markets/sports/nfl/ravens/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/bills/super-bowl-xliii/bid +/show/buy http://www.firstdibz.com/markets/sports/nfl/bengals/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/browns/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/broncos/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/texans/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/jaguars/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/chiefs/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/dolphins/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/patriots/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/jets/super-bowl-xliii/bid/ +show/buy http://www.firstdibz.com/markets/sports/nfl/raiders/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/steelers/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/chargers/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/titans/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/cardinals/super-bowl-xliii +/bid/show/buy http://www.firstdibz.com/markets/sports/nfl/falcons/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/panthers/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/bears/super-bowl-xliii/bid +/show/buy http://www.firstdibz.com/markets/sports/nfl/cowboys/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/lions/super-bowl-xliii/bid +/show/buy http://www.firstdibz.com/markets/sports/nfl/packers/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/vikings/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/saints/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/giants/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/eagles/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/49ers/super-bowl-xliii/bid +/show/buy http://www.firstdibz.com/markets/sports/nfl/seahawks/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/rams/super-bowl-xliii/bid/ +show/buy http://www.firstdibz.com/markets/sports/nfl/buccaneers/super-bowl-xlii +i/bid/show/buy http://www.firstdibz.com/markets/sports/nfl/redskins/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/colts/super-bowl-xliii/bid +/show/buy

      The pages that cause my code to hang are the colts and the jets. There maybe others, but I know that every page prior to the jets works fine. After the jets URL, I don't know as I have not tested. (The colts used to be earlier in my list, but I moved it when it started hanging.)

        Yikes ! This gave me an opportunity to learn a bit about debuggering regex.

        The pragma use re 'debug' will tell you more than you ever thought you might have wanted to know about the regex being processed.

        The problem I found was that the 'colts' html contains:

          <td class="qty" style="width: 7%;">
        
        where your regex was looking for:
          <td class="qty">
        
        Because the regex is littered with .*?, there is lots of scope for backtracking. As far as I can see:

        • first it backtracks to (.+?)</a> and goes screaming forward trying to find another </a>;...
        • ...which it finds, but the next <td class="qty" style...> also fails...
        • ...so, when there are no more </a>; to try, it backtracks to .*?>(.+?)</a>...
        • ...I started to lose the will to live at this point, but observe that there are five earlier .*? available to backtrack to...

        I doubt that this is an infinite loop. But I haven't the patience to establish one way or the other.

        Not sure what to suggest here... Where possible I would replace .*? by, for example, [^<]*? -- which limits the scope for backtracking. There is other regex magic to limit backtracking, but I'm not familiar with it.

        However, the big problem I see is that you can never be sure whether your regex has failed because there is no more data of interest, or because it's not recognising a small variation in format. Do you have a drawing board ?

        Having seen the pattern, and the description you give of the problem, I suspect that when the program appears to hang, it's actually really really busy trying to match the pattern. It probably won't match, but if it does (after a long time), it's likely to not match what you intended.

        You're match is sprinkled with .* and .+. That's usually fine if you know it's going to match. But if it doesn't (and the optimizer hasn't determined this already - but clearly in your case, it hasn't) perl will try every possible length. I count 11 uses of .* and .+, which means your match is running in O(n11) time, where n is the length of the HTML document.

        That will take a while. You'd be much better off in making your .* and .+ much more restrictive. For instance,

        id="market-watch-(.+?)"
        can probably be written as
        id="market-watch-([^"]++)" # Lose one + if you don't use 5.10
        You could consider using a parser.

        This extracts the first five bits of data you're after from a fragment of the first url you listed. Sure, there is a lot of it but, imo, this way will be easier to write/maintain and easily adapted should the HTML change (which it will).

        #!/usr/bin/perl use warnings; use strict; use HTML::TreeBuilder; my $html = do{local $/;<DATA>}; my $t = HTML::TreeBuilder->new_from_content( $html ) or die qq{cant parse string\n}; my $td = $t->look_down( q{_tag} => q{td}, q{class} => q{marketWatch}, q{style} => qr{width:\s+20px;}, ) or die qq{cant find td\n}; my $script = $td->look_down( q{_tag} => q{script}, ) or die qq{cant find script\n}; my $js = $script->as_HTML; my ($ugid, $marketid) = $js =~ /ugid=([^&]+)&marketId=([^&]+)&/; printf qq{%-10s: %s\n}, q{ugid}, $ugid; printf qq{%-10s: %s\n}, q{marketId}, $marketid; $td = $t->look_down( q{_tag} => q{td}, q{style} => qr{width:\s+22%}, ) or die qq{cant find name\n}; printf qq{%-10s: %s\n}, q{name}, $td->as_trimmed_text; $td = $t->look_down( q{_tag} => q{td}, q{class} => q{qty}, ) or die qq{cant find qty\n}; printf qq{%-10s: %s\n}, q{qty}, $td->as_trimmed_text; $td = $t->look_down( q{_tag} => q{td}, q{class} => q{price}, ) or die qq{cant find qty\n}; printf qq{%-10s: %s\n}, q{price}, $td->as_trimmed_text; # gratuitous white space wiped __DATA__ <td class="marketWatch" style="width: 20px;"><p class="pngfix" id="mar +ket-watch-1">&#160;</p> <script type="text/javascript"> $('market-watch-1').onclick=function(){location.href='/marketWatch.h +tml?ugid=696398000&marketId=700213356&action=add';}; </script> </td> <td style="width: 25%;">Super Bowl XLIII</td> <td style="width: 22%;"> <a href="/marketplace/sports/nfl/super-bowl-xliii/upper-deck-end-zon +e/1/ravens">Upper Deck End Zone</a> </td> <td class="qty"> <input type="hidden" id="qty_select_1" value="2"/>2 </td> <td class="price" style="width: 18%;"> <input type="hidden" id="trade_price_1" value="25" /> <span class="price">$25.00</span> </td>
        ugid : 696398000 marketId : 700213356 name : Upper Deck End Zone qty : 2 price : $25.00