in reply to Re^3: Regex infinite loop?
in thread Regex infinite loop?

Okay, here is the relevant (I think) piece of the code

if ($tempType eq "Bid") { while ($content =~m%<td class="marketWatch" style="width: 20p +x;"><p class="pngfix" id="market-watch-(.+?)">&#160;</p>.*?ugid=(.+?) +&marketId=(.+?)&action=add';};.+?<td style="width: 22\%;">\s+<a href= +"/marketplace/sports/nfl/super-bowl-xliii.*?>(.+?)</a>\s+</td>\s+<td +class="qty">.+?(?:"selected"\s+?>|<input type="hidden" id="qty_select +_\d+" value=")(.+?)(?:</option>|"/>).+?<span class="price">\$(.+?)</s +pan>%gs) { print(STDERR "[", pos($content), "]") ; $tempCount = $tempCount+1 ; print "\ntempCount = $tempCount\n" ; $tempMarketWatchID = $1 ; $tempugID = $2 ; $tempmarketID = $3 ; $tempLocation = $4 ; $tempBidQuantity = $5 ; $tempBid = $6 ; $tempLocation =~ s/^\s+// ; $tempLocation =~ s/\s+$// ; print "$date\t$time\t$tempType\t$teamName\t$tempMarketWatchID\t +$tempugID\t$tempmarketID\t$tempLocation\t$tempBidQuantity\t$tempBid\n +" ; } #end while print "Leaving the bid while\n" ; } #end if

The pages that I am searching are:

http://www.firstdibz.com/markets/sports/nfl/ravens/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/bills/super-bowl-xliii/bid +/show/buy http://www.firstdibz.com/markets/sports/nfl/bengals/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/browns/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/broncos/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/texans/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/jaguars/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/chiefs/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/dolphins/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/patriots/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/jets/super-bowl-xliii/bid/ +show/buy http://www.firstdibz.com/markets/sports/nfl/raiders/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/steelers/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/chargers/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/titans/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/cardinals/super-bowl-xliii +/bid/show/buy http://www.firstdibz.com/markets/sports/nfl/falcons/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/panthers/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/bears/super-bowl-xliii/bid +/show/buy http://www.firstdibz.com/markets/sports/nfl/cowboys/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/lions/super-bowl-xliii/bid +/show/buy http://www.firstdibz.com/markets/sports/nfl/packers/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/vikings/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/saints/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/giants/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/eagles/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/49ers/super-bowl-xliii/bid +/show/buy http://www.firstdibz.com/markets/sports/nfl/seahawks/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/rams/super-bowl-xliii/bid/ +show/buy http://www.firstdibz.com/markets/sports/nfl/buccaneers/super-bowl-xlii +i/bid/show/buy http://www.firstdibz.com/markets/sports/nfl/redskins/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/colts/super-bowl-xliii/bid +/show/buy

The pages that cause my code to hang are the colts and the jets. There maybe others, but I know that every page prior to the jets works fine. After the jets URL, I don't know as I have not tested. (The colts used to be earlier in my list, but I moved it when it started hanging.)

Replies are listed 'Best First'.
Re^5: Regex infinite loop?
by gone2015 (Deacon) on Oct 17, 2008 at 12:37 UTC

    Yikes ! This gave me an opportunity to learn a bit about debuggering regex.

    The pragma use re 'debug' will tell you more than you ever thought you might have wanted to know about the regex being processed.

    The problem I found was that the 'colts' html contains:

      <td class="qty" style="width: 7%;">
    
    where your regex was looking for:
      <td class="qty">
    
    Because the regex is littered with .*?, there is lots of scope for backtracking. As far as I can see:

    • first it backtracks to (.+?)</a> and goes screaming forward trying to find another </a>;...
    • ...which it finds, but the next <td class="qty" style...> also fails...
    • ...so, when there are no more </a>; to try, it backtracks to .*?>(.+?)</a>...
    • ...I started to lose the will to live at this point, but observe that there are five earlier .*? available to backtrack to...

    I doubt that this is an infinite loop. But I haven't the patience to establish one way or the other.

    Not sure what to suggest here... Where possible I would replace .*? by, for example, [^<]*? -- which limits the scope for backtracking. There is other regex magic to limit backtracking, but I'm not familiar with it.

    However, the big problem I see is that you can never be sure whether your regex has failed because there is no more data of interest, or because it's not recognising a small variation in format. Do you have a drawing board ?

      Thank you very much. I'm going to do a mea culpa here. The code is not, in fact, hanging -- it's just taking a very long time to execute. The thing that confuses me, though, is that for most pages, it does all of the matching in a matter of a few seconds. For the colts and the jets however, it takes something like 2 HOURS. Intuitively, it seems like there must be something that is different about their pages, but I'm not sure what it is. I think that I need to learn more about how the regex engine works. I clearly don't have a good understanding of backtracking. Maybe it's time for me to get myself a copy of "Mastering Regular Expressions!"

      I will give what you have suggested a try.

      Thank you very much for your efforts on my behalf.

Re^5: Regex infinite loop?
by JavaFan (Canon) on Oct 17, 2008 at 13:48 UTC
    Having seen the pattern, and the description you give of the problem, I suspect that when the program appears to hang, it's actually really really busy trying to match the pattern. It probably won't match, but if it does (after a long time), it's likely to not match what you intended.

    You're match is sprinkled with .* and .+. That's usually fine if you know it's going to match. But if it doesn't (and the optimizer hasn't determined this already - but clearly in your case, it hasn't) perl will try every possible length. I count 11 uses of .* and .+, which means your match is running in O(n11) time, where n is the length of the HTML document.

    That will take a while. You'd be much better off in making your .* and .+ much more restrictive. For instance,

    id="market-watch-(.+?)"
    can probably be written as
    id="market-watch-([^"]++)" # Lose one + if you don't use 5.10

      You are correct. The code is not, in fact, hanging -- it's just taking a very long time to execute. I "discovered" this when I unknowingly left the program running over night and came back the next morning--it had executed all the way through. The thing that confuses me, though, is that for most pages, it does all of the matching in a matter of a few seconds. For the colts and the jets however, it takes something like 2 HOURS. Intuitively, it seems like there must be something that is different about their pages, but I'm not sure what it is.

        Of course there's something different - their content. And that makes all the difference. (After all, if there wasn't a difference, you wouldn't do this exercise with different pages, would you?)

      I'm hoping you can help me understand a couple of things. First, I thought that if I used the "?" I got non-greedy searching. Shouldn't that be helping things out?

      Also, I'm trying to understand the code that you have suggested. Within the capturing parentheses you have [^"]++. What does this mean exactly? I read it as match anything that's not a double quote. I'm sure that I am wrong. Also, I don't know what the ++ does. I have only ever used one + to indicate "match at least once". What does the double plus, ++, mean? How would I say (in English) what is going on with [^"]++?

        First, I thought that if I used the "?" I got non-greedy searching. Shouldn't that be helping things out?
        You are right about the first part. The secondary ? quantifier makes the match non-greedy. But greedy/non-greedy only makes a (possible) difference if there is a match. It will not change the fact whether something will or will not match. And while there might be a difference in performance when there is a match (it could go either way, but Friedl suggests that using ? is slower in most cases), there will usually not be much of a performance difference if there's no match. Perl will try all possible lengths before giving up, and it hardly matters when starting from longest match working towards shortest or starting with shortest working up to longest.
        Within the capturing parentheses you have [^"]++. What does this mean exactly?
        It means, match as many characters that aren't double quotes, and once you've found that many, do not try with less characters if the regexp engine backtracks to this point. The not "giving back" characters is the meaning of the second +, and was introduced in 5.10. The reason I used it here is that in:
        something[^"]++"
        once the regexp engine has matched 'something', a string of non-double quotes and then a double quote, and the rest of the pattern fails, and hence, the engine backtracks to the matching of the string of non-double quote characters, it's pointless to try it with one character less in the string of non-double quote characters: after all, the next character has to be a double quote.

      Okay, I think I get at least a part of what [^"]++ is doing. It is saying "match until you get to the next double quote." Is that correct?

        You have an extra +. YAPE::Regex::Explain
        #!/usr/bin/perl -- use strict; use warnings; use YAPE::Regex::Explain; print YAPE::Regex::Explain->new(qr~[^"]+~)->explain; __END__ The regular expression: (?-imsx:[^"]+) matches as follows: NODE EXPLANATION ---------------------------------------------------------------------- (?-imsx: group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally): ---------------------------------------------------------------------- [^"]+ any character except: '"' (1 or more times (matching the most amount possible)) ---------------------------------------------------------------------- ) end of grouping ----------------------------------------------------------------------
Re^5: Regex infinite loop?
by wfsp (Abbot) on Oct 17, 2008 at 14:44 UTC
    You could consider using a parser.

    This extracts the first five bits of data you're after from a fragment of the first url you listed. Sure, there is a lot of it but, imo, this way will be easier to write/maintain and easily adapted should the HTML change (which it will).

    #!/usr/bin/perl use warnings; use strict; use HTML::TreeBuilder; my $html = do{local $/;<DATA>}; my $t = HTML::TreeBuilder->new_from_content( $html ) or die qq{cant parse string\n}; my $td = $t->look_down( q{_tag} => q{td}, q{class} => q{marketWatch}, q{style} => qr{width:\s+20px;}, ) or die qq{cant find td\n}; my $script = $td->look_down( q{_tag} => q{script}, ) or die qq{cant find script\n}; my $js = $script->as_HTML; my ($ugid, $marketid) = $js =~ /ugid=([^&]+)&marketId=([^&]+)&/; printf qq{%-10s: %s\n}, q{ugid}, $ugid; printf qq{%-10s: %s\n}, q{marketId}, $marketid; $td = $t->look_down( q{_tag} => q{td}, q{style} => qr{width:\s+22%}, ) or die qq{cant find name\n}; printf qq{%-10s: %s\n}, q{name}, $td->as_trimmed_text; $td = $t->look_down( q{_tag} => q{td}, q{class} => q{qty}, ) or die qq{cant find qty\n}; printf qq{%-10s: %s\n}, q{qty}, $td->as_trimmed_text; $td = $t->look_down( q{_tag} => q{td}, q{class} => q{price}, ) or die qq{cant find qty\n}; printf qq{%-10s: %s\n}, q{price}, $td->as_trimmed_text; # gratuitous white space wiped __DATA__ <td class="marketWatch" style="width: 20px;"><p class="pngfix" id="mar +ket-watch-1">&#160;</p> <script type="text/javascript"> $('market-watch-1').onclick=function(){location.href='/marketWatch.h +tml?ugid=696398000&marketId=700213356&action=add';}; </script> </td> <td style="width: 25%;">Super Bowl XLIII</td> <td style="width: 22%;"> <a href="/marketplace/sports/nfl/super-bowl-xliii/upper-deck-end-zon +e/1/ravens">Upper Deck End Zone</a> </td> <td class="qty"> <input type="hidden" id="qty_select_1" value="2"/>2 </td> <td class="price" style="width: 18%;"> <input type="hidden" id="trade_price_1" value="25" /> <span class="price">$25.00</span> </td>
    ugid : 696398000 marketId : 700213356 name : Upper Deck End Zone qty : 2 price : $25.00

      Wow! I was thinking this might be the way I would (ultimately) have to go. I was reading about TreeBuilder in "Perl & LWP" on the train this morning. I am a PERL novice -- this will be a great learning tool for me! Thank you very much.

        The first step in growing from a novice to an earnest student of Perl is never ever write "PERL". The language is called Perl and the implementation is perl, but "PERL" makes you stand out as someone who knows nothing about it and from your code it is clear you are well past that point.

        CountZero

        A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James