Re^3: Regex infinite loop?

Replies are listed 'Best First'.
Re^4: Regex infinite loop? by Ninth Prince (Acolyte) on Oct 17, 2008 at 00:39 UTC
Okay, here is the relevant (I think) piece of the code if ($tempType eq "Bid") { while ($content =~m%<td class="marketWatch" style="width: 20p +x;"><p class="pngfix" id="market-watch-(.+?)"> </p>.?ugid=(.+?) +&marketId=(.+?)&action=add';};.+?<td style="width: 22\%;">\s+<a href= +"/marketplace/sports/nfl/super-bowl-xliii.?>(.+?)</a>\s+</td>\s+<td +class="qty">.+?(?:"selected"\s+?>\|<input type="hidden" id="qty_select +_\d+" value=")(.+?)(?:</option>\|"/>).+?<span class="price">\$(.+?)</s +pan>%gs) { print(STDERR "[", pos($content), "]") ; $tempCount = $tempCount+1 ; print "\ntempCount = $tempCount\n" ; $tempMarketWatchID = $1 ; $tempugID = $2 ; $tempmarketID = $3 ; $tempLocation = $4 ; $tempBidQuantity = $5 ; $tempBid = $6 ; $tempLocation =~ s/^\s+// ; $tempLocation =~ s/\s+$// ; print "$date\t$time\t$tempType\t$teamName\t$tempMarketWatchID\t +$tempugID\t$tempmarketID\t$tempLocation\t$tempBidQuantity\t$tempBid\n +" ; } #end while print "Leaving the bid while\n" ; } #end if [download] The pages that I am searching are: http://www.firstdibz.com/markets/sports/nfl/ravens/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/bills/super-bowl-xliii/bid +/show/buy http://www.firstdibz.com/markets/sports/nfl/bengals/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/browns/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/broncos/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/texans/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/jaguars/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/chiefs/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/dolphins/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/patriots/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/jets/super-bowl-xliii/bid/ +show/buy http://www.firstdibz.com/markets/sports/nfl/raiders/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/steelers/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/chargers/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/titans/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/cardinals/super-bowl-xliii +/bid/show/buy http://www.firstdibz.com/markets/sports/nfl/falcons/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/panthers/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/bears/super-bowl-xliii/bid +/show/buy http://www.firstdibz.com/markets/sports/nfl/cowboys/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/lions/super-bowl-xliii/bid +/show/buy http://www.firstdibz.com/markets/sports/nfl/packers/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/vikings/super-bowl-xliii/b +id/show/buy http://www.firstdibz.com/markets/sports/nfl/saints/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/giants/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/eagles/super-bowl-xliii/bi +d/show/buy http://www.firstdibz.com/markets/sports/nfl/49ers/super-bowl-xliii/bid +/show/buy http://www.firstdibz.com/markets/sports/nfl/seahawks/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/rams/super-bowl-xliii/bid/ +show/buy http://www.firstdibz.com/markets/sports/nfl/buccaneers/super-bowl-xlii +i/bid/show/buy http://www.firstdibz.com/markets/sports/nfl/redskins/super-bowl-xliii/ +bid/show/buy http://www.firstdibz.com/markets/sports/nfl/colts/super-bowl-xliii/bid +/show/buy [download] The pages that cause my code to hang are the colts and the jets. There maybe others, but I know that every page prior to the jets works fine. After the jets URL, I don't know as I have not tested. (The colts used to be earlier in my list, but I moved it when it started hanging.)	[reply] [d/l] [select]
Re^5: Regex infinite loop? by gone2015 (Deacon) on Oct 17, 2008 at 12:37 UTC
Yikes ! This gave me an opportunity to learn a bit about debuggering regex. The pragma `use re 'debug'` will tell you more than you ever thought you might have wanted to know about the regex being processed. The problem I found was that the 'colts' html contains: <td class="qty" style="width: 7%;"> where your regex was looking for: <td class="qty"> Because the regex is littered with `.?`, there is lots of scope for backtracking. As far as I can see: first it backtracks to `(.+?)</a>` and goes screaming forward trying to find another `</a>;`... ...which it finds, but the next `<td class="qty" style...>` also fails... ...so, when there are no more `</a>;` to try, it backtracks to `.?>(.+?)</a>`... ...I started to lose the will to live at this point, but observe that there are five earlier `.?` available to backtrack to... I doubt that this is an infinite loop. But I haven't the patience to establish one way or the other. Not sure what to suggest here... Where possible I would replace `.?` by, for example, `[^<]*?` -- which limits the scope for backtracking. There is other regex magic to limit backtracking, but I'm not familiar with it. However, the big problem I see is that you can never be sure whether your regex has failed because there is no more data of interest, or because it's not recognising a small variation in format. Do you have a drawing board ?	[reply] [d/l] [select]
Re^6: Regex infinite loop? by Ninth Prince (Acolyte) on Oct 17, 2008 at 14:32 UTC
Thank you very much. I'm going to do a mea culpa here. The code is not, in fact, hanging -- it's just taking a very long time to execute. The thing that confuses me, though, is that for most pages, it does all of the matching in a matter of a few seconds. For the colts and the jets however, it takes something like 2 HOURS. Intuitively, it seems like there must be something that is different about their pages, but I'm not sure what it is. I think that I need to learn more about how the regex engine works. I clearly don't have a good understanding of backtracking. Maybe it's time for me to get myself a copy of "Mastering Regular Expressions!" I will give what you have suggested a try. Thank you very much for your efforts on my behalf.	[reply]
Re^5: Regex infinite loop? by JavaFan (Canon) on Oct 17, 2008 at 13:48 UTC
Having seen the pattern, and the description you give of the problem, I suspect that when the program appears to hang, it's actually really really busy trying to match the pattern. It probably won't match, but if it does (after a long time), it's likely to not match what you intended. You're match is sprinkled with .* and .+. That's usually fine if you know it's going to match. But if it doesn't (and the optimizer hasn't determined this already - but clearly in your case, it hasn't) perl will try every possible length. I count 11 uses of .* and .+, which means your match is running in `O(n¹¹)` time, where `n` is the length of the HTML document. That will take a while. You'd be much better off in making your .* and .+ much more restrictive. For instance, `id="market-watch-(.+?)"` [download] can probably be written as `id="market-watch-([^"]++)" # Lose one + if you don't use 5.10` [download]	[reply] [d/l] [select]
Re^6: Regex infinite loop? by Ninth Prince (Acolyte) on Oct 17, 2008 at 14:35 UTC
You are correct. The code is not, in fact, hanging -- it's just taking a very long time to execute. I "discovered" this when I unknowingly left the program running over night and came back the next morning--it had executed all the way through. The thing that confuses me, though, is that for most pages, it does all of the matching in a matter of a few seconds. For the colts and the jets however, it takes something like 2 HOURS. Intuitively, it seems like there must be something that is different about their pages, but I'm not sure what it is.	[reply]
Re^7: Regex infinite loop? by JavaFan (Canon) on Oct 17, 2008 at 14:49 UTC
Re^6: Regex infinite loop? by Ninth Prince (Acolyte) on Oct 17, 2008 at 15:46 UTC
I'm hoping you can help me understand a couple of things. First, I thought that if I used the "?" I got non-greedy searching. Shouldn't that be helping things out? Also, I'm trying to understand the code that you have suggested. Within the capturing parentheses you have `[^"]++`. What does this mean exactly? I read it as match anything that's not a double quote. I'm sure that I am wrong. Also, I don't know what the ++ does. I have only ever used one + to indicate "match at least once". What does the double plus, ++, mean? How would I say (in English) what is going on with `[^"]++`?	[reply] [d/l] [select]
Re^7: Regex infinite loop? by JavaFan (Canon) on Oct 17, 2008 at 15:58 UTC
Re^8: Regex infinite loop? by Ninth Prince (Acolyte) on Oct 17, 2008 at 16:09 UTC
Re^8: Regex infinite loop? by Ninth Prince (Acolyte) on Oct 17, 2008 at 17:03 UTC
Re^6: Regex infinite loop? by Ninth Prince (Acolyte) on Oct 17, 2008 at 16:03 UTC
Okay, I think I get at least a part of what `[^"]++` is doing. It is saying "match until you get to the next double quote." Is that correct?	[reply] [d/l]
Re^7: Regex infinite loop? by Anonymous Monk on Oct 18, 2008 at 00:02 UTC
Re^5: Regex infinite loop? by wfsp (Abbot) on Oct 17, 2008 at 14:44 UTC
You could consider using a parser. This extracts the first five bits of data you're after from a fragment of the first url you listed. Sure, there is a lot of it but, imo, this way will be easier to write/maintain and easily adapted should the HTML change (which it will). #!/usr/bin/perl use warnings; use strict; use HTML::TreeBuilder; my $html = do{local $/;<DATA>}; my $t = HTML::TreeBuilder->new_from_content( $html ) or die qq{cant parse string\n}; my $td = $t->look_down( q{_tag} => q{td}, q{class} => q{marketWatch}, q{style} => qr{width:\s+20px;}, ) or die qq{cant find td\n}; my $script = $td->look_down( q{_tag} => q{script}, ) or die qq{cant find script\n}; my $js = $script->as_HTML; my ($ugid, $marketid) = $js =~ /ugid=([^&]+)&marketId=([^&]+)&/; printf qq{%-10s: %s\n}, q{ugid}, $ugid; printf qq{%-10s: %s\n}, q{marketId}, $marketid; $td = $t->look_down( q{_tag} => q{td}, q{style} => qr{width:\s+22%}, ) or die qq{cant find name\n}; printf qq{%-10s: %s\n}, q{name}, $td->as_trimmed_text; $td = $t->look_down( q{_tag} => q{td}, q{class} => q{qty}, ) or die qq{cant find qty\n}; printf qq{%-10s: %s\n}, q{qty}, $td->as_trimmed_text; $td = $t->look_down( q{_tag} => q{td}, q{class} => q{price}, ) or die qq{cant find qty\n}; printf qq{%-10s: %s\n}, q{price}, $td->as_trimmed_text; # gratuitous white space wiped __DATA__ <td class="marketWatch" style="width: 20px;"><p class="pngfix" id="mar +ket-watch-1"> </p> <script type="text/javascript"> $('market-watch-1').onclick=function(){location.href='/marketWatch.h +tml?ugid=696398000&marketId=700213356&action=add';}; </script> </td> <td style="width: 25%;">Super Bowl XLIII</td> <td style="width: 22%;"> <a href="/marketplace/sports/nfl/super-bowl-xliii/upper-deck-end-zon +e/1/ravens">Upper Deck End Zone</a> </td> <td class="qty"> <input type="hidden" id="qty_select_1" value="2"/>2 </td> <td class="price" style="width: 18%;"> <input type="hidden" id="trade_price_1" value="25" /> <span class="price">$25.00</span> </td> [download] `ugid : 696398000 marketId : 700213356 name : Upper Deck End Zone qty : 2 price : $25.00` [download]	[reply] [d/l] [select]
Re^6: Regex infinite loop? by Ninth Prince (Acolyte) on Oct 17, 2008 at 15:03 UTC
Wow! I was thinking this might be the way I would (ultimately) have to go. I was reading about TreeBuilder in "Perl & LWP" on the train this morning. I am a PERL novice -- this will be a great learning tool for me! Thank you very much.	[reply]
Re^7: Regex infinite loop? by CountZero (Bishop) on Oct 18, 2008 at 21:21 UTC