Re^4: Regex infinite loop?
by Ninth Prince (Acolyte) on Oct 17, 2008 at 00:39 UTC
|
Okay, here is the relevant (I think) piece of the code
if ($tempType eq "Bid") {
while ($content =~m%<td class="marketWatch" style="width: 20p
+x;"><p class="pngfix" id="market-watch-(.+?)"> </p>.*?ugid=(.+?)
+&marketId=(.+?)&action=add';};.+?<td style="width: 22\%;">\s+<a href=
+"/marketplace/sports/nfl/super-bowl-xliii.*?>(.+?)</a>\s+</td>\s+<td
+class="qty">.+?(?:"selected"\s+?>|<input type="hidden" id="qty_select
+_\d+" value=")(.+?)(?:</option>|"/>).+?<span class="price">\$(.+?)</s
+pan>%gs) {
print(STDERR "[", pos($content), "]") ;
$tempCount = $tempCount+1 ;
print "\ntempCount = $tempCount\n" ;
$tempMarketWatchID = $1 ;
$tempugID = $2 ;
$tempmarketID = $3 ;
$tempLocation = $4 ;
$tempBidQuantity = $5 ;
$tempBid = $6 ;
$tempLocation =~ s/^\s+// ;
$tempLocation =~ s/\s+$// ;
print "$date\t$time\t$tempType\t$teamName\t$tempMarketWatchID\t
+$tempugID\t$tempmarketID\t$tempLocation\t$tempBidQuantity\t$tempBid\n
+" ;
} #end while
print "Leaving the bid while\n" ;
} #end if
The pages that I am searching are:
http://www.firstdibz.com/markets/sports/nfl/ravens/super-bowl-xliii/bi
+d/show/buy
http://www.firstdibz.com/markets/sports/nfl/bills/super-bowl-xliii/bid
+/show/buy
http://www.firstdibz.com/markets/sports/nfl/bengals/super-bowl-xliii/b
+id/show/buy
http://www.firstdibz.com/markets/sports/nfl/browns/super-bowl-xliii/bi
+d/show/buy
http://www.firstdibz.com/markets/sports/nfl/broncos/super-bowl-xliii/b
+id/show/buy
http://www.firstdibz.com/markets/sports/nfl/texans/super-bowl-xliii/bi
+d/show/buy
http://www.firstdibz.com/markets/sports/nfl/jaguars/super-bowl-xliii/b
+id/show/buy
http://www.firstdibz.com/markets/sports/nfl/chiefs/super-bowl-xliii/bi
+d/show/buy
http://www.firstdibz.com/markets/sports/nfl/dolphins/super-bowl-xliii/
+bid/show/buy
http://www.firstdibz.com/markets/sports/nfl/patriots/super-bowl-xliii/
+bid/show/buy
http://www.firstdibz.com/markets/sports/nfl/jets/super-bowl-xliii/bid/
+show/buy
http://www.firstdibz.com/markets/sports/nfl/raiders/super-bowl-xliii/b
+id/show/buy
http://www.firstdibz.com/markets/sports/nfl/steelers/super-bowl-xliii/
+bid/show/buy
http://www.firstdibz.com/markets/sports/nfl/chargers/super-bowl-xliii/
+bid/show/buy
http://www.firstdibz.com/markets/sports/nfl/titans/super-bowl-xliii/bi
+d/show/buy
http://www.firstdibz.com/markets/sports/nfl/cardinals/super-bowl-xliii
+/bid/show/buy
http://www.firstdibz.com/markets/sports/nfl/falcons/super-bowl-xliii/b
+id/show/buy
http://www.firstdibz.com/markets/sports/nfl/panthers/super-bowl-xliii/
+bid/show/buy
http://www.firstdibz.com/markets/sports/nfl/bears/super-bowl-xliii/bid
+/show/buy
http://www.firstdibz.com/markets/sports/nfl/cowboys/super-bowl-xliii/b
+id/show/buy
http://www.firstdibz.com/markets/sports/nfl/lions/super-bowl-xliii/bid
+/show/buy
http://www.firstdibz.com/markets/sports/nfl/packers/super-bowl-xliii/b
+id/show/buy
http://www.firstdibz.com/markets/sports/nfl/vikings/super-bowl-xliii/b
+id/show/buy
http://www.firstdibz.com/markets/sports/nfl/saints/super-bowl-xliii/bi
+d/show/buy
http://www.firstdibz.com/markets/sports/nfl/giants/super-bowl-xliii/bi
+d/show/buy
http://www.firstdibz.com/markets/sports/nfl/eagles/super-bowl-xliii/bi
+d/show/buy
http://www.firstdibz.com/markets/sports/nfl/49ers/super-bowl-xliii/bid
+/show/buy
http://www.firstdibz.com/markets/sports/nfl/seahawks/super-bowl-xliii/
+bid/show/buy
http://www.firstdibz.com/markets/sports/nfl/rams/super-bowl-xliii/bid/
+show/buy
http://www.firstdibz.com/markets/sports/nfl/buccaneers/super-bowl-xlii
+i/bid/show/buy
http://www.firstdibz.com/markets/sports/nfl/redskins/super-bowl-xliii/
+bid/show/buy
http://www.firstdibz.com/markets/sports/nfl/colts/super-bowl-xliii/bid
+/show/buy
The pages that cause my code to hang are the colts and the jets. There maybe others, but I know that every page prior to the jets works fine. After the jets URL, I don't know as I have not tested. (The colts used to be earlier in my list, but I moved it when it started hanging.) | [reply] [d/l] [select] |
|
|
Yikes ! This gave me an opportunity to learn a bit about debuggering regex.
The pragma use re 'debug' will tell you more than you ever thought you might have wanted to know about the regex being processed.
The problem I found was that the 'colts' html contains:
<td class="qty" style="width: 7%;">
where your regex was looking for:
<td class="qty">
Because the regex is littered with .*?, there is lots of scope for backtracking. As far as I can see:
- first it backtracks to (.+?)</a> and goes screaming forward trying to find another </a>;...
- ...which it finds, but the next <td class="qty" style...> also fails...
- ...so, when there are no more </a>; to try, it backtracks to .*?>(.+?)</a>...
- ...I started to lose the will to live at this point, but observe that there are five earlier .*? available to backtrack to...
I doubt that this is an infinite loop. But I haven't the patience to establish one way or the other.
Not sure what to suggest here... Where possible I would replace .*? by, for example, [^<]*? -- which limits the scope for backtracking. There is other regex magic to limit backtracking, but I'm not familiar with it.
However, the big problem I see is that you can never be sure whether your regex has failed because there is no more data of interest, or because it's not recognising a small variation in format. Do you have a drawing board ?
| [reply] [d/l] [select] |
|
|
Thank you very much. I'm going to do a mea culpa here. The code is not, in fact, hanging -- it's just taking a very long time to execute. The thing that confuses me, though, is that for most pages, it does all of the matching in a matter of a few seconds. For the colts and the jets however, it takes something like 2 HOURS. Intuitively, it seems like there must be something that is different about their pages, but I'm not sure what it is. I think that I need to learn more about how the regex engine works. I clearly don't have a good understanding of backtracking. Maybe it's time for me to get myself a copy of "Mastering Regular Expressions!"
I will give what you have suggested a try.
Thank you very much for your efforts on my behalf.
| [reply] |
|
|
Having seen the pattern, and the description you give of the problem, I suspect that when the program appears to hang, it's actually really really busy trying to match the pattern. It probably won't match, but if it does (after a long time), it's likely to not match what you intended.
You're match is sprinkled with .* and .+. That's usually fine if you know it's going to match. But if it doesn't (and the optimizer hasn't determined this already - but clearly in your case, it hasn't) perl will try every possible length. I count 11 uses of .* and .+, which means your match is running in O(n11) time, where n is the length of the HTML document.
That will take a while. You'd be much better off in making your .* and .+ much more restrictive. For instance,
id="market-watch-(.+?)"
can probably be written as
id="market-watch-([^"]++)" # Lose one + if you don't use 5.10
| [reply] [d/l] [select] |
|
|
You are correct. The code is not, in fact, hanging -- it's just taking a very long time to execute. I "discovered" this when I unknowingly left the program running over night and came back the next morning--it had executed all the way through. The thing that confuses me, though, is that for most pages, it does all of the matching in a matter of a few seconds. For the colts and the jets however, it takes something like 2 HOURS. Intuitively, it seems like there must be something that is different about their pages, but I'm not sure what it is.
| [reply] |
|
|
|
|
I'm hoping you can help me understand a couple of things. First, I thought that if I used the "?" I got non-greedy searching. Shouldn't that be helping things out?
Also, I'm trying to understand the code that you have suggested. Within the capturing parentheses you have [^"]++. What does this mean exactly? I read it as match anything that's not a double quote. I'm sure that I am wrong. Also, I don't know what the ++ does. I have only ever used one + to indicate "match at least once". What does the double plus, ++, mean? How would I say (in English) what is going on with [^"]++?
| [reply] [d/l] [select] |
|
|
|
|
|
|
|
|
| [reply] [d/l] |
|
|
|
|
You could consider using a parser.
This extracts the first five bits of data you're after from a fragment of the first url you listed. Sure, there is a lot of it but, imo, this way will be easier to write/maintain and easily adapted should the HTML change (which it will).
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;
my $html = do{local $/;<DATA>};
my $t = HTML::TreeBuilder->new_from_content(
$html
) or die qq{cant parse string\n};
my $td = $t->look_down(
q{_tag} => q{td},
q{class} => q{marketWatch},
q{style} => qr{width:\s+20px;},
) or die qq{cant find td\n};
my $script = $td->look_down(
q{_tag} => q{script},
) or die qq{cant find script\n};
my $js = $script->as_HTML;
my ($ugid, $marketid) = $js =~ /ugid=([^&]+)&marketId=([^&]+)&/;
printf qq{%-10s: %s\n}, q{ugid}, $ugid;
printf qq{%-10s: %s\n}, q{marketId}, $marketid;
$td = $t->look_down(
q{_tag} => q{td},
q{style} => qr{width:\s+22%},
) or die qq{cant find name\n};
printf qq{%-10s: %s\n}, q{name}, $td->as_trimmed_text;
$td = $t->look_down(
q{_tag} => q{td},
q{class} => q{qty},
) or die qq{cant find qty\n};
printf qq{%-10s: %s\n}, q{qty}, $td->as_trimmed_text;
$td = $t->look_down(
q{_tag} => q{td},
q{class} => q{price},
) or die qq{cant find qty\n};
printf qq{%-10s: %s\n}, q{price}, $td->as_trimmed_text;
# gratuitous white space wiped
__DATA__
<td class="marketWatch" style="width: 20px;"><p class="pngfix" id="mar
+ket-watch-1"> </p>
<script type="text/javascript">
$('market-watch-1').onclick=function(){location.href='/marketWatch.h
+tml?ugid=696398000&marketId=700213356&action=add';};
</script>
</td>
<td style="width: 25%;">Super Bowl XLIII</td>
<td style="width: 22%;">
<a href="/marketplace/sports/nfl/super-bowl-xliii/upper-deck-end-zon
+e/1/ravens">Upper Deck End Zone</a>
</td>
<td class="qty">
<input type="hidden" id="qty_select_1" value="2"/>2
</td>
<td class="price" style="width: 18%;">
<input type="hidden" id="trade_price_1" value="25" />
<span class="price">$25.00</span>
</td>
ugid : 696398000
marketId : 700213356
name : Upper Deck End Zone
qty : 2
price : $25.00
| [reply] [d/l] [select] |
|
|
| [reply] |
|
|