cdherold has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I'm stymied once again ... I'm pulling links off a webpage at http://biz.yahoo.com/rf/archive.html. For a long time (about a year) I had a cronjob set to go to this page every hour and retrieve all the links and check the link contents.

All of a sudden the program stopped working, so I shut it down. Now I'm going back to see what the problem might be. When I run the program after it hasn't run for hours or days, it pulls the links down fine and completes its run. But when I try to run it twice in a row (even with a 10 minute break in between), it won't go the second time. Is the website keeping track of my program and blocking it from repeated visits somehow? If so, how would I fix this? Any other ideas about what might be happening?

Muchas gracias, Monks. I appreciate all your help.

chris


Re: CGI to Pull links off webpage fails on second run
by dws (Chancellor) on Apr 10, 2003 at 21:18 UTC
    Is the website keeping track of my program and blocking it from repeated visits somehow?

    If you can hit this site twice in a row with your browser, then chances are good that they're doing one of several things:

    • The site might be issuing a cookie on your first request, recording your visit, then noticing that you're not supplying the cookie on a subsequent request. Is your script detecting any cookies? (See the sketch after this list for one way to handle them.)
    • The site might be noticing that you request the first page, but not images (including ads). The site then "flunks" you on subsequent requests for the home page. Some sites have gotten aggressive about making sure that their ads are seen. Try having your script fetch (and quietly ignore) images, .js files, .css files, etc., so that your script behaves more like a browser. Alternatively, try setting your browser to not fetch images, and see if you notice different behavior through the browser.
    • They might be throttling based on browser type. Is your script pretending to be a well-known browser, or are you letting LWP supply a default value? (The sketch after this list also sets an explicit agent string.)
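    For the cookie and user-agent points, a minimal LWP sketch follows. This is not the original script: the cookie file name is just a placeholder, and the agent string is only an example of a browser-like value.

        use LWP::UserAgent;
        use HTTP::Cookies;
        use HTTP::Request;

        my $ua = LWP::UserAgent->new;

        # Pretend to be a well-known browser instead of LWP's default agent string.
        $ua->agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");

        # Accept cookies from the first response and send them back on later requests.
        $ua->cookie_jar(HTTP::Cookies->new(file => "lwp_cookies.dat", autosave => 1));

        my $res = $ua->request(HTTP::Request->new(GET => "http://biz.yahoo.com/rf/archive.html"));
        print $res->status_line, "\n";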

    Ah. You posted code. Good. Try adding $ua->agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"); to pretend that your script is a real browser.

      I have disabled images and set the browser to refuse cookies for this page. I am still able to access it repeatedly through the browser.

      I assume my LWP is only supplying the default value for the browser, since I do not yet know how to specify otherwise. Do you have easy access to the code to do that? If not, I can go find out myself and test it.

      Thanks for the feedback. Update: just got the code. I will go try it.

        To set the user agent:
        $ua->agent("foo/0.42");
      I just ran it with the

      $ua->agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");

      It now completes (although not nearly as fast as when it runs after the program hasn't been run in a while), but it doesn't retrieve any links. It prints out "Links: " and that's it.

        It now completes ... but it doesn't retrieve any links.

        Time to print out what they are returning. Chances are it's some variant on

        <html><body>Gotcha!</body></html>
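        One way to do that, as a rough debugging sketch rather than part of the original script (the agent string is the one suggested above):

            use LWP::UserAgent;
            use HTTP::Request;

            my $ua = LWP::UserAgent->new;
            $ua->agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");

            my $res = $ua->request(HTTP::Request->new(GET => "http://biz.yahoo.com/rf/archive.html"));

            # as_string dumps the status line, response headers, and body in one go,
            # which makes a redirect, an error page, or a "Gotcha!" body easy to spot.
            print $res->as_string;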
Re: CGI to Pull links off webpage fails on second run
by MrCromeDome (Deacon) on Apr 10, 2003 at 21:06 UTC
    Could be any number of things. It's really hard to say without some code to look at (nudge nudge)...

    MrCromeDome

      A little code ... this is just the link retrieval section, but it alone will not run twice in a row (unless 30+ minutes pass between runs).

      use LWP::UserAgent;
      use HTML::LinkExtor;
      use HTTP::Request;
      use URI::URL;

      $url = "http://biz.yahoo.com/rf/archive.html";
      $ua  = LWP::UserAgent->new;

      # Set up a callback that collects links
      my @links = ();
      sub callback {
          my ($tag, %attr) = @_;
          return if $tag ne 'a';   # only look closer at written documents, not images
          push(@links, values %attr);
      }

      # Make the parser.
      $p = HTML::LinkExtor->new(\&callback);

      # Request the document and parse it as it arrives
      $res = $ua->request(HTTP::Request->new(GET => $url),
                          sub { $p->parse($_[0]) });

      # Expand all URLs to absolute ones
      my $base = $res->base;
      @links = map { $_ = url($_, $base)->abs; } @links;

      print "Links: <P>@links<p>";
      exit;
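      If dws's user-agent suggestion above is applied to this snippet, the line goes right after the UserAgent constructor; a tiny sketch (the agent string is dws's example, not something the site is known to require):

          $ua = LWP::UserAgent->new;
          $ua->agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");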
Re: CGI to Pull links off webpage fails on second run
by The Mad Hatter (Priest) on Apr 10, 2003 at 22:02 UTC
    It would probably be a good idea to check what data you are getting back. To test, take out the link extractor and just print the raw HTML you get back from the request. I'd try that with both user agents (the default and the MSIE one suggested above) and see what gets printed.

    Don't know if this is any indication of you being blocked, but using the code you posted, I can run the script successfully at least 5 or 6 times in a row (that's all I tried).
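    A minimal version of that test might look like the following sketch (the MSIE string is the one from the earlier reply; the rest is just scaffolding for the comparison):

        use LWP::UserAgent;
        use HTTP::Request;

        my $url = "http://biz.yahoo.com/rf/archive.html";

        # Fetch the page once with LWP's default agent string and once with the
        # browser-like string, and print the raw response each time for comparison.
        for my $agent (undef, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)") {
            my $ua = LWP::UserAgent->new;
            $ua->agent($agent) if defined $agent;

            my $res = $ua->request(HTTP::Request->new(GET => $url));
            print "=== agent: ", (defined $agent ? $agent : "LWP default"), " ===\n";
            print $res->status_line, "\n";
            print $res->content, "\n";
        }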