cdherold has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I'm stymied once again ... I'm pulling links off a webpage at http://biz.yahoo.com/rf/archive.html. For a long time (about a year) I had a cronjob set to go to this page every hour and retrieve all the links and check the link contents.

All of a sudden the program stopped working, so I shut it down. Now I'm going back to see what the problem might be. When I run the program after it hasn't run for hours or days, it pulls the links down fine and completes its run. But when I try to run it twice in a row (even with a 10 minute break in between), it won't go the second time. Is the website keeping track of my program and blocking it from repeated visits somehow? If so, how would I fix this? Any other ideas about what might be happening?

Muchas gracias, Monks. I appreciate all your help.

chris


Re: CGI to Pull links off webpage fails on second run
by dws (Chancellor) on Apr 10, 2003 at 21:18 UTC
    Is the website keeping track of my program and blocking it from repeated visits somehow?

    If you can hit this site twice in a row with your browser, then chances are good that they're doing one of several things:

    • The site might be issuing a cookie on your first request, recording your visit, then noticing that you're not supplying the cookie on a subsequent request. Is your script detecting any cookies? (See the sketch after this list for one way to handle them.)
    • The site might be noticing that you request the first page, but not images (including ads). The site then "flunks" you on subsequent requests for the home page. Some sites have gotten aggressive about making sure that their ads are seen. Try having your script fetch (and quietly ignore) images, .js files, .css files, etc., so that your script behaves more like a browser. Alternatively, try setting your browser to not fetch images, and see if you notice different behavior through the browser.
    • They might be throttling based on browser type. Is your script pretending to be a well-known browser, or are you letting LWP supply a default value? (The sketch after this list also sets an explicit agent string.)
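    For the cookie and user-agent points, a minimal LWP sketch follows. This is not the original script: the cookie file name is just a placeholder, and the agent string is only an example of a browser-like value.

        use LWP::UserAgent;
        use HTTP::Cookies;
        use HTTP::Request;

        my $ua = LWP::UserAgent->new;

        # Pretend to be a well-known browser instead of LWP's default agent string.
        $ua->agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");

        # Accept cookies from the first response and send them back on later requests.
        $ua->cookie_jar(HTTP::Cookies->new(file => "lwp_cookies.dat", autosave => 1));

        my $res = $ua->request(HTTP::Request->new(GET => "http://biz.yahoo.com/rf/archive.html"));
        print $res->status_line, "\n";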

    Ah. You posted code. Good. Try adding $ua->agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"); to pretend that your script is a real browser.

      I have disabled images and set the browser to refuse cookies for this page. I am still able to access it repeatedly through the browser.

      I assume my LWP is only supplying the default value for the browser, since I do not yet know how to specify otherwise. Do you have easy access to the code to do that? If not, I can go find out myself and test it.

      Thanks for the feedback. Update: just got the code. I will go try it.

        To set the user agent:
        $ua->agent("foo/0.42");
      I just ran it with the

      $ua->agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");

      It now completes (although not nearly as fast as when it runs after the program hasn't been run in a while), but it doesn't retrieve any links. It prints out "Links: " and that's it.

        It now completes ... but it doesn't retrieve any links.

        Time to print out what they are returning. Chances are it's some variant on

        <html><body>Gotcha!</body></html>
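        One way to do that, as a rough debugging sketch rather than part of the original script (the agent string is the one suggested above):

            use LWP::UserAgent;
            use HTTP::Request;

            my $ua = LWP::UserAgent->new;
            $ua->agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");

            my $res = $ua->request(HTTP::Request->new(GET => "http://biz.yahoo.com/rf/archive.html"));

            # as_string dumps the status line, response headers, and body in one go,
            # which makes a redirect, an error page, or a "Gotcha!" body easy to spot.
            print $res->as_string;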
Re: CGI to Pull links off webpage fails on second run
by MrCromeDome (Deacon) on Apr 10, 2003 at 21:06 UTC
    Could be any number of things. It's really hard to say without some code to look at (nudge nudge)...

    MrCromeDome

      A little code ... this is just the link retrieval section, but it alone will not run twice in a row (unless 30+ minutes pass between runs).

      use LWP::UserAgent;
      use HTML::LinkExtor;
      use HTTP::Request;
      use URI::URL;

      $url = "http://biz.yahoo.com/rf/archive.html";
      $ua  = LWP::UserAgent->new;

      # Set up a callback that collects links
      my @links = ();
      sub callback {
          my ($tag, %attr) = @_;
          return if $tag ne 'a';   # only look closer at written documents, not images
          push(@links, values %attr);
      }

      # Make the parser.
      $p = HTML::LinkExtor->new(\&callback);

      # Request the document and parse it as it arrives
      $res = $ua->request(HTTP::Request->new(GET => $url),
                          sub { $p->parse($_[0]) });

      # Expand all URLs to absolute ones
      my $base = $res->base;
      @links = map { $_ = url($_, $base)->abs; } @links;

      print "Links: <P>@links<p>";
      exit;
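      If dws's user-agent suggestion above is applied to this snippet, the line goes right after the UserAgent constructor; a tiny sketch (the agent string is dws's example, not something the site is known to require):

          $ua = LWP::UserAgent->new;
          $ua->agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");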
Re: CGI to Pull links off webpage fails on second run
by The Mad Hatter (Priest) on Apr 10, 2003 at 22:02 UTC
    It would probably be a good idea to check what data you are getting back. To test, take out the link extractor and just print the raw HTML you get back from the request. I'd try that with both user agents (the default and the MSIE one suggested above) and see what gets printed.

    Don't know if this is any indication of you being blocked, but using the code you posted, I can run the script successfully at least 5 or 6 times in a row (that's all I tried).
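    A minimal version of that test might look like the following sketch (the MSIE string is the one from the earlier reply; the rest is just scaffolding for the comparison):

        use LWP::UserAgent;
        use HTTP::Request;

        my $url = "http://biz.yahoo.com/rf/archive.html";

        # Fetch the page once with LWP's default agent string and once with the
        # browser-like string, and print the raw response each time for comparison.
        for my $agent (undef, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)") {
            my $ua = LWP::UserAgent->new;
            $ua->agent($agent) if defined $agent;

            my $res = $ua->request(HTTP::Request->new(GET => $url));
            print "=== agent: ", (defined $agent ? $agent : "LWP default"), " ===\n";
            print $res->status_line, "\n";
            print $res->content, "\n";
        }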