gebelo has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a scraping project and have successfully created a loop that searches a Web page using a list of keywords, one by one, and then saves the results to a file, clicking the "next page" button when the results don't fit on a single page.
BUT... the script always stops before it gets through all the records. This error message would be appended to the dump file:

    500 Connect failed: connect: Unknown error; Unknown error

So I took a stab at using the $mech->success method and managed to eliminate the error message completely, but the script still crashes. What I think I need to do is, after every click, check whether the response is a valid Web page and, if not, hit reload. Here's my code:
    open fileOUT, ">> searchresults.htm";
    print fileOUT $ua->response->content;
    close fileOUT;

    # loop through the rest
    while ($ua->response->content =~ m/nextbut/i) {
        $ua->form_name( 'nextbut' );
        $ua->click;
        die "I'm failing ", $ua->reload unless $ua->success;
        sleep 15;
        open(fileOUT, ">>searchresults.htm");
        print fileOUT $ua->response->content;
        close(fileOUT);
    }
Any ideas -- up to and including an entirely different approach to the problem?

Replies are listed 'Best First'.
Re: What's the best way to use $mech->success?
by tlm (Prior) on Mar 28, 2005 at 21:52 UTC
    $ua->click; die "I'm failing ", $ua->reload unless $ua->success; sleep 15;

    What are you up to with that? It looks to me like the script will die (or "crash" as you put it) the first time that $ua->success fails to return true. Is that what you want? From your description it sounds like what you want is something more like

    while ($ua->response->content =~ m/nextbut/i) {
        until ( $ua->click->is_success ) {
            warn "Submission failed: I'll sleep for a while";
            sleep 15;
            $ua->reload;
        }
        # etc., etc.
    }
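
One way to keep a retry loop like this from spinning forever is to cap the number of attempts. The following is only a sketch: retry_until_ok is a hypothetical helper (not part of WWW::Mechanize), and the attempt limit and delay are assumed values. It takes two code refs, one for the first attempt and one for every retry, so the "click once, then reload" idea can be expressed without hard-coding the Mechanize calls.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WWW::Mechanize.
# $first : code ref for the initial attempt, returns true on success
# $again : code ref for each later attempt, returns true on success
# Returns the number of the attempt that succeeded, or 0 if all failed.
sub retry_until_ok {
    my ($first, $again, $max_tries, $delay) = @_;
    for my $try (1 .. $max_tries) {
        my $ok = $try == 1 ? $first->() : $again->();
        return $try if $ok;
        warn "Attempt $try of $max_tries failed\n";
        sleep $delay if $delay && $try < $max_tries;   # no pause after the last try
    }
    return 0;
}
```

With a Mechanize object this might be called as retry_until_ok(sub { $ua->click->is_success }, sub { $ua->reload->is_success }, 5, 15) or die "giving up"; which assumes click and reload both return a response object. Recent WWW::Mechanize versions die on a failed request by default, so the object would need to be built with autocheck => 0 for the checks to be reached.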

    the lowliest monk

Re: What's the best way to use $mech->success?
by cbrandtbuffalo (Deacon) on Mar 28, 2005 at 21:24 UTC
    $mech->status gives you the HTTP status code returned. You could probably see more clearly what was happening by checking this code.

    The simple test is that a 200 is a good page and everything else is an error of some sort. You could then do different things based on the different errors.
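
To make "different things based on the different errors" concrete, here is a minimal sketch of sorting status codes into actions. The function name classify_status and the three categories are assumptions made for illustration, as is the choice of which codes count as worth retrying; the original thread does not specify any of this.

```perl
use strict;
use warnings;

# Hypothetical helper: map an HTTP status code to a suggested action.
# The grouping is an assumption, not something WWW::Mechanize defines.
sub classify_status {
    my ($code) = @_;
    return 'ok'      if $code >= 200 && $code < 300;   # page arrived
    return 'retry'   if $code >= 500                   # server or connection trouble
                     || $code == 408                   # request timeout
                     || $code == 429;                  # too many requests
    return 'give_up';                                  # e.g. 404: retrying won't help
}

print classify_status(500), "\n";   # the code seen in the "500 Connect failed" message
```

In the scraping loop, the result of classify_status($ua->status) could decide whether to save the page, sleep and reload, or skip to the next keyword.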