in reply to web scraping script help

I started out by going to the site and looking it over. There are a couple of situations that your code doesn't handle well:
  1. If there's an error on the site, it throws up an "oops" page; usually reloading the URL will get it back on track.
  2. If you follow the "more" link, from time to time a "featured" entry is thrown in (probably to mess with automated scripts like this, frankly).
In addition, the parsing subroutine calls itself recursively to move on to the next page. There's nothing wrong with that, but since there's only one call to it, I decided to restructure the sub as an inline loop instead.
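Here's a rough sketch of the difference I mean; the sub names and the page count are made up for illustration, not taken from your code:

    use strict;
    use warnings;

    # Recursive style: the sub advances to the next page by calling itself.
    sub fetch_recursive {
        my ($off, $max) = @_;
        return if $off >= $max;            # stands in for "no more pages"
        print "fetch page at offset $off\n";
        fetch_recursive($off + 1, $max);
    }

    # Inline style: exactly the same flow, written as a plain loop.
    sub fetch_inline {
        my ($max) = @_;
        for (my $off = 0; $off < $max; $off++) {
            print "fetch page at offset $off\n";
        }
    }

    fetch_recursive(0, 3);
    fetch_inline(3);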

I also looked at the URLs. It turns out that the paging mechanism simply adds an off= parameter whose value is the offset of the next hundred entries. When there aren't any more, a "No Contributions Found" message is displayed. So all you really have to do is keep generating URLs that match the pattern and stop when you see the "No Contributions" message.
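Stripped of the error handling, that paging loop boils down to something like this; page_text here is a fake stand-in I made up so the sketch runs without actually hitting the site:

    use strict;
    use warnings;

    # Fake stand-in for the real fetch: pretend the fourth page comes back empty.
    sub page_text {
        my ($url) = @_;
        return $url =~ /off=300/ ? "No Contributions Found" : "...table rows...";
    }

    my $base = "http://fundrace.huffingtonpost.com/neighbors.php?type=name&lname=SMITH";
    for (my $off = 0; ; $off++) {
        my $url  = "$base&off=" . ($off * 100);
        my $text = page_text($url);
        last if $text =~ /No Contributions Found/;   # stop when the site runs out of entries
        print "would scrape $url\n";
    }

The full script, with the retry and "featured" handling added back in, follows.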

use strict;
use warnings;
use diagnostics;
use HTML::TableExtract;
use WWW::Mechanize;
use Time::HiRes;

$|++;

my $huffdata = "huff_data.txt";
open (my $fh, "+>>", "$huffdata") or die "unable to open $huffdata $!";

my $mech = WWW::Mechanize->new;

# Skeleton URL
my $url = "http://fundrace.huffingtonpost.com/neighbors.php?type=name&lname=SMITH";

my $pagecount = 0;
my $off = 0;

print "Page ";

SLURP:
while (1) {
    sleep(rand(10));
    my $text;

    # Build URL for the currently-required page. off=0 works fine.
    my $current_url = "$url&off=" . ($off * 100);

    # Stay in the loop until we've successfully gotten the page or died.
    my $need_to_get = 1;
    while ($need_to_get) {
        $mech->get($current_url);
        die "Failed to get $current_url on page $pagecount: @{[$mech->status]}"
            unless $mech->success;

        # We got another page.
        $pagecount++;
        print "$pagecount ...";
        $text = $mech->content;

        # Successfully ran out of entries. Blow out of BOTH loops.
        if ($text =~ /No Contributions Found/) {
            last SLURP;
        }

        # Hiccup at site. Try this one again.
        if ($text =~ /An error occurred in processing your request/sm) {
            print "(oops)";
            next;
        }

        # Try to parse the table. Reload if this fails. (Takes care of "featured".)
        my $te;
        eval {
            $te = HTML::TableExtract->new( headers => [qw(Donor Contribution Address)] );
            $te->parse($text);
        };
        if ($@) {
            print "(parse failed: $@) ";
            next;
        }
        my @rows = eval { $te->rows };
        if ($@) {
            print "(extract failed: $@) ";
            next;
        }

        # Add a newline to make sure the entries are actually separated.
        foreach my $row ($te->rows) {
            print $fh join(",", @$row), "\n";
        }

        # Done with this one; drop out of the retry loop.
        $need_to_get = 0;
    }

    # Move up to the next page.
    $off++;
}
This worked in a test on my machine. I removed the "C:" because my Mac didn't like it.

Re^2: web scraping script help
by Perl_Necklace (Initiate) on Sep 19, 2007 at 16:10 UTC
    Thanks, pemungkah, for your suggestions. I also added a foreach loop around everything to iterate through a names array and build the URL for each name. It seems to work thus far.
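    Roughly like this; the names in @names are just made-up examples, and the SLURP while loop from above goes where the print is:

        use strict;
        use warnings;

        my @names = qw(SMITH JONES NGUYEN);   # made-up example names
        foreach my $name (@names) {
            my $url = "http://fundrace.huffingtonpost.com/neighbors.php?type=name&lname=$name";
            print "run the scraper loop against $url\n";   # the SLURP loop goes here, using $url as the skeleton URL
        }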