I started out by going to the site and looking it over. There are a couple of situations that your code doesn't handle well:
  1. If there's an error on the site, it throws up an "oops" page; usually reloading the URL will get it back on track.
  2. If you follow the "more" link, from time to time a "featured" entry is thrown in (probably to frustrate automated scripts like this, frankly).
In addition, the parsing subroutine recursively calls itself to move on to the next page. There's nothing wrong with that, but since there's only one call to it, I decided to restructure the sub as inline code in the script.
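Roughly, that's the difference between these two shapes (a toy sketch only; the page count and print statements are stand-ins for the real fetching and parsing):

use strict;
use warnings;

my $last_page = 3;    # hypothetical, just so the sketch runs

# Recursive shape: the sub does one page, then calls itself for the next one.
sub scrape_page {
    my ($page) = @_;
    print "fetch and parse page $page\n";    # placeholder for the real work
    scrape_page($page + 1) if $page < $last_page;
}
scrape_page(1);

# Inline-loop shape used below: the same work, no sub, no recursion.
for my $page (1 .. $last_page) {
    print "fetch and parse page $page\n";    # placeholder for the real work
}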

I also looked at the URLs. It turns out the paging mechanism simply adds an off= parameter that advances through the entries a hundred at a time. When there aren't any more, a "No Contributions Found" message is displayed. So all you really have to do is keep generating URLs that match the pattern and then stop when you see the "No Contributions" message.
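Stripped down to just that idea, the paging loop is nothing more than this (a minimal sketch using the same skeleton URL; I'm assuming off= really is a plain entry offset counted in hundreds, which is what the listing appears to do):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
my $base = "http://fundrace.huffingtonpost.com/neighbors.php?type=name&lname=SMITH";

for (my $off = 0; ; $off += 100) {
    $mech->get("$base&off=$off");
    last unless $mech->success;                          # stop if the fetch itself fails
    last if $mech->content =~ /No Contributions Found/;  # ran off the end of the listing
    # ... parse this page's table here ...
}

The full script below folds this into a retry loop so that a transient "oops" page doesn't end the run early.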

use strict;
use warnings;
use diagnostics;
use HTML::TableExtract;
use WWW::Mechanize;
use Time::HiRes qw(sleep);    # import sleep so the fractional sleeps actually work
$|++;

my $huffdata = "huff_data.txt";
open(my $fh, "+>>", $huffdata) or die "unable to open $huffdata: $!";
my $mech = WWW::Mechanize->new;

# Skeleton URL
my $url = "http://fundrace.huffingtonpost.com/neighbors.php?type=name&lname=SMITH";

my $pagecount = 0;
my $off       = 0;
print "Page ";

SLURP:
while (1) {
    sleep(rand(10));
    my $text;

    # Build URL for the currently-required page. off=0 works fine.
    my $current_url = "$url&off=" . ($off * 100);

    # Stay in the loop until we've successfully gotten the page or died.
    my $need_to_get = 1;
    while ($need_to_get) {
        $mech->get($current_url);
        die "Failed to get $current_url on page $pagecount: @{[$mech->status]}"
            unless $mech->success;

        # We got another page.
        $pagecount++;
        print "$pagecount ...";
        $text = $mech->content;

        # Successfully ran out of entries. Blow out of BOTH loops.
        last SLURP if $text =~ /No Contributions Found/;

        # Hiccup at the site. Try this one again.
        if ($text =~ /An error occurred in processing your request/sm) {
            print "(oops)";
            next;
        }

        # Try to parse the table. Reload if this fails (takes care of "featured").
        my $te;
        eval {
            $te = HTML::TableExtract->new(
                headers => [qw(Donor Contribution Address)] );
            $te->parse($text);
        };
        if ($@) {
            print "(parse failed: $@) ";
            next;
        }
        my @rows = eval { $te->rows };
        if ($@) {
            print "(extract failed: $@) ";
            next;
        }

        # Add a newline to make sure the entries are actually separated.
        foreach my $row (@rows) {
            print $fh join(",", @$row), "\n";
        }

        # Done with this one; drop out of the retry loop.
        $need_to_get = 0;
    }

    # Move up to the next page.
    $off++;
}
This worked in a test on my machine. I removed the "C:" from the output file path because my Mac didn't like it.
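One caveat about the output: join(",", @$row) produces ambiguous lines whenever a donor name or address itself contains a comma. If that matters for whatever reads huff_data.txt later, the write step could use Text::CSV instead; a minimal sketch, assuming the module is installed and using the $fh and @rows already in the script:

use Text::CSV;

# binary => 1 copes with odd characters; eol => "\n" appends the newline for us
my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
    or die "Cannot construct Text::CSV: " . Text::CSV->error_diag;

foreach my $row (@rows) {
    $csv->print($fh, $row) or die "CSV write failed: " . $csv->error_diag;
}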
