Perl_Necklace has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow seekers. First-time poster here. I need help figuring out why this sub fails with an undefined value after 5 loops. The sub should hit a web page, parse the content of the tables, find the link at the bottom whose text is "more", and keep iterating until that "more" link no longer exists. There are 10 pages it should be grabbing, but I get a TableExtract error after the 5th page. Data is written to my file successfully up to that point.
use strict;
use warnings;
use diagnostics;
use HTML::TableExtract;
use WWW::Mechanize;
use Time::HiRes;

my $random   = rand(10);
my $huffdata = "C:/huff_data.txt";
open (MYFILE, "+>>", "$huffdata") or die "unable to open $huffdata $!";

my $url = "http://fundrace.huffingtonpost.com/neighbors.php?type=name&lname=SMITH";

sub parse_and_save {
    sleep($random);
    my $mech = WWW::Mechanize->new;
    $mech->get($url);
    my $text = $mech->content;

    my $te = HTML::TableExtract->new( headers => [qw(Donor Contribution Address)] );
    $te->parse($text);

    my $row;
    foreach $row ($te->rows) {
        print MYFILE join(",", @$row);
    }

    my @links = $mech->find_link( text_regex => qr/more/i )
        or die "no links found";
    for (@links) {
        $url = $_->url_abs($/);
        parse_and_save();
    }
}

parse_and_save();

Re: web scraping script help
by pemungkah (Priest) on Sep 18, 2007 at 21:25 UTC
    I started out by going to the site and looking it over. There are a couple situations that your code doesn't handle well:
    1. If there's an error on the site, it throws up an "oops" page; usually reloading the URL will get it back on track.
    2. If you follow the "more" link, a "featured" entry is thrown in from time to time (probably to mess with automated scripts like this, frankly).
    In addition, the parsing subroutine calls itself recursively to move on to the next page. There's nothing wrong with that, but I decided to restructure the sub as an inline script, since there's only one call to it.

    I also looked at the URLs. It turns out the paging mechanism simply adds an off= parameter that advances a hundred entries at a time. When there aren't any more, a "No Contributions Found" message is displayed. So all you really have to do is keep generating URLs that match the pattern and stop when you see the "No Contributions" message.

    use strict;
    use warnings;
    use diagnostics;
    use HTML::TableExtract;
    use WWW::Mechanize;
    use Time::HiRes;

    $|++;

    my $huffdata = "huff_data.txt";
    open (my $fh, "+>>", "$huffdata") or die "unable to open $huffdata $!";

    my $mech = WWW::Mechanize->new;

    # Skeleton URL
    my $url = "http://fundrace.huffingtonpost.com/neighbors.php?type=name&lname=SMITH";

    my $pagecount = 0;
    my $off = 0;

    print "Page ";

    SLURP:
    while (1) {
        sleep(rand(10));
        my $text;

        # Build URL for the currently-required page. off=0 works fine.
        my $current_url = "$url&off=" . ($off*100);

        # Stay in the loop until we've successfully gotten the page or died.
        my $need_to_get = 1;
        while ($need_to_get) {
            $mech->get($current_url);
            die "Failed to get $current_url on page $pagecount: @{[$mech->status]}"
                unless $mech->success;

            # We got another page.
            $pagecount++;
            print "$pagecount ...";
            $text = $mech->content;

            # Successfully ran out of entries. Blow out of BOTH loops.
            if ($text =~ /No Contributions Found/) {
                last SLURP;
            }

            # Hiccup at site. Try this one again.
            if ($text =~ /An error occurred in processing your request/sm) {
                print "(oops)";
                next;
            }

            # Try to parse the table. Reload if this fails. (Takes care of "featured").
            my $te;
            eval {
                $te = HTML::TableExtract->new( headers => [qw(Donor Contribution Address)] );
                $te->parse($text);
            };
            if ($@) {
                print "(parse failed: $@) ";
                next;
            }

            my @rows = eval { $te->rows };
            if ($@) {
                print "(extract failed: $@) ";
                next;
            }

            # Add a newline to make sure the entries are actually separated.
            foreach my $row ($te->rows) {
                print $fh join(",", @$row), "\n";
            }

            # Done with this one; drop out of the retry loop.
            $need_to_get = 0;
        }

        # Move up to the next page.
        $off++;
    }
    This worked in a test on my machine. I removed the "C:" from the filename because my Mac didn't like it.
      Thanks, pemungkah, for your suggestions. I also wrapped everything in a foreach loop that iterates through an array of names to build the URL for each one. It seems to work so far.
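      A minimal sketch of that outer loop, assuming a hypothetical @names list (the real names aren't shown here); the inner paging loop is the one from the reply above:

      use strict;
      use warnings;

      # Hypothetical list of last names to search; substitute the real array.
      my @names = qw(SMITH JONES WILLIAMS);

      foreach my $lname (@names) {
          # Rebuild the skeleton URL for this name.
          my $url = "http://fundrace.huffingtonpost.com/neighbors.php?type=name&lname=$lname";
          my $off = 0;

          # The SLURP paging loop would go here, using $url and $off as before.
          # For illustration, just show the first page URL that would be fetched:
          print "Would fetch: $url&off=" . ($off * 100), "\n";
      }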
Re: web scraping script help
by n8g (Sexton) on Sep 18, 2007 at 20:28 UTC
    What error is produced? If it works on the first few pages, I'd suspect an HTML issue on the page in question is causing the problem.
      This is the error:
      Can't call method "rows" on an undefined value at C:/Perl/site/lib/HTML/TableExtract.pm line 224 (#1)
          (F) You used the syntax of a method call, but the slot filled by the
          object reference or package name contains an undefined value. Something
          like this will reproduce the error:

              $BADREF = undef;
              process $BADREF 1,2,3;
              $BADREF->process(1,2,3);

      Uncaught exception from user code:
          Can't call method "rows" on an undefined value at C:/Perl/site/lib/HTML/TableExtract.pm line 224.
       at C:/Perl/site/lib/HTML/TableExtract.pm line 224
          HTML::TableExtract::rows('HTML::TableExtract=HASH(0x23c6ff0)') called at C:\parse_and_save.pl line 21
          main::parse_and_save() called at C:\Documents and Settings\My Documents\perl\scripts\parse_and_save.pl line 28
          main::parse_and_save() called at C:\Documents and Settings\My Documents\perl\scripts\parse_and_save.pl line 28
          main::parse_and_save() called at C:\Documents and Settings\My Documents\perl\scripts\parse_and_save.pl line 28
          main::parse_and_save() called at C:\Documents and Settings\My Documents\perl\scripts\parse_and_save.pl line 28
          main::parse_and_save() called at C:\Documents and Settings\My Documents\perl\scripts\parse_and_save.pl line 28
          main::parse_and_save() called at C:\Documents and Settings\My Documents\perl\scripts\parse_and_save.pl line 28
          main::parse_and_save() called at C:\Documents and Settings\My Documents\perl\scripts\parse_and_save.pl line 32
        You may want to add a check to make sure a table was actually found before you try to read its rows.
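        A minimal sketch of that check, assuming $text already holds the fetched page (the placeholder below has no matching table, so the else branch runs). HTML::TableExtract's first_table_found returns undef when no table matched the requested headers, which is exactly when rows blows up:

        use strict;
        use warnings;
        use HTML::TableExtract;

        # Placeholder page content; in the real script this would come from $mech->content.
        my $text = '<html><body><p>No Contributions Found</p></body></html>';

        my $te = HTML::TableExtract->new( headers => [qw(Donor Contribution Address)] );
        $te->parse($text);

        # rows() fails when no table matched the headers, so test for a table first.
        if ( $te->first_table_found ) {
            foreach my $row ( $te->rows ) {
                print join( ",", @$row ), "\n";
            }
        }
        else {
            warn "No matching table on this page; skipping it\n";
        }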