SiteScraper has asked for the wisdom of the Perl Monks concerning the following question:
    <table class="dextable" align="center">
      <tr>
        <td class="fooevo">ID No.</td>
        <td class="fooevo">Picture</td>
        <td class="fooevo">Pokémon Name</td>
        <td class="fooevo">Rarity</td>
        <td class="fooevo">Movement</td>
        <td class="fooevo">Material Cost</td>
      </tr>
      <tr>
        <td class="cen">ID - 26</td>
        <td class="cen"><a href="figures/26-poliwag.shtml"><img src="/duel/figures/th/26.jpg" alt="Poliwag" title="Poliwag" border="0" /></a></td>
        <td class="fooinfo"><a href="figures/26-poliwag.shtml"><u>Poliwag</u></a></td>
        <td class="cen"><img src="/duel/c.png" /> C</td>
        <td class="cen">3</td>
        <td class="fooinfo"><img src="/duel/material.png" />250</td>
      </tr>

I need to extract the URL in the href attribute found in the 2nd td element, as well as the text of the title attribute found in that same (2nd) td element. The code I have written for this is shown below:

    #!/usr/bin/perl -w
    use URI;
    use Web::Scraper;
    use Encode;

    # First, create your scraper block
    my $p1 = scraper {
        process 'table[class="dextable"] td[class="cen"]', "list[]" => scraper {
            # And, in each td,
            # get the URI of "a" element
            process_first "a", uri => '@href';
            # get text inside "u" element
            process_first "a", name => '@title';
        };
    };

    my $res = $p1->scrape( URI->new("http://serebii.net/duel/figures.shtml") );

    for my $p (@{$res->{list}}) {
        print Encode::encode("utf8", "$p->{name}\t$p->{uri}\n");
    }

The code shown above does not work. It prints the URL correctly but not the name. It also seems to pick up other td elements that have no nested <a> element, and for those td elements it displays an error as well. So, to summarize, I guess I'm looking for answers to two questions:
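For anyone following along, here is a minimal sketch of one possible fix. It assumes the markup quoted in the question is representative of the real page. In that sample the title attribute sits on the img nested inside the link, not on the a element itself, which would explain the missing name. Cells with class "cen" that contain no link come back as empty hashes, so the sketch filters them out instead of printing them. The names `$figures` and `@found`, and the inline `$html` sample, are my own and not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Markup trimmed from the sample in the question (assumption: the live
# page follows the same structure).
my $html = <<'HTML';
<table class="dextable" align="center">
<tr>
  <td class="fooevo">ID No.</td>
  <td class="fooevo">Picture</td>
</tr>
<tr>
  <td class="cen">ID - 26</td>
  <td class="cen"><a href="figures/26-poliwag.shtml"><img src="/duel/figures/th/26.jpg" alt="Poliwag" title="Poliwag" border="0" /></a></td>
  <td class="cen"><img src="/duel/c.png" /> C</td>
  <td class="cen">3</td>
</tr>
</table>
HTML

my $figures = scraper {
    # Visit every "cen" cell, as the code in the question does.
    process 'table.dextable td.cen', 'list[]' => scraper {
        # The link carries the URL ...
        process_first 'a', uri => '@href';
        # ... but the title attribute is on the image nested inside the link.
        process_first 'a img', name => '@title';
    };
};

# scrape() also accepts a reference to an HTML string, so no network is needed.
my $res = $figures->scrape(\$html);

# Cells without a nested link yield an empty hash; keep only complete entries.
my @found = grep { defined $_->{uri} && defined $_->{name} }
            @{ $res->{list} || [] };

print "$_->{name}\t$_->{uri}\n" for @found;
```

When scraping the live page through `URI->new(...)` as in the question, Web::Scraper should, as far as I know, resolve the href against the page address, so `uri` would come back absolute instead of relative as it does here.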
Re: Using Web::Scraper to extract content from an HTML page
by tangent (Parson) on Apr 04, 2017 at 01:02 UTC

Re: Using Web::Scraper to extract content from an HTML page
by beech (Parson) on Apr 03, 2017 at 22:58 UTC
by SiteScraper (Initiate) on Apr 03, 2017 at 23:32 UTC
by beech (Parson) on Apr 04, 2017 at 00:24 UTC

Re:Using Web::Scraper to extract content from an HTML page
by SiteScraper (Initiate) on Apr 04, 2017 at 21:17 UTC