I am attempting to extract some data from an HTML page using the Web::Scraper module. The HTML looks as shown below:
<table class="dextable" align="center">
<tr>
<td class="fooevo">ID No.</td>
<td class="fooevo">Picture</td>
<td class="fooevo">Pokémon Name</td>
<td class="fooevo">Rarity</td>
<td class="fooevo">Movement</td>
<td class="fooevo">Material Cost</td>
</tr>
<tr>
<td class="cen">ID - 26</td>
<td class="cen"><a href="figures/26-poliwag.shtml"><img src="/duel/figures/th/26.jpg" alt="Poliwag" title="Poliwag" border="0" /></a></td>
<td class="fooinfo"><a href="figures/26-poliwag.shtml"><u>Poliwag</u></a></td>
<td class="cen"><img src="/duel/c.png" /> C</td>
<td class="cen">3</td>
<td class="fooinfo"><img src="/duel/material.png" />250</td>
</tr>
I need to extract the URL in the href attribute of the <a> element inside the 2nd td, as well as the text of the title attribute of the <img> element nested in that same td.
The code I have written for this is as shown below:
#!/usr/bin/perl -w
use URI;
use Web::Scraper;
use Encode;
# First, create your scraper block
my $p1 = scraper {
    process 'table[class="dextable"] td[class="cen"]', "list[]" => scraper {
        # And, in each td,
        # get the URI of "a" element
        process_first "a", uri => '@href';
        # get text inside "u" element
        process_first "a", name => '@title';
    };
};
my $res = $p1->scrape( URI->new("http://serebii.net/duel/figures.shtml") );
for my $p (@{$res->{list}}) {
    print Encode::encode("utf8", "$p->{name}\t$p->{uri}\n");
}
The code shown above does not work as intended. It prints the URL correctly but not the name. It also picks up td elements that have no nested <a> element, and for those it reports an error. To summarize, I have two questions:
- How do I get the Web::Scraper module to extract the name (the title attribute)?
- How do I get the Web::Scraper module to ignore those td elements without a nested <a> element in them?
Thank you in advance.
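One possible direction, offered as a sketch rather than a confirmed fix: in the sample markup the title attribute sits on the <img> element, not on the <a>, so the name would have to be read from "img". Web::Scraper also accepts an XPath expression in place of a CSS selector, and a predicate such as [a/img] can restrict the match to cells that contain a linked image. The code below runs against a trimmed copy of the sample HTML instead of the live page; the variable names $figures and @rows are illustrative, and the live page may be structured differently from this sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Trimmed copy of the sample markup from the question (assumption: the
# live page follows the same structure).
my $html = <<'HTML';
<table class="dextable" align="center">
<tr>
<td class="cen">ID - 26</td>
<td class="cen"><a href="figures/26-poliwag.shtml"><img src="/duel/figures/th/26.jpg" alt="Poliwag" title="Poliwag" border="0" /></a></td>
<td class="fooinfo"><a href="figures/26-poliwag.shtml"><u>Poliwag</u></a></td>
<td class="cen"><img src="/duel/c.png" /> C</td>
<td class="cen">3</td>
</tr>
</table>
HTML

my $figures = scraper {
    # The XPath predicate [a/img] keeps only td cells that contain an
    # <a> element wrapping an <img>, so plain-text cells are skipped.
    process '//table[@class="dextable"]//td[@class="cen"][a/img]',
        'list[]' => scraper {
            # href lives on the <a> element
            process_first 'a', uri => '@href';
            # title lives on the <img> element, not on the <a>
            process_first 'img', name => '@title';
        };
};

# scrape() also accepts an HTML string, which avoids a network fetch here.
my $res  = $figures->scrape($html);
my @rows = @{ $res->{list} || [] };

print "$_->{name}\t$_->{uri}\n" for @rows;
```

When scraping the real page, passing the URI object to scrape() as in the original code should let relative href values be resolved against the page address.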