
I am attempting to extract some data from an HTML page using the Web::Scraper module. The HTML looks as shown below:

<table class="dextable" align="center">
<tr>
    <td class="fooevo">ID No.</td>
    <td class="fooevo">Picture</td>
    <td class="fooevo">Pok&eacute;mon Name</td>
    <td class="fooevo">Rarity</td>
    <td class="fooevo">Movement</td>
    <td class="fooevo">Material Cost</td>
</tr>
<tr>
    <td class="cen">ID - 26</td>
    <td class="cen"><a href="figures/26-poliwag.shtml"><img src="/duel/figures/th/26.jpg" alt="Poliwag" title="Poliwag" border="0" /></a></td>
    <td class="fooinfo"><a href="figures/26-poliwag.shtml"><u>Poliwag</u></a></td>
    <td class="cen"><img src="/duel/c.png" /> C</td>
    <td class="cen">3</td>
    <td class="fooinfo"><img src="/duel/material.png" />250</td>
</tr>

I need to extract the URL in the href attribute of the 2nd td element as well as the text of the title attribute of the same (2nd) td element. The code I have written for this is as shown below:

#!/usr/bin/perl -w
use URI;
use Web::Scraper;
use Encode;

# First, create your scraper block
my $p1 = scraper {
    process 'table[class="dextable"] td[class="cen"]', "list[]" => scraper {
      # And, in each td,
      # get the URI of "a" element 
      process_first "a", uri => '@href';
      # get text inside "u" element
      process_first "a", name => '@title';
    };
};

my $res = $p1->scrape( URI->new("http://serebii.net/duel/figures.shtml") );

for my $p (@{$res->{list}}) {
    print Encode::encode("utf8", "$p->{name}\t$p->{uri}\n");
}

The code shown above does not work as intended: it prints the URL correctly, but not the name. It also picks up other td elements that have no nested <a> element, and for those td elements it reports an error. So I have two questions:

How do I get the Web::Scraper module to extract the title attribute (the value I store under the name key)?
How do I get the Web::Scraper module to ignore those td elements without a nested <a> element in them?

Thank you in advance.
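
A minimal sketch of one way both questions might be addressed, offered as an illustration rather than a definitive answer. It assumes the page keeps the structure shown in the sample above, where the title attribute sits on the <img> inside the link and not on the <a> itself. It swaps the CSS selector for an XPath expression (Web::Scraper accepts selectors that start with "/" as XPath) so that only td elements containing a linked image are matched. The inline $html copy and the $figures name are made up for this example, so that it runs without fetching the live page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Inline copy of one data row from the sample, so the sketch runs offline.
# (Assumption: the live page uses the same markup.)
my $html = <<'HTML';
<table class="dextable" align="center">
<tr>
    <td class="cen">ID - 26</td>
    <td class="cen"><a href="figures/26-poliwag.shtml"><img src="/duel/figures/th/26.jpg" alt="Poliwag" title="Poliwag" border="0" /></a></td>
    <td class="fooinfo"><a href="figures/26-poliwag.shtml"><u>Poliwag</u></a></td>
    <td class="cen"><img src="/duel/c.png" /> C</td>
    <td class="cen">3</td>
</tr>
</table>
HTML

my $figures = scraper {
    # XPath instead of CSS: the [a/img] predicate keeps only the cells
    # that wrap a linked thumbnail, so cells without an <a> are skipped.
    process '//table[@class="dextable"]//td[@class="cen"][a/img]',
        'list[]' => scraper {
            process_first 'a',   uri  => '@href';    # href lives on the <a>
            process_first 'img', name => '@title';   # title lives on the <img>
        };
};

# Scraping a string here; pass a URI object instead to fetch the real page.
my $res = $figures->scrape($html);

for my $p ( @{ $res->{list} || [] } ) {
    print "$p->{name}\t$p->{uri}\n";
}
```

If the XPath predicate feels too heavy, keeping the original CSS selector and skipping entries with grep { defined $_->{uri} } before printing should have a similar effect, at the cost of scraping cells that are then thrown away.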

In reply to Using Web::Scraper to extract content from an HTML page by SiteScraper
