If the HTML you are parsing happens to be valid XML (they
call that XHTML these days ;)) then you can use the uber
module XML::Twig:
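use strict;
use warnings;
use XML::Twig;

# Call handler() once for each fully parsed <table> element.
my $t = XML::Twig->new(
    twig_handlers => { table => \&handler },
    pretty_print  => 'indented',
);
$t->parse(\*DATA);

sub handler {
    my ($t, $table) = @_;
    # att() returns undef for a missing attribute, so default before comparing.
    my $border = $table->att('border') // '';
    my $align  = $table->att('align')  // '';
    # print() outputs only this element; flush() would output
    # everything parsed so far, enclosing tags included.
    $table->print if $border eq '0' and $align eq 'center';
}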
__DATA__
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>XML::Twig table extract test</title>
</head>
<body>
<table><tr><td>
<table border="0" align="center">
<tr align="center">
<td colspan="3"><a href="/a.gif"><img src="/a.png" alt="a" />
</a></td>
</tr>
</table>
</td></tr></table>
</body>
</html>
But the HTML you posted is not valid XHTML. I ran it
through HTML Tidy and embedded the result inside another
table for testing purposes. You can always fetch the web
page and call HTML Tidy externally, or you can install
XML::LibXML and use the technique merlyn presents in
"HTML tidy, using XML::LibXML" to clean up the HTML you
have to parse.
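For what it's worth, here is a minimal sketch of the XML::LibXML route. It is not merlyn's code, just my rough take on the same idea, and the messy HTML string is made up for illustration. It assumes a version of XML::LibXML recent enough to have load_html:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical tag soup: unquoted attributes, unclosed <td>, <tr> and <img>.
my $html = <<'HTML';
<html><body>
<table border=0 align=center>
<tr><td><img src=/a.png alt=a>
</table>
</body></html>
HTML

# recover => 1 asks libxml2's HTML parser to carry on past errors;
# suppress_errors => 1 keeps it quiet about them.
my $doc = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);

# The repaired tree can be queried directly with XPath ...
my @tables = $doc->findnodes('//table[@border="0"][@align="center"]');
print scalar(@tables), " matching table(s)\n";

# ... or serialized as well-formed XML to feed XML::Twig or XML::XPath.
my $xml = $tables[0]->toString;
print "$xml\n";
```

The serialized table should come out with quoted attributes and closed tags, which is what the XML modules need.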
So, why use something like XML::Twig instead of an
HTML::Parser? Because you are extracting a whole subset
of the HTML instead of individual tags or text. Another good
candidate module for this kind of work is XML::XPath. The
XPath language was designed to "address parts of an XML
document". Here is a quick example that uses XPath (and the
same DATA filehandle):
use strict;
use XML::XPath;
my $xpath = XML::XPath->new(ioref => \*DATA);
my $nodes = $xpath->find('//table[@border=0][@align="center"]');
print $_->toString, "\n" for $nodes->get_nodelist;
How's that for 5 lines of code? ;) You can find a good XPath tutorial at http://www.zvon.org/xxl/XPathTutorial/General/examples.html,
by the way.
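To give a feel for how far a path expression can narrow things down, here is a small sketch that pulls attribute values instead of whole tables. The sample document is made up for illustration, and I pass it in with the xml option rather than a filehandle:

```perl
use strict;
use warnings;
use XML::XPath;

# Hypothetical sample document, not the poster's page.
my $xml = <<'XML';
<body>
<table border="0" align="center">
<tr><td colspan="3"><a href="/a.gif"><img src="/a.png" alt="a" /></a></td></tr>
</table>
<table border="1"><tr><td><img src="/b.png" alt="b" /></td></tr></table>
</body>
XML

my $xpath = XML::XPath->new(xml => $xml);

# Address only the images inside the borderless, centered table.
my @src;
for my $img ($xpath->findnodes('//table[@border=0][@align="center"]//img')) {
    push @src, $img->getAttribute('src');
}
print "$_\n" for @src;

# findvalue() gives the string value of whatever the path selects
# (a single href attribute in this document).
my $href = $xpath->findvalue('//table[@border=0]//a/@href');
print "$href\n";
```

The second table has border="1", so its image is never touched.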
It is important to use the right tool for the job, and I
think that XML::Twig and XML::XPath are
better tools for this job than the HTML parsers.
Hope this helps :)
jeffa
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)