
If the HTML you are parsing happens to be valid XML (they call that XHTML these days ;)) then you can use the uber module XML::Twig:

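use strict;
use warnings;
use XML::Twig;

my $t = XML::Twig->new(
   # handler() is called each time a <table> element has been fully parsed
   twig_handlers => {table => \&handler},
   pretty_print => 'indented',
);
$t->parse(\*DATA);

sub handler {
   my($t,$table) = @_;
   my $border = $table->att('border');
   my $align  = $table->att('align');
   # a missing attribute comes back as undef, and undef == 0 is true,
   # so check definedness before comparing
   # print (rather than flush) outputs only this table, not everything parsed so far
   $table->print
      if  defined $border and $border == 0
      and defined $align  and $align eq 'center'
   ;
}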

__DATA__
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>XML::Twig table extract test</title>
</head>
<body>
<table><tr><td>
   <table border="0" align="center">
   <tr align="center">
   <td colspan="3"><a href="/a.gif"><img src="/a.png" alt="a" />
   </a></td>
   </tr>
   </table>
</td></tr></table>
</body>
</html>

But the HTML you posted is not valid XHTML. I ran your HTML through HTML Tidy and embedded it inside another table for testing purposes. You can always fetch the web page and call HTML Tidy externally, or you can install XML::LibXML and use the technique merlyn presents in "HTML tidy, using XML::LibXML" to clean up the HTML you have to parse.
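For the XML::LibXML route, here is a minimal sketch of the general idea: let libxml2's forgiving HTML parser read the page, then serialize the tree back out as XML. This is my own illustration, not necessarily merlyn's exact code. The sloppy input is made up, and it assumes an XML::LibXML recent enough to have load_html.

```perl
use strict;
use warnings;
use XML::LibXML;

# Deliberately sloppy HTML, made up for this sketch:
# unquoted attributes and unclosed tags.
my $html = <<'HTML';
<html><body>
<table border=0 align=center>
<tr><td>one<br>two
</table>
</body></html>
HTML

# recover => 1 lets the HTML parser repair what it can;
# suppress_errors keeps the repairs from being reported on STDERR.
my $doc = XML::LibXML->load_html(
   string          => $html,
   recover         => 1,
   suppress_errors => 1,
);

# toString serializes the tree as XML, so every tag is closed
# and every attribute value is quoted.
print $doc->toString;
```

The output should be well-formed enough to hand to XML::Twig or XML::XPath, though really broken pages may still need a look by hand.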

So, why use something like XML::Twig instead of an HTML::Parser? Because you are extracting a whole subset of the HTML instead of individual tags or text. Another good candidate module for this kind of work is XML::XPath. The XPath language was designed to "address parts of an XML document". Here is a quick example that uses XPath (and the same DATA filehandle):

use strict;
use XML::XPath;

my $xpath = XML::XPath->new(ioref => \*DATA);
my $nodes = $xpath->find('//table[@border=0][@align="center"]');
print XML::XPath::XMLParser::as_string($nodes->get_nodelist);

How's that for 5 lines of code? ;) You can find a good XPath tutorial at http://www.zvon.org/xxl/XPathTutorial/General/examples.html, by the way.
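One caveat: as far as I can tell, as_string works on a single node, so the print above is fine for one matching table but will not show every match if there are several. A small sketch that loops over all the matches instead (the three-table document here is made up for illustration):

```perl
use strict;
use warnings;
use XML::XPath;

# Made-up document: two tables that should match and one that should not.
my $xml = <<'XML';
<body>
<table border="0" align="center"><tr><td>first</td></tr></table>
<table border="1" align="center"><tr><td>skipped</td></tr></table>
<table border="0" align="center"><tr><td>second</td></tr></table>
</body>
XML

my $xpath = XML::XPath->new(xml => $xml);

# In list context findnodes returns the matching nodes as a plain list,
# so each table can be printed (or processed) in turn.
for my $table ($xpath->findnodes('//table[@border=0][@align="center"]')) {
   print $table->toString, "\n";
}
```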

It is important to use the right tool for the job, and I think that XML::Twig and XML::XPath are better tools for this job than the HTML parsers.

Hope this helps :)

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

In reply to Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser? by jeffa
in thread HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser? by Starman
