Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser?

by jeffa (Bishop)
on Aug 23, 2003

in reply to HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser?

If the HTML you are parsing happens to be valid XML (they call that XHTML these days ;)) then you can use the uber module XML::Twig:

use strict;
use XML::Twig;

my $t = XML::Twig->new(
    twig_handlers => { table => \&handler },
    pretty_print  => 'indented',
);
$t->parse(\*DATA);

sub handler {
    my ($t, $table) = @_;
    $table->flush
        if  $table->att('border') == 0
        and $table->att('align') eq 'center';
}

__DATA__
<html xmlns="">
  <head>
    <title>XML::Twig table extract test</title>
  </head>
  <body>
    <table><tr><td>
      <table border="0" align="center">
        <tr align="center">
          <td colspan="3"><a href="/a.gif"><img src="/a.png" alt="a" /> </a></td>
        </tr>
      </table>
    </td></tr></table>
  </body>
</html>

However, the HTML you posted is not valid XHTML, so I ran it through HTML Tidy and embedded the result inside another table for testing purposes. You can always fetch the web page and call HTML Tidy externally, or you can install XML::LibXML and use the technique merlyn presents at HTML tidy, using XML::LibXML to clean up the HTML you have to parse.
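
For the XML::LibXML route, here is a minimal sketch of the general idea (my own illustration, not merlyn's code). It assumes a reasonably recent XML::LibXML that provides load_html, and the $html string is a made-up sample of tag soup rather than your actual page. The parser recovers from the sloppy markup, and the resulting tree can either be queried directly or serialized as well-formed XML for XML::Twig or XML::XPath:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample input: unquoted attributes, unclosed <td>/<tr>/<img>.
my $html = <<'HTML';
<html><body>
<table><tr><td>
<table border=0 align=center>
<tr align=center><td colspan=3><a href="/a.gif"><img src="/a.png" alt="a"></a>
</table>
</table>
</body></html>
HTML

# recover => 1 lets libxml2's HTML parser carry on past markup errors;
# the suppress_* options keep it from reporting them on STDERR.
my $doc = XML::LibXML->load_html(
    string            => $html,
    recover           => 1,
    suppress_errors   => 1,
    suppress_warnings => 1,
);

# The parsed tree can be queried with XPath right away ...
my @tables = $doc->findnodes('//table[@border="0"][@align="center"]');
print $_->toString, "\n" for @tables;

# ... or serialized as well-formed XML to hand to another XML module.
my $xml = $doc->toString;
```

If you already have XML::LibXML installed, this may save you the external call to HTML Tidy altogether.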

So, why use something like XML::Twig instead of an HTML::Parser? Because you are extracting a whole subset of the HTML rather than individual tags or text. Another good candidate module for this kind of work is XML::XPath. The XPath language was designed to "address parts of an XML document". Here is a quick example that uses XPath (and the same DATA filehandle):

use strict;
use XML::XPath;

my $xpath = XML::XPath->new(ioref => \*DATA);
my $nodes = $xpath->find('//table[@border=0][@align="center"]');
print XML::XPath::XMLParser::as_string($nodes->get_nodelist);

How's that for 5 lines of code? ;) You can find a good XPath tutorial at, by the way.

It is important to use the right tool for the job, and I think that XML::Twig and XML::XPath are better tools for this job than the HTML parsers.

Hope this helps :)


