While I am its maintainer, and thus biased, I firmly believe that HTML::HTML5::Parser is the best Perl HTML parser on the block. It's perhaps somewhat slower than HTML::Parser, but because it uses the same HTML5 parsing algorithm found in modern web browsers, it should do a better job on tag soup.
What's more, it parses the HTML into an XML::LibXML DOM tree, which I firmly believe is the best XML DOM for Perl (even though it's not pure Perl: it's based on libxml2, which is implemented in C).
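To make the tag soup point concrete, here's a minimal sketch that parses a deliberately broken string rather than a URL. The sample markup and the variable names are made up for illustration; it assumes HTML::HTML5::Parser and XML::LibXML are installed.

```perl
use 5.010;
use strict;
use warnings;
use HTML::HTML5::Parser;

# Made-up tag soup: no doctype, no <html> or <body>, unclosed <p> and <b>.
my $soup = '<title>Soup</title><p>first<p>second <b>bold';

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string($soup);   # an XML::LibXML::Document

# Elements land in the XHTML namespace, hence local-name() in the XPath.
my $title = $doc->findvalue('//*[local-name()="title"]');
my @paras = $doc->findnodes('//*[local-name()="p"]');

say "title: $title";
say 'paragraphs: ', scalar @paras;
```

Under the HTML5 algorithm the second <p> implicitly closes the first, so the tree should contain two sibling paragraphs, much as a browser would build it.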
I'm also the author of Web::Magic, which aims to integrate the two modules mentioned above with LWP::UserAgent and various other things to provide a "do what I mean" solution for interacting with RESTful HTTP resources. Here's an example using Web::Magic...
use 5.010;
use Web::Magic;
say Web::Magic
    -> new('http://www.perlmonks.org/', node_id => 974112)
    -> querySelector('title')
    -> textContent;
And here's an example of how you'd do something similar without Web::Magic...
use 5.010;
use HTML::HTML5::Parser;
my $xml = HTML::HTML5::Parser->load_html(
    location => 'http://www.perlmonks.org/?node_id=974112'
);
my $nodes = $xml->findnodes('//*[local-name()="title"]');
say $nodes->get_node(1)->textContent;
In reply to Re: extracting data from HTML
by tobyink
in thread extracting data from HTML
by Jurassic Monk