in reply to HTML::Parser fun

I haven't read your code closely, but I get the feeling that you're trying to use HTML::Parser to extract the text below certain tags, especially the text contained in <a> tags together with the href attribute. In my opinion, HTML::Parser is a rather unwieldy tool for extracting text below tags - I prefer to use XPath expressions for such tasks. Depending on what tools you have available, you might want to use XML::LibXML, Web::Scraper or HTML::TreeBuilder::XPath (which is what Web::Scraper uses) to run the XPath expressions. Here is how I would write the link+text extraction using Web::Scraper:

use strict;
use Web::Scraper;
use Data::Dumper;

my $html = join "", <DATA>;

# Invoked for every <a> tag
my $link = scraper {
    process '//a' => 'href'        => '@href';
    process '//a' => 'description' => 'TEXT';
};

my $page = scraper {
    process '//a[@href]'          => 'links[]' => $link;
    process '//meta[@http-equiv]' => 'meta[]'  => '@content';
    process '//area[@href]'       => 'areas[]' => '@href';
};

my $info = $page->scrape($html);
print Dumper $info;

__DATA__
... your html ...

Update: After some quick browsing of CPAN for XPath, I found mirod's XML::XPathEngine, and, quite unsurprisingly, XML::Twig already understands XPath expressions. So it should be quite feasible to jury-rig Web::Scraper to use XML::Twig instead of HTML::TreeBuilder as the underlying parsing engine. Or you might just use XML::Twig directly.
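For illustration, here's a rough sketch of what using XML::Twig directly could look like. The handler condition, the hash of collected links and the sample HTML are all my own invention, not from the original code, and parse_html needs HTML::TreeBuilder installed to do the HTML-to-tree conversion:

```perl
use strict;
use warnings;
use XML::Twig;

my $html = <<'HTML';
<html><body>
<a href="http://example.com/">Example</a>
<a href="/about">About us</a>
</body></html>
HTML

my %links;  # href => link text

my $twig = XML::Twig->new(
    twig_handlers => {
        # fires once for each <a> element that has an href attribute,
        # as soon as that element has been fully parsed
        'a[@href]' => sub {
            my ($t, $a) = @_;
            $links{ $a->att('href') } = $a->text;
            $t->purge;  # discard what has been parsed so far to save memory
        },
    },
);

$twig->parse_html($html);  # requires HTML::TreeBuilder

print "$_ => $links{$_}\n" for sort keys %links;
```

The purge call is what keeps memory bounded on large documents - each handled chunk is thrown away once its rule has fired.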

Re^2: HTML::Parser fun
by FreakyGreenLeaky (Sexton) on Jun 04, 2008 at 12:47 UTC
    Thanks for the suggestions, will check it out. I'm using HTML::Parser for performance reasons. Everything else that I've tried is several orders of magnitude slower.

      Of course it's important to arrive at the wrong answer as fast as possible :). Most likely, the other solutions are all slow because they load the whole HTML document into a DOM, which gets slow for large enough HTML files.

      On the other hand, I had to look at your output, because I couldn't tell from your code what you want to extract and what you don't. Your code buries the extraction rules quite deep, while the XPath expressions reduce the code to mostly the extraction rules plus some boilerplate. Maybe you can keep the speed and gain some expressiveness by using a stream-based parser like XML::Twig, which is meant for applying downward rules without loading the whole document.

        Hmm, XML::Twig looks interesting, thanks!

        HTML::Parser is probably overkill for this simple task. I use it elsewhere to extract all HTML tags and their content, etc, and there its performance is excellent (we're processing hundreds of millions of HTML docs, hence my need for speed).

      I have no benchmarks, but I would logically expect XML::LibXML to be as fast as or faster than HTML::Parser. They're both written in C, and libxml2 is more mature, with more eyeballs involved. The only issue I see is that while it can parse some broken HTML, it's not as forgiving in that regard as HTML::Parser.
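For what it's worth, a minimal sketch of the XML::LibXML route with recovery turned on (recover => 1 tells libxml2 to soldier on through moderately broken markup instead of dying; the sample HTML with its unclosed <p> is made up for illustration):

```perl
use strict;
use warnings;
use XML::LibXML;

# deliberately sloppy HTML: unclosed <p>, no closing </html>
my $html = '<html><body><a href="/x">X</a><p>broken<a href="/y">Y</a></body>';

# recover => 1 makes libxml2 tolerate the broken markup
my $dom = XML::LibXML->load_html(
    string  => $html,
    recover => 1,
);

for my $a ($dom->findnodes('//a[@href]')) {
    printf "%s => %s\n", $a->getAttribute('href'), $a->textContent;
}
```

How far the recovery stretches depends on just how broken the input is - truly pathological real-world HTML is where HTML::Parser still has the edge.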