Re^3: HTML::Parser fun

Of course it's important to arrive at the wrong answer as fast as possible :). Most likely, the solutions are all slow because they build a complete DOM tree from the HTML, which gets expensive for large enough HTML files.

On the other hand, I had to look at your output, because I couldn't tell from your code what you want to extract and what you don't. Your code hides the extraction rules quite deep, while the XPath expressions reduce the code mostly to the extraction rules plus some boilerplate. Maybe you can keep the speed and gain some expressiveness by using a stream-oriented parser like XML::Twig (built on XML::Parser rather than SAX proper), which is meant for attaching handlers to elements as they are parsed, without keeping the whole document in memory.
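To make that concrete, here is a minimal sketch of the handler style. The sample markup, the class name "item" and the @links array are made up for illustration and are not taken from your code. It also assumes the input is well-formed XML/XHTML, which XML::Twig requires and real-world HTML often is not.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample input; it must be well-formed for XML::Twig.
my $xhtml = <<'END';
<html><body>
<div class="item"><a href="/one">First</a></div>
<div class="other"><a href="/skip">Ignored</a></div>
<div class="item"><a href="/two">Second</a></div>
</body></html>
END

my @links;

my $twig = XML::Twig->new(
    twig_handlers => {
        # The extraction rule is this one condition: fire once for each
        # completely parsed <div class="item">.
        'div[@class="item"]' => sub {
            my ($t, $div) = @_;
            my $link = $div->first_child('a') or return;
            push @links, [ $link->att('href'), $link->text ];
            $t->purge;    # release everything parsed so far
        },
    },
);
$twig->parse($xhtml);

printf "%s => %s\n", @$_ for @links;
```

The rule for what gets extracted sits in the handler key, and the purge call is what keeps memory flat on big documents. Whether this is faster than your HTML::Parser version is something you would have to benchmark.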


Re^4: HTML::Parser fun by FreakyGreenLeaky (Sexton) on Jun 04, 2008 at 13:14 UTC
Hmm, XML::Twig looks interesting, thanks! HTML::Parser is probably overkill for this simple task. I use it elsewhere to extract all HTML tags and their content, etc., and there its performance is excellent (we're processing hundreds of millions of HTML docs, hence my need for speed).