HTML::TreeBuilder

Item Description: Parser that builds a HTML syntax tree

Review Synopsis:

I've been happily using this module for a few months. If you dislike code that (ab)uses regular expressions to parse HTML, this module could be what you're looking for!

TreeBuilder uses HTML::Parser under the hood, and at the moment is fairly tightly coupled to HTML::Element, since it builds a tree of those objects if the parse is successful. (The author spoke recently on the libwww mailing list about making the module capable of building a tree of, say, subclassed HTML::Elements.)

The killer feature of this module is that it tries to parse HTML as a browser would, rather than treating all input HTML as supposedly perfectly compliant documents---which the majority of them are not! This is extremely useful. I have not seen a HTML parser for any other language that does anything like this.

Even though you'll use HTML::TreeBuilder, most of the functionality you'll want to use is in HTML::Element. The look_down() method is very useful---called on an Element, it searches down the tree looking for Elements that match a list of criteria. It's possible to specify a code reference as an argument (other forms of arguments are supported); Elements that pass the sub are returned (actually, in scalar context the first such Element is returned). Since look_down (and its sister, look_up, among many others) returns an Element, it's easy to search on successively more specific criteria for just what you want, and the code (written correctly) will keep working even if the HTML changes (I've used this pretty successfully to deduce the form contents required to fake a HTTPS login to HotMail---I'd post it here but there is too much LWP clutter in the way of what should be presented to show how this module shines).

The module also provides Tree cloning, cutting, and splicing functionality, much like you'd expect from a Document Object Model in other languages (or even Perl!). TreeBuilder objects can be converted to and from HTML and XML Element trees using the HTML::DOMbo module, by the same author. (I haven't used this functionality myself...yet.)

There are a few slight downsides to the module---at the moment it can't be usefully subclassed (a very minor problem); it's probably not as fast as searching your HTML with a regex; it may not even be as fast as `grepping' through parsed HTML via HTML::Parser directly. However I had to work with it quite extensively before I found any of these things even slightly problematic.

The author, Sean M. Burke <sburke@spinn.net>, maintains the code well, and is ready to answer questions on the LWP mailing list.

An excellent module that anyone dealing with HTML should become familiar with.