
I've been happily using this module for a few months. If you dislike code that (ab)uses regular expressions to parse HTML, this module could be what you're looking for!

TreeBuilder uses HTML::Parser under the hood, and at the moment is fairly tightly coupled to HTML::Element, since it builds a tree of those objects if the parse is successful. (The author spoke recently on the libwww mailing list about making the module capable of building a tree of, say, subclassed HTML::Elements.)

The killer feature of this module is that it tries to parse HTML as a browser would, rather than treating all input HTML as supposedly perfectly compliant documents, which the majority of them are not! This is extremely useful. I have not seen an HTML parser for any other language that does anything like this.
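To give a feel for the forgiving parse, here is a minimal sketch, assuming a reasonably recent HTML::TreeBuilder is installed. The sloppy input string is made up for illustration; it has no html or body tags and never closes its p or li tags, yet the parser still builds a usable tree.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Deliberately sloppy, made-up input: no <html>/<body>,
# and the <p> and <li> tags are never closed.
my $soup = '<title>Demo</title><p>First<p>Second<ul><li>one<li>two</ul>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($soup);
$tree->eof;    # tell the parser there is no more input

# Count what the parser made of the unclosed tags.
my @paras = $tree->look_down(_tag => 'p');
my @items = $tree->look_down(_tag => 'li');
print scalar(@paras), " paragraphs, ", scalar(@items), " list items\n";

# Dump the repaired tree, indented with two spaces per level.
print $tree->as_HTML(undef, '  '), "\n";

$tree->delete;    # release the tree when finished with it
```

The dumped HTML shows the html, head and body elements that the parser implied, which is the quickest way to see how it interpreted a broken page.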

Even though you'll use HTML::TreeBuilder to build the tree, most of the functionality you'll want is in HTML::Element. The look_down() method is very useful: called on an Element, it searches down the tree for Elements that match a list of criteria. One form of criterion is a code reference (other forms are supported); Elements that pass the sub are returned, or in scalar context just the first such Element. Since look_down (and its sister look_up, among many others) returns an Element, it's easy to search on successively more specific criteria for just what you want, and the code (written correctly) will keep working even if the HTML changes. I've used this pretty successfully to deduce the form contents required to fake an HTTPS login to HotMail. I'd post it here, but there is too much LWP clutter in the way of what should be presented to show how this module shines.
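As a stand-in for that login code, here is a small sketch of successive look_down calls with the LWP part left out. The page, the form names and the field names are all invented for the example; only the look_down, look_up and attr calls come from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented page with two forms; only one of them is the login form.
my $html = <<'HTML';
<html><body>
<form name="search" action="/find"><input type="text" name="q"></form>
<form name="login" action="/login" method="post">
  <input type="text" name="user">
  <input type="password" name="pass">
  <input type="hidden" name="token" value="abc123">
</form>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Step 1: scalar context returns the first <form> that passes the sub,
# i.e. the first form containing a password field.
my $form = $tree->look_down(
    _tag => 'form',
    sub {
        my $hit = $_[0]->look_down(_tag => 'input', type => 'password');
        return defined $hit;
    },
);
die "no login form found\n" unless $form;

# Step 2: list context, searching only inside that form.
my %fields;
for my $input ($form->look_down(_tag => 'input')) {
    my $name = $input->attr('name');
    next unless defined $name;
    my $value = $input->attr('value');
    $fields{$name} = defined $value ? $value : '';
}

# look_up goes the other way: from a field back to its enclosing form.
my $pass  = $form->look_down(_tag => 'input', name => 'pass');
my $owner = $pass->look_up(_tag => 'form');

print "form: ", $owner->attr('name'), "\n";
print "$_ = '$fields{$_}'\n" for sort keys %fields;
```

Because the form is picked by what it contains rather than by its position on the page, the search keeps working if the site adds another form above it.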

The module also provides Tree cloning, cutting, and splicing functionality, much like you'd expect from a Document Object Model in other languages (or even Perl!). HTML::Element and XML::Element trees can be converted to and from XML::DOM trees using the HTML::DOMbo module, by the same author. (I haven't used this functionality myself...yet.)
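A rough sketch of the cloning, cutting and splicing calls, again on made-up markup (the ids and list items are purely illustrative). It uses clone, detach, splice_content and push_content from HTML::Element; HTML::DOMbo is not shown.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><body><ul id="menu"><li>Home</li><li>News</li>'
           . '<li>Old</li></ul><div id="footer"></div></body></html>');
$tree->eof;

my $menu   = $tree->look_down(_tag => 'ul',  id => 'menu');
my $footer = $tree->look_down(_tag => 'div', id => 'footer');

# Cloning: a deep copy that can be changed without touching the original.
my $copy = $menu->clone;
$copy->attr(id => 'footer-menu');
$footer->push_content($copy);

# Cutting: detach() unlinks a node from its parent; the node itself
# stays usable. (Its return value is the old parent, so it is ignored.)
my ($old) = grep { $_->as_text eq 'Old' } $menu->content_list;
$old->detach;

# Splicing: insert a new <li> at index 1 of the menu's content list.
my $about = HTML::Element->new('li');
$about->push_content('About');
$menu->splice_content(1, 0, $about);

print join(', ', map { $_->as_text } $menu->content_list), "\n";
print join(', ', map { $_->as_text } $copy->content_list), "\n";
```

The clone in the footer is unaffected by the later edits to the original menu, which is the point of making a deep copy first.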

There are a few slight downsides to the module: at the moment it can't be usefully subclassed (a very minor problem); it's probably not as fast as searching your HTML with a regex; and it may not even be as fast as 'grepping' through parsed HTML via HTML::Parser directly. However, I had to work with it quite extensively before I found any of these things even slightly problematic.

The author, Sean M. Burke <sburke@spinn.net>, maintains the code well, and is ready to answer questions on the LWP mailing list.

An excellent module that anyone dealing with HTML should become familiar with.

In reply to HTML::TreeBuilder by Nooks
