Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

It looks like the two main distributions that people use for parsing HTML are HTML-Tree (HTML::TreeBuilder) and HTML-Parser.

The HTML-Tree distribution was last updated in 2006. Is it still a good choice?

The HTML-Tree tutorial was written in 2003 and is quite short (and doesn't directly use HTML::TreeBuilder).

The HTML::TokeParser tutorial was written in 2001. Aside from its age, it also has no comments. Is it still accurate?

Are there any current and complete tutorials for either of these modules? If not, could the Monastery maybe use a refreshed tutorial or two?

Re: Request: Current and more complete HTML parse parsing tutorial
by desemondo (Hermit) on Apr 20, 2010 at 22:33 UTC
    Not trying to be funny, but if you think the tutorials are a little out of date and you're trying to learn the ropes in that area, then you're actually in a pretty good position to update them and give a little something back to the Monastery ;) Just post your draft in Meditations with "RFC: <title>" in the title field.

    That still leaves the question of which one is better. Since both are included in ActivePerl releases* (at least as of 5.10.1), they should both be reasonable for most tasks.

    * Originally "now core modules". Small assumption there, my apologies.
      Since both are now core modules (at least as of 5.10.1)

      I've got 5.10.1 installed and don't see those anywhere. Also, they are not listed at http://perldoc.perl.org/index-modules-H.html. It doesn't look like they are actually core Perl modules.

      Thanks for the advice about writing a tutorial. May try that.
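
For anyone who wants to verify the core-module question rather than browse the perldoc index, here is a minimal sketch using Module::CoreList, which itself ships with Perl from 5.8.9 onward. The helper name `core_status` and the choice of File::Spec as a known-core comparison are illustrative assumptions, not something from this thread.

```perl
use strict;
use warnings;
use Module::CoreList;

# Hypothetical helper: report whether a module has ever shipped with perl.
# first_release() returns the first perl version that bundled the module,
# or undef if the module has never been part of the core distribution.
sub core_status {
    my ($module) = @_;
    my $first = Module::CoreList->first_release($module);
    return defined $first ? "core since perl $first" : "not in core";
}

# File::Spec is included only as a module known to be in core, for contrast.
for my $module (qw(HTML::Parser HTML::TreeBuilder File::Spec)) {
    printf "%-20s %s\n", $module, core_status($module);
}
```

Both HTML modules should come back as "not in core", which matches what the poster above found on 5.10.1. The `corelist` command-line tool that accompanies Module::CoreList answers the same question without a script.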

Re: Request: Current and more complete HTML parse parsing tutorial
by blakew (Monk) on Apr 21, 2010 at 05:04 UTC
    Have you taken a look at the tutorials in the HTML::Tree distribution? Most relevant, HTML::Tree::Scanning? I have found both the tutorials and documentation in both distributions you mention pretty satisfactory.

      I couldn't make heads or tails of the HTML-Tree documentation; however, I hadn't noticed that HTML::Tree::Scanning is an article. Will read it. Thanks!

Re: Request: Current and more complete HTML parse parsing tutorial
by Anonymous Monk on Apr 23, 2010 at 20:24 UTC
    You might also want to have a look at XML::LibXML, which includes an HTML parser.