in reply to More efficient use of HTML::TokeParser::Simple

If the HTML you are processing is modest in size, you might consider HTML::TreeBuilder, which lets you search for elements using various match criteria and may clean up code that needs to skip about the document.
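
As a rough illustration of the "various match criteria" idea, here is a minimal sketch using look_down. The sample markup, the class names and the variable names are made up for the example and are not taken from the original question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup, invented purely for illustration.
my $html = <<'HTML';
<html><head><title>Sample</title></head>
<body>
<h1 class="main">Header 1</h1>
<p>First paragraph</p>
<h1>Header 2</h1>
<p class="note">Second paragraph</p>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Criterion 1: tag name plus an attribute value.
my @main = $tree->look_down(_tag => 'h1', class => 'main');

# Criterion 2: tag name plus a code ref that inspects each candidate.
my @second = $tree->look_down(
    _tag => 'p',
    sub { $_[0]->as_text =~ /Second/ },
);

# Copy the text out before the tree is freed.
my @texts = map { $_->as_text } @main, @second;
print "$_\n" for @texts;

$tree->delete;    # release the tree once finished with it
```

In list context look_down returns every matching element, so the same call covers both "find the first" and "find them all" depending on how the result is assigned.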


DWIM is Perl's answer to Gödel
Re^2: More efficient use of HTML::TokeParser::Simple
by henka (Novice) on Jul 11, 2006 at 06:17 UTC
    I poked around HTML::TreeBuilder, but my goodness, things are complicated. It may not seem like it to seasoned monks, but to a C programmer, the OO aspects and data structures of Perl are, well, daunting. Gleaning how to do something as simple as the one I posted here from the Perl module docs is almost always an exercise in frustration.

      Here's a trivial example that seems to do something like what you want and may be enough to get you started with TreeBuilder:

      use warnings;
      use strict;
      use HTML::TreeBuilder;

      my $html = do {local $/; <DATA>};
      my $tree = HTML::TreeBuilder->new ();

      $tree->parse ($html);
      $tree->eof ();
      $tree->elementify();

      my ($title) = $tree->find ('title');
      my @h1 = $tree->find ('h1');

      print $title->as_text (), "\n";
      print $_->as_text (), "\n" for @h1;

      __DATA__
      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
      <!-- Took this out for IE6ites "http://www.w3.org/TR/REC-html40/loose.dtd" -->
      <html lang="en">
      <head>
      <title>More efficient use of HTML::TokeParser::Simple perlquestion id:560199</title>
      </head>
      <body>
      <h1>Header 1</h1>
      <p>First paragraph</p>
      <h1>Header 2</h1>
      <p>Second paragraph</p>
      <h2>Level 2 header 1</h2>
      </body>
      </html>

      Prints:

      More efficient use of HTML::TokeParser::Simple perlquestion id:560199
      Header 1
      Header 2

        What does
        $tree->elementify();
        do here? It appears to run ok if it is commented out. I've often seen it in snippets and have no idea what purpose it serves.
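
One way to see what the call changes is to check the object's class before and after it. This is only a sketch based on my reading of the HTML::TreeBuilder documentation, which describes elementify as turning the builder object into an ordinary element of the tree; the one-line markup and the variable names are made up for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><body><h1>Header 1</h1></body></html>');
$tree->eof;

my $before = ref $tree;    # class while the object can still act as a parser
$tree->elementify;
my $after  = ref $tree;    # class after the parser role has been dropped

print "before: $before\n";
print "after:  $after\n";

# Searching behaves the same either way, which would explain why
# commenting the call out makes no visible difference in the snippet.
my $h1_text = $tree->find('h1')->as_text;
print "$h1_text\n";

$tree->delete;
```

If that reading is right, the call matters mainly when the tree is handed to code that expects plain elements, or when you want to be sure nothing parses into it again.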