jupe has asked for the wisdom of the Perl Monks concerning the following question:

I have been investigating how to parse in-the-wild HTML with various Perl modules, only to become a tad dismayed. What I need to do is parse a document and retrieve the contents of specific block-level elements. For example, in JavaScript I would write something like:
var elems = document.getElementsByClassName('className');
for (var i = 0; i < elems.length; i++) {
    alert(elems[i].innerHTML);
}
But I can't seem to find a sensible way of doing this in Perl. With HTML::Parser I've set up start_h and end_h event handlers, but the end_h event handler isn't smart enough to know when the specific associated end tag has appeared--so if it handles a specific <div>, I can't figure out a way to only send an event on the closing of that </div> and not any embedded ones. I am sure I'm missing something simple (wrong module likely), but my brain is frazzled enough I would appreciate some guidance. Thanks!!!

Replies are listed 'Best First'.
Re: Parsing semi-complex HTML
by ikegami (Patriarch) on Jul 07, 2010 at 00:24 UTC

    With XML::LibXML, it would be

    for my $node ($doc->findnodes('//*[@class="className"]')) { print($node->toString()); }

    If you want to use HTML::Parser (e.g. if the HTML isn't valid), don't use it directly. Use HTML::TreeBuilder instead. It creates a tree of HTML::Element objects, whose look_down and as_HTML methods you could use.
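
    A minimal sketch of the HTML::TreeBuilder route, assuming the markup is already in a string. The helper name elements_by_class and the sample markup are made up for illustration. Because look_down returns whole elements, nested <div>s stay inside their parent, so there is no need to pair start and end tags by hand.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the serialized HTML of every element
# whose class attribute contains $class as a whitespace-separated token.
sub elements_by_class {
    my ($html, $class) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look_down accepts a regex as the attribute value to match against.
    my @found = $tree->look_down(
        class => qr/(?:^|\s)\Q$class\E(?:\s|$)/,
    );

    # The empty hashref asks as_HTML to emit every end tag explicitly.
    my @out = map { $_->as_HTML('<>&', undef, {}) } @found;

    $tree->delete;    # release the tree once the strings are extracted
    return @out;
}

# Sample input, assumed for illustration only.
my $sample = '<div class="className">outer <div>inner</div></div>'
           . '<p class="other">skipped</p>';

print "$_\n" for elements_by_class($sample, 'className');
```

    Note that as_HTML gives the element including its own tags, like toString in the XML::LibXML version.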

      Actually, I've never had problems using XML::LibXML on broken HTML:

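      use XML::LibXML;

      my $parser = XML::LibXML->new();
      $parser->recover(1);             # keep going on malformed markup
      $parser->recover_silently(1);    # ...and don't print parse warnings
      my $doc = $parser->parse_html_string($stuff);    # $stuff holds the HTML source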
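
      Putting the recovering parser together with the findnodes call from the parent post might look like the sketch below. The helper name nodes_by_class and the sample markup are assumptions for illustration. The XPath differs from the parent's on purpose: it matches the class as one token in a space-separated list instead of requiring the whole attribute to equal it. It assumes the class name contains no quotes or spaces.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse possibly broken HTML and return the
# serialized form of every element carrying $class.
sub nodes_by_class {
    my ($html, $class) = @_;

    my $parser = XML::LibXML->new();
    $parser->recover(1);             # tolerate malformed markup
    $parser->recover_silently(1);    # suppress parser warnings

    my $doc = $parser->parse_html_string($html);

    # Pad the class attribute with spaces so " $class " matches a whole token.
    # Assumes $class has no quotes or whitespace in it.
    my $xpath = sprintf
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class;

    return map { $_->toString() } $doc->findnodes($xpath);
}

# Sample input, assumed for illustration only.
my $sample = '<div class="className extra">outer <div>inner</div></div>'
           . '<p class="other">skipped</p>';

print "$_\n" for nodes_by_class($sample, 'className');
```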

      Good Day,
          Dean

        Thanks, good to know! I never tried.
Re: Parsing semi-complex HTML
by Anonymous Monk on Jul 07, 2010 at 00:37 UTC
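    #!/usr/bin/perl --
    use strict;
    use warnings;
    use HTML::DOM;

    # Assumption: the HTML file to read is given as the first argument.
    my $filename = shift or die "usage: $0 file.html\n";

    my $dom_tree = HTML::DOM->new;
    $dom_tree->parse_file($filename);

    for my $node ( $dom_tree->getElementsByClassName('className') ) {
        print $node->innerHTML;
    }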
Re: Parsing semi-complex HTML
by Anonymous Monk on Jul 07, 2010 at 00:42 UTC
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use pQuery;

    pQuery("http://some.uri.com/")
        ->find("*[class~=className]")
        ->each(sub {
            my $i = shift;
            print $i + 1, ") ", pQuery($_)->html(), "\n";
        });
    __END__