Parsing semi-complex HTML

jupe has asked for the wisdom of the Perl Monks concerning the following question:

I have been investigating how to parse in-the-wild HTML with various Perl modules, only to become a tad dismayed. What I need to do is parse a document and retrieve the contents of specific block-level elements. For example, in JavaScript I would write something like:

var elems = document.getElementsByClassName('className');
for (var i = 0; i < elems.length; i++) {
  alert(elems[i].innerHTML);  // index into the collection; the collection itself has no innerHTML
}

But I can't seem to find a sensible way of doing this in Perl. With HTML::Parser I've set up start_h and end_h event handlers, but the end_h event handler isn't smart enough to know when the specific associated end tag has appeared--so if it handles a specific <div>, I can't figure out a way to only send an event on the closing of that </div> and not any embedded ones. I am sure I'm missing something simple (wrong module likely), but my brain is frazzled enough I would appreciate some guidance. Thanks!!!

Replies are listed 'Best First'.
Re: Parsing semi-complex HTML by ikegami (Patriarch) on Jul 07, 2010 at 00:24 UTC
With XML::LibXML, it would be `for my $node ($doc->findnodes('//*[@class="className"]')) { print($node->toString()); }` If you want to use HTML::Parser (e.g. if the HTML isn't valid), don't use it directly. Use HTML::TreeBuilder instead. It creates a tree of HTML::Element objects, whose `look_down` and `as_HTML` methods you could use.
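A minimal sketch of the HTML::TreeBuilder route mentioned above, in case it helps. The helper name `treebuilder_blocks_by_class` and the sample markup are made up for illustration. It assumes you want to match `className` as one whitespace-separated token of the class attribute, which is a little looser than the exact-match XPath.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the thread): returns the serialized HTML of
# every element whose class attribute contains $class as a separate token.
sub treebuilder_blocks_by_class {
    my ($html, $class) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look_down accepts a regex as the attribute value, so class="a b"
    # matches "a" while class="ab" does not.
    my @elems = $tree->look_down(class => qr/(?:^|\s)\Q$class\E(?:\s|$)/);

    # as_HTML serializes the element together with all of its descendants,
    # so nested <div>s stay inside the block they belong to.
    my @blocks = map { $_->as_HTML } @elems;

    $tree->delete;    # release the tree explicitly
    return @blocks;
}

# Made-up sample input
my $sample = '<div class="className other">outer <div>inner</div></div>';
print "$_\n" for treebuilder_blocks_by_class($sample, 'className');
```

Because the tree already records which end tag closes which element, there is no need to count nested `<div>`s by hand as you would with raw HTML::Parser handlers.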
Re^2: Parsing semi-complex HTML by duelafn (Parson) on Jul 07, 2010 at 15:57 UTC
Actually, I've never had problems using XML::LibXML on broken HTML:

use XML::LibXML;
my $parser = XML::LibXML->new();
$parser->recover(1);
$parser->recover_silently(1);
my $doc = $parser->parse_html_string($stuff);

Good Day, Dean
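Putting the recovering parser together with the `findnodes` loop from earlier in the thread might look roughly like this. The helper name `libxml_blocks_by_class` and the sample input are invented for the example, and the XPath is a token-aware variant rather than the exact-match one quoted above. It assumes the class name contains no quotes or whitespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not from the thread): parses possibly broken HTML and
# returns the serialized form of every element carrying the class $class.
sub libxml_blocks_by_class {
    my ($html, $class) = @_;

    my $parser = XML::LibXML->new();
    $parser->recover(1);             # keep going on malformed markup
    $parser->recover_silently(1);    # and do not print parser warnings
    my $doc = $parser->parse_html_string($html);

    # Match $class as one whitespace-separated token of @class, so that
    # class="a b" matches "a" but class="ab" does not.
    # Assumption: $class contains no quotes or whitespace.
    my $xpath = sprintf
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class;

    return map { $_->toString() } $doc->findnodes($xpath);
}

# Made-up sample with an unclosed <b> tag
my $sample = '<div class="className other">hi <b>there</div>';
print "$_\n" for libxml_blocks_by_class($sample, 'className');
```

As far as I know, `toString` on a node serializes with XML rules, so empty elements may come out self-closed.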
Re^3: Parsing semi-complex HTML by ikegami (Patriarch) on Jul 07, 2010 at 16:11 UTC
Thanks, good to know! I never tried.
Re^4: Parsing semi-complex HTML by Your Mother (Archbishop) on Jul 07, 2010 at 16:30 UTC
Re^5: Parsing semi-complex HTML by ikegami (Patriarch) on Jul 07, 2010 at 16:40 UTC
Re: Parsing semi-complex HTML by Anonymous Monk on Jul 07, 2010 at 00:37 UTC
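#!/usr/bin/perl --
use strict;
use warnings;
use HTML::DOM;

# Path of the HTML file to read, taken from the command line
# (this declaration was missing; under strict the script needs it)
my $filename = shift or die "usage: $0 file.html\n";

my $dom_tree = HTML::DOM->new;
$dom_tree->parse_file($filename);

for my $node ( $dom_tree->getElementsByClassName('className') ) {
    print $node->innerHTML;
}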
Re: Parsing semi-complex HTML by Anonymous Monk on Jul 07, 2010 at 00:42 UTC
#!/usr/bin/perl --
use strict;
use warnings;
use pQuery;

pQuery("http://some.uri.com/")
    ->find("*[class~=className]")
    ->each(sub {
        my $i = shift;
        print $i + 1, ") ", pQuery($_)->html(), "\n";
    });
__END__