corpx has asked for the wisdom of the Perl Monks concerning the following question:

Currently, I have this:

$tree = HTML::Tree->new();
$tree->parse($child_page);
$body = $tree->look_down( '_tag' , 'body' );

However, the returned string includes the <body> tags. How can I get just the contents inside the <body> tags using HTML::Tree? I realize that a regex is a possibility, but I was told that parsing HTML with regexes is not a good idea.

Replies are listed 'Best First'.
Re: How would I extract body from an html page
by ikegami (Patriarch) on Jul 21, 2009 at 21:30 UTC
    I faced the same problem once upon a time. In the DOM, what you want is named "innerHTML". Unfortunately, HTML::Element doesn't provide such a function. I ended up extracting a bit of code from HTML::Element.
    use strict;
    use warnings;

    use HTML::Entities    qw( encode_entities );
    use HTML::Tagset      qw( );
    use HTML::TreeBuilder qw( );
    use Object::Destroyer qw( );

    sub extract_html {
        # It would be better if we had access to the unparsed text,
        # but this will do as long as the parser doesn't change.

        my $html = '';

        local *helper = sub {
            my ($node) = @_;
            if (!ref($node)) {
                $html .= encode_entities($node);
                return;
            }

            my $tag = $node->tag();
            $html .= $node->starttag();
            helper($_) for $node->content_list();
            $html .= $node->endtag()
                if !$HTML::Tagset::emptyElement{$tag}
                && !$HTML::Tagset::optionalEndTag{$tag};
        };

        my $node = @_ ? $_[0] : $_;
        helper($_) for $node->content_list();
        return $html;
    }

    {
        my $tree = HTML::TreeBuilder->new();
        $tree = Object::Destroyer->new($tree, 'delete');
        $tree->parse_content(<<'__EOI__');
    <html>
    <head>
    <title>Foo</title>
    </head>
    <body>
    <h1>Foo</h1>
    <p>Not bar
    </body>
    </html>
    __EOI__

        print extract_html( $tree->look_down( '_tag' , 'body' ) );
    }

    <h1>Foo</h1><p>Not bar
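
    The missing </p> in that output comes from the end-tag test inside extract_html(), which consults two lookup tables in HTML::Tagset. A minimal sketch of that test in isolation follows; emits_end_tag() is a hypothetical helper name used only for illustration, and the tag list is arbitrary.

```perl
use strict;
use warnings;

use HTML::Tagset ();

# Hypothetical helper: true when the end-tag condition used in
# extract_html() would write a closing tag for this element name.
sub emits_end_tag {
    my ($tag) = @_;
    return !$HTML::Tagset::emptyElement{$tag}
        && !$HTML::Tagset::optionalEndTag{$tag};
}

# Report the decision for a few sample element names.
printf "%-4s %s\n", $_, emits_end_tag($_) ? 'end tag written' : 'end tag skipped'
    for qw( h1 p br li div );
```

    If every end tag should be written, dropping the optionalEndTag half of the condition is probably enough, though that has not been checked against every element type.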
Re: How would I extract body from an html page
by wfsp (Abbot) on Jul 22, 2009 at 06:56 UTC
    HTML::Element's $ele->detach_content could help you with that.
    #! /usr/bin/perl
    use strict;
    use warnings;

    use HTML::TreeBuilder;

    my $t = HTML::TreeBuilder->new_from_content(do{local $/;<DATA>});
    my $body = $t->look_down(_tag => q{body});
    my @content = $body->detach_content;
    print $_->as_HTML for @content;

    __DATA__
    <html>
    <head><title>title</title></head>
    <body>
    <h1>heading one</h1>
    <p>paragraph <b>bold</b></p>
    <p>paragraph</p>
    </body>
    </html>

    <h1>heading one</h1> <p>paragraph <b>bold</b> <p>paragraph
    See also rhesa's snippet for a discussion of optional end tags and XHTML empty tags, if those are a concern.
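
    Two things to keep in mind with the detach_content approach: it unlinks the children from <body>, so the tree is modified, and any bare text directly under <body> comes back as a plain string, on which ->as_HTML cannot be called. A rough sketch of a non-destructive variant is below. The name inner_html() and the sample markup are assumptions made for illustration; passing an empty hash as the third argument to as_HTML asks it to treat no end tag as optional.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Entities    qw( encode_entities );
use HTML::TreeBuilder ();

# Hypothetical helper: serialize the children of an element without
# detaching them. Element children are serialized with as_HTML; plain
# text children are re-encoded so that &, < and > stay valid markup.
sub inner_html {
    my ($element) = @_;
    return join '',
        map { ref $_ ? $_->as_HTML( '<>&', undef, {} ) : encode_entities( $_, '<>&' ) }
        $element->content_list;
}

my $tree = HTML::TreeBuilder->new_from_content(
    '<html><head><title>t</title></head>'
  . '<body><h1>Hi</h1>loose text &amp; more<p>para</p></body></html>'
);
my $body = $tree->look_down( _tag => 'body' );

print inner_html($body), "\n";

# HTML::TreeBuilder trees should be deleted explicitly when done.
$tree->delete;
```

    Because content_list only reads the children, $body can still be used afterwards, which is the main reason to prefer it over detach_content when the tree is needed again.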
      Thanks guys :)