How would I extract body from an html page

corpx has asked for the wisdom of the Perl Monks concerning the following question:

Currently, I have this

$tree = HTML::Tree->new();
$tree->parse($child_page);
$body = $tree->look_down( '_tag' , 'body' );

However, the returned string still has the <body> tags in it. How can I get just the contents inside the <body> tags using HTML::Tree? I realize that a regex is a possibility, but I was told that parsing HTML with a regex is not a good idea.


Re: How would I extract body from an html page by ikegami (Patriarch) on Jul 21, 2009 at 21:30 UTC
I faced the same problem once upon a time. In the DOM, what you want is named "innerHTML". Unfortunately, HTML::Element doesn't provide such a function. I ended up extracting a bit of code from HTML::Element.

use strict;
use warnings;

use HTML::Entities    qw( encode_entities );
use HTML::Tagset      qw( );
use HTML::TreeBuilder qw( );
use Object::Destroyer qw( );

sub extract_html {
    # It would be better if we had access to the unparsed text,
    # but this will do as long as the parser doesn't change.

    my $html = '';

    local *helper = sub {
        my ($node) = @_;
        if (!ref($node)) {
            $html .= encode_entities($node);
            return;
        }

        my $tag = $node->tag();
        $html .= $node->starttag();
        helper($_) for $node->content_list();
        $html .= $node->endtag()
            if !$HTML::Tagset::emptyElement{$tag}
            && !$HTML::Tagset::optionalEndTag{$tag};
    };

    my $node = @_ ? $_[0] : $_;
    helper($_) for $node->content_list();

    return $html;
}

{
    my $tree = HTML::TreeBuilder->new();
    $tree = Object::Destroyer->new($tree, 'delete');
    $tree->parse_content(<<'__EOI__');
<html>
<head>
<title>Foo</title>
</head>
<body>
<h1>Foo</h1>
<p>Not bar
</body>
</html>
__EOI__

    print extract_html( $tree->look_down( '_tag' , 'body' ) );
}

Output:

`<h1>Foo</h1><p>Not bar`
Re: How would I extract body from an html page by wfsp (Abbot) on Jul 22, 2009 at 06:56 UTC
HTML::Element's `$ele->detach_content` could help you with that.

#! /usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

my $t = HTML::TreeBuilder->new_from_content(do{local $/;<DATA>});
my $body = $t->look_down(_tag => q{body});
my @content = $body->detach_content;
print $_->as_HTML for @content;

__DATA__
<html>
  <head><title>title</title></head>
  <body>
    <h1>heading one</h1>
    <p>paragraph <b>bold</b></p>
    <p>paragraph</p>
  </body>
</html>

Output:

`<h1>heading one</h1> <p>paragraph <b>bold</b> <p>paragraph`

See also rhesa's snippet for a discussion on optional tags and xhtml empty tags if this is a concern.
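One caveat with the above: text sitting directly inside <body> comes back from the content list as a plain string, not an element, so calling `as_HTML` on it would fail. A minimal sketch of a variant that copes with that, and leaves the tree intact by using `content_list` instead of detaching, might look like the following. The helper name `inner_html` is made up for illustration and is not part of HTML::Element; passing an empty hash ref as the third argument to `as_HTML` is there to ask for all end tags to be emitted.

```perl
use strict;
use warnings;

use HTML::Entities    qw( encode_entities );
use HTML::TreeBuilder qw( );

# Hypothetical helper (not provided by HTML::Element): serialize the
# children of an element without the element's own start and end tags.
sub inner_html {
    my ($element) = @_;
    return join '', map {
        ref($_)
            ? $_->as_HTML(undef, undef, {})  # child element: emit every end tag
            : encode_entities($_)            # plain text child: re-escape it
    } $element->content_list();
}

my $tree = HTML::TreeBuilder->new_from_content(
    '<html><head><title>t</title></head>'
  . '<body>intro <h1>Foo</h1><p>Not bar</body></html>'
);

my $body = $tree->look_down( _tag => 'body' );
print inner_html($body), "\n";

$tree->delete();   # free the tree explicitly when done
```

Exact whitespace in the result depends on how the parser treats ignorable whitespace, so treat this as a starting point rather than a drop-in replacement.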
Re^2: How would I extract body from an html page by corpx (Acolyte) on Jul 25, 2009 at 19:34 UTC
Thanks guys :)