SneakZa has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, I'm trying to parse HTML and capture only the contents of the body, excluding the body tags themselves. Below is the code I've come up with (borrowed from another post, actually). In the start, text and end handlers I'm trying to append the text to a new variable, with no luck: if I print my variable inside each subroutine it prints, but outside of them it's empty. Any idea how else I could capture this data?
    my $p = HTML::Parser->new( api_version => 3 );
    $p->handler( start => \&start_handler, "self,tagname,attr" );
    $p->parse($content);
    exit;

    my $inner_body;

    sub start_handler {
        my $self    = shift;
        my $tagname = shift;
        my $attr    = shift;
        my $text    = shift;
        my $inner_body;
        return unless ( $tagname eq 'body' );
        $self->handler( start => sub { my ($text) = @_; $inner_body = $inner_body . $text; }, "text" );
        $self->handler( text  => sub { my ($text) = @_; $inner_body = $inner_body . $text; }, "text" );
        $self->handler(
            end => sub {
                my ( $endtagname, $self, $text ) = @_;
                if ( $endtagname eq $tagname ) {
                    $self->eof;
                }
                else {
                    $inner_body = $inner_body . $text;
                }
            },
            "tagname,self,text"
        );
    }

    print $inner_body;
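
A note on why the variable looks empty outside the handlers, going only by the code as posted: `exit;` runs before the final `print`, and `start_handler` declares its own `my $inner_body`, so the closures append to that inner variable rather than to the file-level one. The standalone sketch below illustrates the shadowing effect without HTML::Parser. The names `$buffer`, `shadowing_handler` and `shared_handler` are made up for illustration and do not appear in the original post.

```perl
use strict;
use warnings;

# File-level accumulator, playing the role of the outer $inner_body.
my $buffer = '';

# Declares a second lexical with the same name. Appends land in the
# inner variable, which disappears when the sub returns.
sub shadowing_handler {
    my ($text) = @_;
    my $buffer = '';
    $buffer .= $text;
    return $buffer;
}

# No inner declaration, so this appends to the file-level variable.
sub shared_handler {
    my ($text) = @_;
    $buffer .= $text;
    return $buffer;
}

my $inside_shadowing = shadowing_handler('<p>hi</p>');
my $after_shadowing  = $buffer;    # outer variable is still untouched

shared_handler('<p>hi</p>');
my $after_shared = $buffer;        # outer variable now holds the text

print "inside shadowing sub: '$inside_shadowing'\n";
print "outer after shadowing: '$after_shadowing'\n";
print "outer after shared:    '$after_shared'\n";
```

Dropping the inner `my` (and the early `exit;`) is the smallest change that lets the handlers and the final `print` see the same variable.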

Re: html::parse inner body html
by tangent (Parson) on May 29, 2013 at 21:44 UTC
    I'm not exactly sure what you want to end up with but this script will capture all the raw text within the body tags:
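    use strict;
    use warnings;
    use HTML::Parser;

    # $content is assumed to hold the HTML to parse, as in the original post.
    my $inner_body = '';    # accumulates the text found inside <body>
    my $in_body    = 0;     # true while the parser is between <body> and </body>

    my $Parser = HTML::Parser->new(
        api_version => 3,
        handlers    => [
            start => [ \&start_handler, "tagname" ],
            text  => [ \&text_handler,  "text" ],
            end   => [ \&end_handler,   "tagname" ],
        ],
    );
    $Parser->parse($content);
    $Parser->eof();
    print $inner_body;

    # Switch capturing on when <body> opens.
    sub start_handler {
        my $tagname = shift;
        return unless ( $tagname eq 'body' );
        $in_body = 1;
    }

    # Append text only while inside the body.
    sub text_handler {
        my $text = shift;
        return unless $in_body;
        $inner_body .= $text;
    }

    # Switch capturing off when </body> closes.
    sub end_handler {
        my $tagname = shift;
        return unless ( $tagname eq 'body' );
        $in_body = 0;
    }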
      Hi, I was looking to get the raw text with all the HTML tags included, excluding only the body tags themselves. Is that possible?
        There's probably a simpler way to do that but this will do what you ask:
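        use strict;
        use warnings;
        use HTML::Parser;

        # $content is assumed to hold the HTML to parse, as in the original post.
        my $inner_body = '';    # accumulates the markup found inside <body>
        my $in_body    = 0;     # true while the parser is between <body> and </body>

        my $Parser = HTML::Parser->new(
            api_version => 3,
            handlers    => [
                start => [ \&start_handler, 'tagname,text' ],
                text  => [ \&text_handler,  "text" ],
                end   => [ \&end_handler,   "tagname,text" ],
            ],
        );
        $Parser->parse($content);
        $Parser->eof();
        print $inner_body;

        # <body> itself only flips the flag; any other start tag seen
        # inside the body is appended as its original source text.
        sub start_handler {
            my $tagname = shift;
            if ( $tagname eq 'body' ) {
                $in_body = 1;
                return;
            }
            return unless $in_body;
            my $text = shift;
            $inner_body .= $text;
        }

        # Append text only while inside the body.
        sub text_handler {
            my $text = shift;
            return unless $in_body;
            $inner_body .= $text;
        }

        # </body> only clears the flag; other end tags are appended.
        sub end_handler {
            my $tagname = shift;
            if ( $tagname eq 'body' ) {
                $in_body = 0;
                return;
            }
            return unless $in_body;
            my $text = shift;
            $inner_body .= $text;
        }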
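
For anyone who wants to try the flag-based approach from this thread in isolation, here is a minimal self-contained sketch. It is one possible arrangement, not the only way to do it: the helper name `inner_body_html` and the sample document are invented for illustration, and the handlers are passed with the `start_h`/`text_h`/`end_h` constructor options instead of the `handlers` list used above. Both the flag and the accumulator are lexicals inside the helper, so the closures and the `return` all refer to the same variables.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the markup between <body> and </body>,
# without the body tags themselves.
sub inner_body_html {
    my ($html) = @_;

    my $inner_body = '';    # shared by all three closures below
    my $in_body    = 0;     # true while between <body> and </body>

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ( $tagname, $text ) = @_;
                if ( $tagname eq 'body' ) { $in_body = 1; return; }
                $inner_body .= $text if $in_body;
            },
            'tagname,text',
        ],
        text_h => [
            sub { $inner_body .= $_[0] if $in_body },
            'text',
        ],
        end_h => [
            sub {
                my ( $tagname, $text ) = @_;
                if ( $tagname eq 'body' ) { $in_body = 0; return; }
                $inner_body .= $text if $in_body;
            },
            'tagname,text',
        ],
    );

    $parser->parse($html);
    $parser->eof;
    return $inner_body;
}

# Made-up sample input for illustration.
my $sample = '<html><head><title>t</title></head>'
           . '<body class="x"><p>Hello <b>world</b></p></body></html>';

print inner_body_html($sample), "\n";
```

Only start, text and end events are handled here, so comments, declarations and processing instructions inside the body are not copied to the result. If those matter, a `default_h` handler with the `text` argspec could append them as well.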