Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I'm looking for some help on how to use HTML::Parser. My goal is quite straightforward: I simply want to strip all HTML tags from my scalar $html, leaving just the plain text in $parsed_html. A helping hand would be greatly appreciated. Regards, Johan.

Replies are listed 'Best First'.
Re: Parsing with HTML::Parser
by valdez (Monsignor) on Feb 21, 2003 at 10:04 UTC

    May I suggest using HTML::TokeParser::Simple, written by Ovid?

    # grabbed from man page
    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new( $somefile );

    while ( my $token = $p->get_token ) {
        # This prints all text in an HTML doc (i.e., it strips the HTML)
        next unless $token->is_text;
        print $token->as_is;
    }

    Ciao, Valerio
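
    The manual-page example above reads from a file, while the question starts from a scalar. A minimal sketch of the same loop applied to a string follows, assuming a made-up sample value for $html. It also assumes you want entities such as &amp; decoded, which as_is does not do by itself, so HTML::Entities is pulled in for that step.

    ```perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;
    use HTML::Entities qw(decode_entities);

    # Hypothetical sample input standing in for the asker's $html.
    my $html = '<p>Fish &amp; chips, <b>hot</b>!</p>';

    # Passing a reference to a scalar makes the parser read the string
    # itself instead of treating the argument as a filename.
    my $p = HTML::TokeParser::Simple->new( \$html );

    my $parsed_html = '';
    while ( my $token = $p->get_token ) {
        next unless $token->is_text;
        # as_is returns the raw text, so decode entities here (an assumption
        # about what the asker wants; drop this call to keep them as written).
        $parsed_html .= decode_entities( $token->as_is );
    }

    print "$parsed_html\n";
    ```

    Note that this keeps the text inside script and style elements as well; skip those tokens separately if that matters for your pages.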

Re: Parsing with HTML::Parser
by steves (Curate) on Feb 21, 2003 at 10:28 UTC

    Here's an old sub I have that does that, with a flag to optionally remove newlines.

    use HTML::Parser;

    sub text_only {
        my $content     = shift;
        my $rm_newlines = shift;
        my %inside;
        my $text = '';

        # Track nesting depth per tag and leave a space where each tag was.
        my $tag = sub {
            my ($tag_name, $num) = @_;
            $inside{$tag_name} += $num;
            $text .= " ";
        };

        # Collect decoded text, skipping the contents of <script> and <style>.
        my $get_text = sub {
            $text .= $_[0] if ( !$inside{script} && !$inside{style} );
        };

        my $parser = HTML::Parser->new(
            handlers => [
                start => [$tag,      "tagname, '+1'"],
                end   => [$tag,      "tagname, '-1'"],
                text  => [$get_text, "dtext"],
            ],
            marked_sections => 1,
        ) or die "Failed to create HTML::Parser object\n";

        $parser->parse($content);
        $parser->eof;    # flush any text still buffered by the parser
        $text =~ s/[\n\r]/ /g if ($rm_newlines);
        return $text;
    }
Re: Parsing with HTML::Parser
by thunders (Priest) on Feb 21, 2003 at 17:06 UTC

    That task seems like it might be handled better with a more specific tool than HTML::Parser.

    Both HTML::TagFilter and HTML::CGIChecker offer functions for removing unwanted HTML tags, and both offer simple ways to let certain safe tags through if you want that.

Re: Parsing with HTML::Parser
by cees (Curate) on Feb 21, 2003 at 14:41 UTC

    The above answers are excellent, but if you want to present this text in a cleanly formatted way, you might want to look at some of the html2txt programs that are floating around on the internet. Debian has one prepackaged that I have used, and it works quite well.

    Of course this is useless if you just want the data, and don't care about the formatting.
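
    If you would rather stay inside Perl than call an external html2txt program, one option is HTML::FormatText together with HTML::TreeBuilder. This is a different tool from the programs mentioned above, shown only as a rough sketch; the sample HTML and the margin values are assumptions chosen for illustration.

    ```perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use HTML::FormatText;

    # Hypothetical sample input standing in for the asker's $html.
    my $html = '<h1>Menu</h1><p>Today we serve:</p>'
             . '<ul><li>first item</li><li>second item</li></ul>';

    # Build a parse tree from the string, then render it as wrapped text.
    my $tree      = HTML::TreeBuilder->new_from_content($html);
    my $formatter = HTML::FormatText->new( leftmargin => 0, rightmargin => 60 );
    my $text      = $formatter->format($tree);

    # Free the tree explicitly once the text has been produced.
    $tree->delete;

    print $text;
    ```

    Unlike the token-based approaches, this keeps some layout (headings, paragraphs, list items), which is the point when the text is meant to be read rather than processed.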

Try HTML::TokeParser::Simple
by xtype (Deacon) on Feb 23, 2003 at 01:51 UTC
    I am a little surprised that no one has suggested this before me.
    use LWP::Simple qw($ua get head);
    use HTML::TokeParser::Simple;

    my $webpage = "http://some-url.com";
    $ua->timeout(30);

    # Check that the page is reachable, then fetch it.
    head($webpage) or die "Cannot reach $webpage\n";
    my $html = get($webpage);
    defined $html or die "Cannot fetch $webpage\n";

    # Keep only the text tokens.
    my $p = HTML::TokeParser::Simple->new( \$html );
    my $parsed_html = '';
    while ( my $token = $p->get_token ) {
        next unless $token->is_text;
        $parsed_html .= $token->as_is;
    }
    Update: Whoops, I guess I did not read the first post completely. I posted nearly the same code.