Parsing with HTML::Parser

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Parsing with HTML::Parser by valdez (Monsignor) on Feb 21, 2003 at 10:04 UTC
May I suggest using HTML::TokeParser::Simple, written by Ovid?

    # grabbed from man page
    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new( $somefile );

    while ( my $token = $p->get_token ) {
        # This prints all text in an HTML doc (i.e., it strips the HTML)
        next unless $token->is_text;
        print $token->as_is;
    }

Ciao, Valerio
Re: Parsing with HTML::Parser by steves (Curate) on Feb 21, 2003 at 10:28 UTC
Here's an old sub I have that does that, with a flag to optionally remove newlines.

    use HTML::Parser;

    sub text_only {
        my $content     = shift;
        my $rm_newlines = shift;
        my %inside;
        my $text;

        # Track nesting depth per tag and leave a space where each tag was.
        my $tag = sub {
            my ($tag_name, $num) = @_;
            $inside{$tag_name} += $num;
            $text .= " ";
        };

        # Keep text unless we are inside <script> or <style>.
        my $get_text = sub {
            $text .= $_[0] if ( !$inside{script} && !$inside{style} );
        };

        my $parser = HTML::Parser->new(
            handlers => [
                start => [$tag,      "tagname, '+1'"],
                end   => [$tag,      "tagname, '-1'"],
                text  => [$get_text, "dtext"],
            ],
            marked_sections => 1,
        ) or die "Failed to create HTML::Parser object\n";

        $parser->parse($content);
        $parser->eof;    # flush any text still buffered by the parser

        $text =~ s/[\n\r]/ /g if ($rm_newlines);
        return $text;
    }
Re: Parsing with HTML::Parser by thunders (Priest) on Feb 21, 2003 at 17:06 UTC
That task seems like it might be handled better with a more specific tool than HTML::Parser. Both HTML::TagFilter and HTML::CGIChecker offer functions for removing unwanted HTML tags, and both offer simple ways to let certain safe tags through if you want that.
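For anyone who wants to see the allow-list idea before reaching for one of those modules, here is a minimal sketch written against HTML::Parser itself. It is not the API of HTML::TagFilter or HTML::CGIChecker (neither is shown in this thread). The sub name `strip_tags_except` is made up for illustration, and the choice to drop all attributes from the tags that are kept is an assumption of this sketch, made so that things like `onclick` cannot slip through.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not from the thread): remove every tag except those
# on the allow-list, and drop the contents of <script> and <style>.
# Allowed tags are re-emitted without their attributes.
sub strip_tags_except {
    my ($html, @allowed) = @_;
    my %allow = map { lc($_) => 1 } @allowed;
    my %inside = (script => 0, style => 0);    # nesting depth
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname) = @_;
            $inside{$tagname}++ if exists $inside{$tagname};
            $out .= "<$tagname>" if $allow{$tagname};
        }, 'tagname' ],
        end_h => [ sub {
            my ($tagname) = @_;
            $inside{$tagname}-- if exists $inside{$tagname} && $inside{$tagname} > 0;
            $out .= "</$tagname>" if $allow{$tagname};
        }, 'tagname' ],
        # 'text' is the raw text, so entities such as &amp; stay encoded.
        text_h => [ sub {
            my ($text) = @_;
            $out .= $text unless $inside{script} || $inside{style};
        }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;    # flush any text still buffered by the parser
    return $out;
}
```

Calling `strip_tags_except($html, 'b', 'i')` would keep bare `<b>` and `<i>` tags and the text, and strip everything else. For untrusted input, a maintained module is still the safer choice; this only shows the mechanism.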
Re: Parsing with HTML::Parser by cees (Curate) on Feb 21, 2003 at 14:41 UTC
The above answers are excellent, but if you want to present this text in a cleanly formatted way, you might want to look at some of the html2txt programs that are floating around on the internet. Debian has one prepackaged that I have used, and it works quite well. Of course, this is useless if you just want the data and don't care about the formatting.
Try HTML::TokeParser::Simple by xtype (Deacon) on Feb 23, 2003 at 01:51 UTC
I am a little surprised that no one has suggested this before me.

    use LWP::Simple qw($ua get head);
    use HTML::TokeParser::Simple;

    # The returns below assume this code sits inside a sub.
    my $webpage = "http://some-url.com";
    $ua->timeout(30);

    my ($html, $parsed_html);
    if (head($webpage)) {
        # Parenthesize the call: "get $webpage || return 0" would parse
        # as get($webpage || return 0).
        $html = get($webpage) or return 0;
    }
    else {
        return 0;
    }

    my $p = HTML::TokeParser::Simple->new( \$html );
    while ( my $token = $p->get_token ) {
        next unless $token->is_text;
        $parsed_html .= $token->as_is;
    }

update: Whoops, guess I did not read the first post completely. I posted nearly the same code.