locust has asked for the wisdom of the Perl Monks concerning the following question:

Hey Monks!

I have a string that contains an HTML file. What I'd like to do is first decode any HTML entities contained in the text (only!) and then encode the text with entities that I can specify. What I want returned is the entire string, in the same order that it was in, with just the text encoded with HTML entities.

I have assumed that using a combination of HTML::Parser and HTML::Entities is the best way to achieve my goal, but if you have a better way, then let me hear it.

Anyhow, anyone know how to do this? I don't have much experience with HTML::Parser, and the documentation is not really clear to me on how to do this.

Thanks

Update

I used the HTML::TokeParser::Simple module and HTML::Entities to get the solution:

    use HTML::Entities;
    use HTML::TokeParser::Simple;

    my $html = <some file>;  # this is shorthand for the example: assume the file has been opened in slurp mode
    my $parsed = parseHTML($html);

    sub parseHTML {
        my $html = shift;
        my $parsed;
        my $p = HTML::TokeParser::Simple->new(\$html);
        while ( my $token = $p->get_token ) {
            # This prints all text in an HTML doc (i.e., it strips the HTML)
            if ($token->is_text) {
                my $text = $token->as_is;
                encode_entities($text, '",' );
                $parsed .= $text;
            }
            else {
                $parsed .= $token->as_is;
            }
        }
        return $parsed;
    }
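The code above only encodes; it skips the decode step described in the question. A minimal sketch of a decode-then-encode variant might look like the following. The function name reencode_text and the default set of characters to encode are assumptions made for illustration, not part of the original solution. One caveat: text inside script and style elements also comes back as text tokens, so a real document may need those skipped.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities);
use HTML::TokeParser::Simple;

# Hypothetical helper: re-encode only the text tokens of an HTML string.
# $unsafe is the set of characters to turn into entities
# (assumed default for this sketch: < > & ").
sub reencode_text {
    my ($html, $unsafe) = @_;
    $unsafe = '<>&"' unless defined $unsafe;

    my $out = '';
    my $p   = HTML::TokeParser::Simple->new(\$html);

    while ( my $token = $p->get_token ) {
        if ( $token->is_text ) {
            # Decode whatever entities are already there, then encode
            # only the characters listed in $unsafe.
            my $text = decode_entities( $token->as_is );
            $out .= encode_entities( $text, $unsafe );
        }
        else {
            # Tags, comments, declarations etc. pass through untouched.
            $out .= $token->as_is;
        }
    }
    return $out;
}

print reencode_text('<p class="x">Fish &amp; "chips" &lt; 5</p>'), "\n";
```

Decoding first means text that was already encoded (such as &amp;) is not double-encoded into &amp;amp;, while the quotes inside the tag's attributes are left alone because non-text tokens are copied verbatim.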

Thanks!

Re: How to use HTML::Parser to encode text with HTML entities?
by Your Mother (Archbishop) on Dec 01, 2010 at 18:09 UTC

        It wasn't given here as the way to do stripping but as an approach to simple parsing with custom tags toward any end.