How to use HTML::Parser to encode text with HTML entities?

locust has asked for the wisdom of the Perl Monks concerning the following question:

Hey Monks!

I have a string that contains an HTML file. What I'd like to do is first decode any HTML entities contained in the text (only!) and then encode the text with entities that I can specify. What I want returned is the entire string, in the same order that it was in, with just the text encoded with HTML entities.

I have assumed that using a combination of HTML::Parser and HTML::Entities is the best way to achieve my goal, but if you have a better way, then let me hear it.

Anyhow, anyone know how to do this? I don't have much experience with HTML::Parser, and the documentation is not really clear to me on how to do this.

Thanks

Update

I used the HTML::TokeParser::Simple module and HTML::Entities to get the solution:

use HTML::Entities;
use HTML::TokeParser::Simple;

# Slurp the whole file into one string ('some_file.html' is a placeholder name).
my $html = do {
    open my $fh, '<', 'some_file.html' or die "Cannot open file: $!";
    local $/;
    <$fh>;
};

my $parsed = parseHTML($html);

sub parseHTML {
    my $html = shift;
    my $parsed = '';    # start empty so empty input returns '' rather than undef
    my $p = HTML::TokeParser::Simple->new(\$html);

    while ( my $token = $p->get_token ) {
        # Encode the chosen characters in text tokens only; tags and all
        # other tokens are appended unchanged.
        if ($token->is_text) {
            my $text = $token->as_is;
            encode_entities($text, '",' );
            $parsed .= $text;
        } else {
            $parsed .= $token->as_is;
        }
    }

    return $parsed;
}

Thanks!
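The question asks for existing entities to be decoded before the text is re-encoded. Here is a minimal sketch of why that order matters, using only HTML::Entities. The helper name reencode and the sample string are made up for illustration. Note that once text has been decoded, the set of characters to encode should include <, > and &, otherwise a decoded &lt; or &amp; would end up as raw markup in the output.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities);

# Hypothetical helper: decode whatever entities are already present,
# then encode only the characters listed in $unsafe.
sub reencode {
    my ($text, $unsafe) = @_;
    my $decoded = decode_entities($text);       # non-void context: returns a decoded copy
    return encode_entities($decoded, $unsafe);  # non-void context: returns an encoded copy
}

my $sample = 'AT&amp;T';

# Encoding without decoding first escapes the ampersand a second time.
my $double = encode_entities($sample, '<>&"');

# Decoding first leaves a single level of escaping.
my $single = reencode($sample, '<>&"');

print "$double\n$single\n";
```

With the '",' set used in the update above, double encoding cannot happen because & is not in the set, but any entities already in the text are then left as they were rather than normalized.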

Re: How to use HTML::Parser to encode text with HTML entities? by Your Mother (Archbishop) on Dec 01, 2010 at 18:09 UTC
"Re: Strip HTML tags again" and my follow-up to it are probably exactly what you want. They use HTML::TokeParser::Simple to parse and HTML::Tagset to decide what is "text" and what is supposed to be a tag. HTML::Parser is a bit bare-bones and I wouldn't recommend it over the TokeParser modules (or HTML::TreeBuilder).
Re^2: How to use HTML::Parser to encode text with HTML entities? by Anonymous Monk on Dec 01, 2010 at 21:34 UTC
Quoting: "and my follow-up are probably exactly what you want."
That is a naive approach: it treats the text between script/style tags as plain text. See http://cpansearch.perl.org/src/GAAS/HTML-Parser-3.68/eg/htext and HTML::StripScripts, "HTML stripper...", "strip HTML tags".
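For the original question, one way to avoid the script/style problem with HTML::Parser itself is to ask for the is_cdata flag in the text handler and leave such text alone. What follows is only a sketch under assumptions: the function name reencode_text and its $unsafe parameter are invented here, and it is not taken from the htext example linked above.

```perl
use strict;
use warnings;
use HTML::Parser ();
use HTML::Entities qw(decode_entities encode_entities);

# Hypothetical helper: returns the document with entities re-encoded in
# ordinary text only. Markup, comments and script/style bodies pass through.
sub reencode_text {
    my ($html, $unsafe) = @_;
    my $out = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Text events: is_cdata is true inside <script>, <style> and
        # CDATA sections, where entities must not be touched.
        text_h => [
            sub {
                my ($text, $is_cdata) = @_;
                unless ($is_cdata) {
                    decode_entities($text);            # in place (void context)
                    encode_entities($text, $unsafe);   # in place (void context)
                }
                $out .= $text;
            },
            'text, is_cdata',
        ],
        # Every other event (tags, comments, declarations) is copied as is.
        default_h => [
            sub {
                my ($text) = @_;
                $out .= $text if defined $text;
            },
            'text',
        ],
    );

    $p->unbroken_text(1);   # deliver each text run as a single chunk
    $p->parse($html);
    $p->eof;
    return $out;
}

print reencode_text('<p>x &amp; "y"</p><script>a < b && c</script>', '<>&"'), "\n";
```

Because the text is decoded first, the $unsafe set passed in should include <, > and &, so that decoded characters are escaped again on the way out.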
Re^3: How to use HTML::Parser to encode text with HTML entities? by Your Mother (Archbishop) on Dec 01, 2010 at 22:22 UTC
It wasn't given here as the way to do stripping but as an approach to simple parsing with custom tags toward any end.