Manipulating plaintext within HTML

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to run s/// on the plaintext in a variable containing HTML. If I naively do $html =~ s/PERL/Perl/g, lots of links and HTML elements get broken. I just want the text to be replaced.

I looked on CPAN and found HTML::FormatText, which converts the HTML to plaintext. But it won't convert back to HTML after I do the substitution. Can I use HTML::Parser, HTML::TokeParser, HTML::TokeParser::Simple, or HTML::TreeBuilder to identify the plaintext, manipulate it how I see fit, and reassemble the structure into the original HTML, with plaintext modifications? If so, how?

I looked at Sean M. Burke's article in TPJ, and it mentions the as_text() method, but I'm at my wits' end as to how to put the modified text back into the HTML document. Can someone provide an example? I just want to apply a regex to the plain text portion of $html. How can I do this? Thanks in advance, -jc


Re: Manipulating plaintext within HTML by Ovid (Cardinal) on Jun 26, 2003 at 19:37 UTC
If you use HTML::TokeParser::Simple, just pass a reference to the variable to the constructor.

```perl
use HTML::TokeParser::Simple;

# $html is the variable from the question, holding the HTML to process.
my $parser   = HTML::TokeParser::Simple->new(\$html);
my $new_html = '';

# Rebuild the document token by token, changing only the text tokens.
while (my $token = $parser->get_token) {
    my $text = $token->as_is;
    $new_html .= $token->is_text ? munge_text($text) : $text;
}
print $new_html;

sub munge_text {
    my ($text) = @_;
    # put your text munging stuff here, e.g. $text =~ s/PERL/Perl/g;
    return $text;
}
```

Cheers, Ovid
Re: Re: Manipulating plaintext within HTML by Anonymous Monk on Jun 26, 2003 at 20:33 UTC
Thanks, that's perfect. :) Your module is pretty nifty. (If anyone is curious, I'm matching arbitrary English words with the regex (?<!&)([a-zA-Z']+). The look-behind assertion is meant to skip HTML entities, leaving only true English words.) Thanks again, -jc
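A quick sanity check of the look-behind idea, offered only as a sketch. Assuming the character class is meant to be [a-zA-Z']+, the look-behind only blocks a match that starts directly after the &, so the regex engine can still start one character later and match the tail of an entity name (the "mp" in &amp;). One possible workaround is to match whole entities first and pass them through unchanged. The helper name munge_words and its callback argument are hypothetical, not part of any module mentioned in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: apply $callback to each word in a text token,
# leaving HTML entities such as &amp; or &#169; untouched.
sub munge_words {
    my ($text, $callback) = @_;
    $text =~ s{
        ( & \#? \w+ ; )     # an entity: consumed first and kept as-is
      | ( [a-zA-Z']+ )      # otherwise a run of letters and apostrophes
    }{ defined $1 ? $1 : $callback->($2) }gex;
    return $text;
}

# The look-behind version can still match inside an entity name:
my @naive = 'AT&amp;T' =~ /(?<!&)([a-zA-Z']+)/g;
print "@naive\n";    # prints "AT mp T"

# Entities survive, words are passed to the callback:
print munge_words('PERL &amp; perl &AMP; done', sub { uc $_[0] }), "\n";
# prints "PERL &amp; PERL &AMP; DONE"
```

Something like munge_words could be called from munge_text in the token loop above, so that only text tokens are ever touched and entities inside them are left alone.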