Re: HTML::TokeParser not stripping entities and xhtml

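HTML::TokeParser is great, but you can save yourself some pain by switching to HTML::TokeParser::Simple, which hands you each token as an object with methods such as is_tag and as_is instead of a bare array reference. If you just want to strip out the malformed closing tags (the </tr />, </tbody /> and </table /> in your sample), here's a first try: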

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $text = '</tr />
</tbody />
</table /></p>
<p>We have&nbsp;different groups to help you through the buying process.&nbsp;Our team of counselors and volunteers can provide transportation, and childcare.&nbsp;</p>
<p>&nbsp;</p>';
my $result = '';
my $p = HTML::TokeParser::Simple->new(\$text);

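while ( my $token = $p->get_token ) {
    # as_is returns the token exactly as it appeared in the source
    my $raw = $token->as_is;
    if ($token->is_tag) {
        # skip closing tags that are also self-closed, e.g. '</tr />'
        next if $raw =~ /^<\/.*\/>$/;
    }
    # everything else (text, entities, well-formed tags) passes through untouched
    $result .= $raw;
}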

print $result;
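
If you want to see what that pattern does and doesn't catch before running it over real data, here's a rough standalone check. The sub name is_malformed_end_tag is just something I made up for illustration (it is not part of HTML::TokeParser::Simple), and the regex is the same one used in the loop above:

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper, not part of any module: returns true when the raw
# tag text starts with '</' and ends with '/>', i.e. a closing tag that
# has also been written as self-closing, such as '</tr />'.
sub is_malformed_end_tag {
    my ($raw) = @_;
    return $raw =~ /^<\/.*\/>$/ ? 1 : 0;
}

# Try the pattern against a few raw tags and report what would happen.
for my $tag ( '</tr />', '</table />', '</p>', '<br />', '<p>' ) {
    printf "%-12s %s\n", $tag, is_malformed_end_tag($tag) ? 'drop' : 'keep';
}
```

Note that legitimate self-closing tags like <br /> and ordinary closing tags like </p> are left alone, since the pattern needs both the leading '</' and the trailing '/>'. It also leaves entities such as &nbsp; as they are.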

Cheers,
Ovid

