Re: Form Security

You should look closely at HTML::Scrubber and HTML::Strip. Solving this yourself with regexes is difficult and error-prone. It's also fairly easy to build your own on top of something like XML::LibXML if you want to. Here's something to play with:

use warnings;
use strict;
use XML::LibXML;

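# The HTML parser lowercases element names, so lowercase the arguments to match.
my @strip = map { lc } @ARGV;
@strip or die "Give a list of tags to strip.\n";
# The names are interpolated into an XPath expression below; accept plain tag names only.
/\A[a-z][a-z0-9]*\z/ or die "Not a valid tag name: $_\n" for @strip;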

my $parser = XML::LibXML->new();
$parser->recover(1);
$parser->keep_blanks(1);
$parser->line_numbers(1);

my $raw = join '', <DATA>;
my $doc = $parser->parse_html_string($raw);

my $root = $doc->documentElement();

for my $strip ( @strip )
{
    for my $node ( $root->findnodes("//$strip") )
    {
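        # Move the children into a fragment, then swap the fragment in
        # for the element: the tag goes, its content stays.
        my $fragment = $doc->createDocumentFragment();
        $fragment->appendChild($_) for $node->childNodes;
        $node->replaceNode($fragment);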
    }
}

# entire HTML doc: print $doc->serialize(1);

# node() rather than *, so text left directly under body after stripping is printed too
print $_->serialize(1) for $doc->findnodes("//body/node()");

__END__
<div>
    <h1>Bang!<sup>1</sup></h1>
    <p>Did <i>italic</i> and <a href="/uri">link with <b>bold</b>
       inside it</a>.</p>
<script type="whatever/whatnot">doSomethingTerrible()</script>
 <a href="/top-level">naked link</a>
    <p><i>The</i> <b>content</b> of the body <sup>element</sup> is
       displayed in your <span>browser</span>.</p>
</div>

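The nice thing about this snippet is that it removes only the disallowed tags, not the content within them. Keep in mind that this holds for every tag you strip, so the body of a stripped script element stays in the output too.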
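
For comparison, here is a minimal sketch of the whitelist approach with HTML::Scrubber, assuming the module is installed from CPAN. The allowed tag list and the sample markup are just made up for illustration. If I remember right, HTML::Scrubber also drops the contents of script and style elements by default, but check its docs before relying on that.

```perl
use strict;
use warnings;
use HTML::Scrubber;

# Whitelist: only these elements survive; no attributes are allowed on them.
# (The list itself is an arbitrary example, adjust to taste.)
my $scrubber = HTML::Scrubber->new( allow => [qw( p b i sup )] );

# Sample input, made up for this sketch.
my $html = '<p>Did <i>italic</i> and <a href="/uri">link with <b>bold</b></a>.</p>'
         . '<script type="whatever/whatnot">doSomethingTerrible()</script>';

# scrub() returns the cleaned-up markup as a string.
my $clean = $scrubber->scrub($html);
print "$clean\n";
```

The difference in design is that you list what is allowed rather than what is forbidden, so a tag you forgot about gets removed instead of slipping through.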
