There have been many questions posted to PerlMonks recently asking about cleaning up HTML (either removing specific tags, or removing all tags except for a given few). Most people respond with one of two suggestions:
  1. a regular expression - this may fail, because < and > can legitimately appear inside the HTML itself (for example, within a quoted attribute value or a comment)
  2. advice to check out <cpan://HTML::Parser> and use it as the basis for solving the problem
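To make the first point concrete, here is a minimal sketch of how a naive tag-stripping regex goes wrong. The sub name strip_tags_naive and the sample HTML are made up for illustration only.

```perl
use strict;
use warnings;

# Naive tag stripper (hypothetical helper): assumes every tag ends at the
# first ">" that follows its opening "<".
sub strip_tags_naive {
    my ($html) = @_;
    ( my $text = $html ) =~ s/<[^>]*>//g;
    return $text;
}

# The ">" inside the quoted alt attribute ends the match too early,
# so the rest of the <img> tag is left behind in the output.
my $html   = '<p>See <img alt="a > b" src="x.png"> here</p>';
my $result = strip_tags_naive($html);
print "$result\n";
```

The regex behaves on simple input such as '<b>hi</b>', which is exactly why it looks tempting; it only breaks once real-world markup shows up.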
I've decided to delve in and write some code that serves as an example of how to properly filter unwanted HTML tags out of a document. I actually use <cpan://HTML::Filter>, which is distributed with <cpan://HTML::Parser>. My code uses a hash of tags to keep; it could easily be adapted to work with a hash of tags to drop instead.

As always, any comments, criticism, or advice on doing this better are appreciated.

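package HTML::Sanitizer;
use strict;
use warnings;
require HTML::Filter;
our @ISA = qw(HTML::Filter);

my $data = '';
my %keep = ( a => 1, p => 1, img => 1 );    # tags to keep

# HTML::Filter hands every piece of the document to output()
sub output {
    my ( $self, $d ) = @_;

    # Only tags start with "<" or "</" followed by a name
    if ( $d =~ m{^</?(\w+)} ) {
        $data .= $d if exists $keep{ lc $1 };
    }
    else {
        $data .= $d;    # text and the rest pass through
    }
}

my $p = HTML::Sanitizer->new();
$p->parse_file("index.html");
print $data;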
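For the "hash of tags to drop" variant mentioned above, a minimal sketch might look like the following. It assumes HTML::Filter is installed (it ships with the HTML::Parser distribution); the package name HTML::TagDropper, the contents of %drop, and the sample string are all hypothetical and only there for illustration.

```perl
package HTML::TagDropper;    # hypothetical name, not an existing module
use strict;
use warnings;
require HTML::Filter;
our @ISA = qw(HTML::Filter);

my $data = '';
my %drop = map { $_ => 1 } qw(font blink center);    # illustrative drop list

# HTML::Filter calls output() with the original text of each tag,
# text run, comment, and so on.
sub output {
    my ( $self, $chunk ) = @_;

    # Skip start and end tags whose name is in the drop list
    return if $chunk =~ m{^</?(\w+)} && exists $drop{ lc $1 };

    $data .= $chunk;    # everything else is kept
}

my $p = HTML::TagDropper->new();
$p->parse('<p>Some <font color="red">red</font> <BLINK>text</BLINK></p>');
$p->eof;
print "$data\n";
```

Note that, like the keep-list version, this only removes the tags themselves; any text between a dropped start tag and its end tag stays in the output.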

In reply to HTML Sanitizer (removes unwanted tags) by lhoward
