There have been many questions posted to PerlMonks recently asking about cleaning up HTML (either removing specific tags, or removing all tags except for a given few). Most people respond with one of two suggestions:
- a regular expression - this may not work, because > and < can appear in the HTML in places other than tag delimiters (for example, inside a quoted attribute value or a comment)
- advice to check out HTML::Parser (on CPAN) and use it as the basis for solving the problem
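To make the first point concrete, here is a minimal sketch of the usual regex approach going wrong. The `naive_strip` helper and the sample markup are made up for illustration; they are not taken from any particular answer.

```perl
use strict;
use warnings;

# Hypothetical helper: the "obvious" regex, which deletes everything
# from a '<' up to the next '>'.
sub naive_strip {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;
    return $text;
}

# The alt attribute contains a literal '>', so the regex ends the
# "tag" too early and leaves the tail of the attribute behind.
my $html = '<img src="a.png" alt="x > y">Hello';
print naive_strip($html), "\n";    # prints: ' y">Hello' (without the single quotes)
```

A real parser tracks quoting state, which is why the second suggestion tends to hold up better.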
As always, any comments, criticism, or advice on doing this better are appreciated.
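    package HTML::Sanitizer;
    use strict;
    use warnings;
    require HTML::Filter;
    our @ISA = qw(HTML::Filter);

    my $data = '';                              # collects the sanitized output
    my %keep = ( a => 1, p => 1, img => 1 );    # tags that are allowed through

    # HTML::Filter hands output() the source text of each parsed piece.
    sub output {
        my $self = shift;
        my $d    = $_[0];
        # Anchored so that only a piece that *is* a tag gets checked.
        if ( $d =~ /^<\s*\/?\s*(\w+)/ ) {
            $data .= $d if exists $keep{ lc $1 };
        }
        else {
            $data .= $d;                        # text, comments, declarations
        }
    }

    my $p = HTML::Sanitizer->new();
    $p->parse_file("index.html") or die "Cannot open index.html: $!";
    print $data;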
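For comparison, the same idea can be sketched against HTML::Parser's version 3 handler interface, without subclassing HTML::Filter. This is only a sketch under stated assumptions: the `sanitize_html` function name and the sample input are hypothetical, and the tag whitelist is passed in rather than hard-coded.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return $html with every tag removed except
# those named in @keep. Text, comments and declarations pass through.
sub sanitize_html {
    my ( $html, @keep ) = @_;
    my %keep = map { lc($_) => 1 } @keep;
    my $out  = '';

    # Start and end tags are copied only when the tag name is whitelisted.
    my $tag_handler = sub {
        my ( $tagname, $text ) = @_;
        $out .= $text if $keep{$tagname};
    };

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $tag_handler, 'tagname, text' ],
        end_h       => [ $tag_handler, 'tagname, text' ],
        default_h   => [ sub { $out .= shift }, 'text' ],
    );
    $p->parse($html);
    $p->eof;    # flush any buffered text
    return $out;
}

print sanitize_html( '<p>Hi <b>there</b> <a href="x">link</a></p>', qw(a p) ), "\n";
```

Because the parser reports the tag name itself, no regex is needed to decide what counts as a tag. Kept tags retain all of their attributes, so this removes unwanted elements but does not filter attributes.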