Should I just use HTML::Parser and shut up?

Yes!

Here is a filter example to get you going. It's really quite easy once you get your head around how it works. I find the POD a little obscure, but there are some good tutorials out there.

You should easily see how we check each opening and closing tag and add it if it is on the OK list: the parser calls &start for opening tags and &end for closing tags. Similarly, the parser calls &text for the text in between, and we add that text only if the most recent opening tag was on the OK list (that is what the $want_it flag tracks). If you just want the text, just don't add the tags. What could be easier?

#!/usr/bin/perl -w

package Filter;
use strict;
use base 'HTML::Parser';

my ($filter, $want_it);
my @ok_tags = qw ( h1 h2 h3 h4 p br );
my %ok_tags;
$ok_tags{$_}++ for @ok_tags;
 
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    if ( exists $ok_tags{$tag}) {
        $filter .= $origtext;
        $want_it = 1;
    } else {
        $want_it = 0;
    } 
}

sub text {
    my ($self, $text) = @_;
    $filter .= $text if $want_it; 
}

sub comment {
    # uncomment the two lines below to keep comments instead of stripping them
    # my ($self, $comment) = @_;
    # $filter .= "<!-- $comment -->";
}

sub end {
    my ($self, $tag, $origtext) = @_; 
    $filter .= $origtext if exists $ok_tags{$tag};
}

my $parser = Filter->new;    # direct method call; avoids ambiguous indirect object syntax
my $html = join '', <DATA>;
$parser->parse($html);
$parser->eof;

print $html;
print "\n\n------------------------\n\n";
print $filter;

__DATA__
<html>
<head>
  <title>Title</title>
</head>
<body>
<h1>Hello Parser</h1>
<p>You need HTML::Parser</p>
<h2>Parser rocks!</h2>
<a href="html.parser.com">html.parser.com</a>
<hr>
<pre>
  use HTML::Parser;
</pre>
<!-- HTML PARSER ROCKS! -->
</body>
</html>
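For the "just the text" case, here is a minimal sketch of one way it might look. It is not the code above: it assumes the version 3 handler interface of HTML::Parser (api_version => 3 with start_h, end_h and text_h) instead of subclassing, and the helper name text_only is made up for illustration. Unlike the filter above, it also clears the flag on a closing OK tag, so stray text between elements is not picked up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (name is an assumption, not from the post above):
# return the text found inside the listed tags, one chunk per line,
# with the tags themselves left out.
sub text_only {
    my ($html, @ok_tags) = @_;
    my %ok = map { $_ => 1 } @ok_tags;
    my @chunks;
    my $want_it = 0;

    my $parser = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,    # deliver each text run in one piece
        # 'tagname' and 'dtext' are argspec names from the HTML::Parser docs
        start_h => [ sub { $want_it = exists $ok{ $_[0] } ? 1 : 0 }, 'tagname' ],
        end_h   => [ sub { $want_it = 0 if exists $ok{ $_[0] } },    'tagname' ],
        text_h  => [ sub { push @chunks, $_[0] if $want_it },        'dtext'   ],
    );
    $parser->parse($html);
    $parser->eof;

    return join "\n", @chunks;
}

my $sample = '<h1>Hello Parser</h1><a href="x">link</a><p>You need HTML::Parser</p>';
print text_only($sample, qw( h1 p )), "\n";
```

Because 'dtext' hands over decoded text, entities such as &amp;amp; come back as plain characters; use 'text' instead if you want the source left untouched.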

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

In reply to Re: Tag filtering: a standard mechanism? by tachyon
in thread Tag filtering: a standard mechanism? by thpfft
