
I was playing around with this, and it looks like you may not need to parse the HTML at all. We're not stripping or evaluating HTML here; we're trying to shut it down. jeffa had the right idea by substituting &lt; for <. I wrote a regex that might be a start for you:

$data =~ s/
          <                 # First '<'
           (?!              # Not followed by (Everything in this list is allowed)
            (?:             # (with non-grouping parens)
             \/?br>         # A break tag
             |              # or
             \/?p>          # A paragraph tag
             |              # or
             \/?font[^>]*>  # A font tag
             |              # or
             \/?h[1-6]>     # A headline
            )               # Close non-grouping parens
           )                # End of negative lookahead
           (                # Capture to $1
            [^>]*           # Everything until the final '>'
           )                # End capture
          >                 # Final '>'
          /&lt;$1&gt;/gsix;

This regex handles both opening and closing tags. It escapes matched pairs of angle brackets and ignores unpaired ones. I haven't tested it in depth; before trusting it, I would experiment with mismatched angle brackets and server-side includes to see whether I could sneak something past it.
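To make that behaviour concrete, here is a small sketch that wraps the same substitution in a helper and runs it over a few inputs. The name escape_disallowed and the sample strings are made up for illustration; the regex itself is unchanged. Note the last case: because font is allowed with arbitrary attributes, anything inside a font tag passes through untouched, which may or may not be what you want.

```perl
use strict;
use warnings;

# Hypothetical helper around the substitution from this post.
sub escape_disallowed {
    my ($data) = @_;
    $data =~ s/<(?!(?:\/?br>|\/?p>|\/?font[^>]*>|\/?h[1-6]>))([^>]*)>/&lt;$1&gt;/gsi;
    return $data;
}

# Allowed tags survive, everything else is escaped.
print escape_disallowed('<p>Hi <b>there</b></p>'), "\n";
# prints: <p>Hi &lt;b&gt;there&lt;/b&gt;</p>

print escape_disallowed('<script>alert(1)</script>'), "\n";
# prints: &lt;script&gt;alert(1)&lt;/script&gt;

# A lone '<' with no closing '>' is left alone.
print escape_disallowed('3 < 4 is true'), "\n";
# prints: 3 < 4 is true

# Attributes on an allowed font tag are not inspected at all.
print escape_disallowed('<font onclick="x()">hi</font>'), "\n";
# prints: <font onclick="x()">hi</font>
```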

If you want to allow more HTML, just add the allowable elements to the negative lookahead list. This only allows very simple tags and has the benefit that you state what you will allow, as opposed to what you won't allow (which risks overlooking something).
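One way to keep that list easy to extend is to build the lookahead from an array. This is only a sketch: the names @allowed and escape_disallowed_ext are invented here, the b and i entries are examples of additions, and the optional slash has been hoisted out in front of the alternation, which should match the same tags as the original.

```perl
use strict;
use warnings;

# Hypothetical whitelist: each entry is the pattern that must follow
# '<' or '</' for the tag to be left alone.
my @allowed = (
    'br>',
    'p>',
    'font[^>]*>',
    'h[1-6]>',
    'b>',    # added for this example
    'i>',    # added for this example
);

my $allowed    = join '|', @allowed;
my $disallowed = qr/<(?!\/?(?:$allowed))([^>]*)>/i;

sub escape_disallowed_ext {
    my ($data) = @_;
    $data =~ s/$disallowed/&lt;$1&gt;/g;
    return $data;
}

print escape_disallowed_ext('<b>bold</b> <u>no</u>'), "\n";
# prints: <b>bold</b> &lt;u&gt;no&lt;/u&gt;
```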

Also note that you want the entire document in the variable. If you run this line by line, someone could break the HTML up over several lines and beat the regex.
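Here is a quick sketch of that failure, assuming a tag split across two lines. The input string and variable names are made up for the example. Processed one line at a time, neither line contains a complete <...> pair, so nothing gets escaped; processed as a whole, the tag is caught.

```perl
use strict;
use warnings;

my $re = qr/<(?!(?:\/?br>|\/?p>|\/?font[^>]*>|\/?h[1-6]>))([^>]*)>/si;

# A disallowed tag deliberately broken over two lines.
my $input = "<script\nsrc='evil.js'>\n";

# Line by line: each line is filtered on its own.
my $by_line = '';
for my $line (split /^/m, $input) {
    (my $copy = $line) =~ s/$re/&lt;$1&gt;/g;
    $by_line .= $copy;
}

# Whole document in one variable.
(my $whole = $input) =~ s/$re/&lt;$1&gt;/g;

print "line by line: ", ($by_line eq $input ? "tag survived" : "tag escaped"), "\n";
print "whole string: ", ($whole   eq $input ? "tag survived" : "tag escaped"), "\n";
```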

And for those who prefer it on one line:

$data =~ s/<(?!(?:\/?br>|\/?p>|\/?font[^>]*>|\/?h[1-6]>))([^>]*)>/&lt;$1&gt;/gsi;

Cheers,
Ovid

Ovid patiently waits to be blasted for this one.

In reply to (Ovid) Maybe you don't need to parse the HTML by Ovid
in thread BBS HTML fitler by tkroll
