This regex handles the opening and closing tags. It substitutes out matched pairs of angle brackets and will ignore individual ones. I haven't tested it in depth, but I would probably want to play with this and see whether, with mismatched angle brackets and server side includes, I could sneak something past it.

    $data =~ s/ <                       # First '<'
                (?!                     # Not followed by (everything in this list is allowed)
                  (?:                   # (with non-grouping parens)
                      \/?br>            # A break tag
                    |                   # or
                      \/?p>             # A paragraph tag
                    |                   # or
                      \/?font[^>]*>     # A font tag
                    |                   # or
                      \/?h[1-6]>        # A headline
                  )                     # Close non-grouping parens
                )                       # End of negative lookahead
                (                       # Capture to $1
                  [^>]*                 # Everything until the final '>'
                )                       # End capture
                >                       # Final '>'
              /&lt;$1&gt;/gsix;
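To make "ignore individual ones" concrete, here is a small sketch of how the substitution behaves on a matched pair versus a lone '<'. The sub name escape_tags and the sample strings are made up for illustration, and the replacement side is assumed to be the entity-escaped form &lt;$1&gt;.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution (one-line form, no /x).
sub escape_tags {
    my ($data) = @_;
    $data =~ s/<(?!(?:\/?br>|\/?p>|\/?font[^>]*>|\/?h[1-6]>))([^>]*)>/&lt;$1&gt;/gsi;
    return $data;
}

# Matched brackets: the disallowed tags are escaped, <p> is left alone.
my $paired = escape_tags('<p>Hi</p><script>alert(1)</script>');

# A lone '<' with no closing '>' never matches, so it passes through as-is.
my $unpaired = escape_tags('<p>Hi</p><script src=evil.js');

print "$paired\n";    # <p>Hi</p>&lt;script&gt;alert(1)&lt;/script&gt;
print "$unpaired\n";  # <p>Hi</p><script src=evil.js
```

Whether a browser does anything with the unterminated tag in the second case is exactly the sort of thing that would need testing.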
If you want to allow more HTML, just add the allowable elements to the negative lookahead list. This only allows very simple tags, and it has the benefit that you state what you will allow rather than what you won't allow (which has the risk of you overlooking something).
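One way to keep the allowed list easy to extend is to build the lookahead from an array. This is only a sketch: @allowed, $allowed_re and escape_tags are names invented here, the b and i entries are examples of additions, and the replacement side is assumed to be &lt;$1&gt;.

```perl
use strict;
use warnings;

# Hypothetical whitelist: the four original patterns plus two additions.
my @allowed = (
    'br', 'p', 'font[^>]*', 'h[1-6]',   # from the original regex
    'b', 'i',                           # added for illustration
);

# Each entry becomes \/?NAME> so both opening and closing forms are allowed.
my $allowed_re = join '|', map { '\/?' . $_ . '>' } @allowed;

sub escape_tags {
    my ($data) = @_;
    $data =~ s/<(?!(?:$allowed_re))([^>]*)>/&lt;$1&gt;/gsi;
    return $data;
}

my $out = escape_tags('<b>bold</b> <blink>no</blink>');
print "$out\n";   # <b>bold</b> &lt;blink&gt;no&lt;/blink&gt;
```

Each entry is still a regex fragment, so anything added to the list needs the same care as the hand-written alternatives.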
Also note that you want the entire document in the variable. If you run this line by line, someone could break the HTML up over several lines and beat the regex.
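The line-by-line problem can be shown with a tag split over two lines. This is a sketch with made-up input; escape_tags is a hypothetical wrapper, and the replacement side is assumed to be &lt;$1&gt;.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution.
sub escape_tags {
    my ($data) = @_;
    $data =~ s/<(?!(?:\/?br>|\/?p>|\/?font[^>]*>|\/?h[1-6]>))([^>]*)>/&lt;$1&gt;/gsi;
    return $data;
}

# The opening script tag is deliberately broken across two lines.
my $html = "<p>Hello</p>\n<script\nsrc=evil.js></script>\n";

# Line by line: no single line holds both '<script' and its '>',
# so the opening tag is never matched.
my $by_line = join '', map { escape_tags($_) } split /^/m, $html;

# Whole document: [^>]* happily runs across the newline.
my $whole = escape_tags($html);

print $by_line;   # opening <script ...> survives, only </script> is escaped
print $whole;     # both script tags are escaped
```

When the data comes from a file, undefining $/ (for example with "local $/;") before reading is the usual way to get everything into one scalar.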
And for those who prefer it on one line:
    $data =~ s/<(?!(?:\/?br>|\/?p>|\/?font[^>]*>|\/?h[1-6]>))([^>]*)>/&lt;$1&gt;/gsi;

Cheers,
Ovid patiently waits to be blasted for this one.
In reply to (Ovid) Maybe you don't need to parse the HTML
by Ovid
in thread BBS HTML fitler
by tkroll