Hi - super naive question here. How about:
$foo =~ s/</&lt;/g;
Instead of stripping the tags, just 'disable' them.
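To make the "disable, don't strip" idea concrete, here is a minimal sketch that escapes all the characters with special meaning in HTML text, not only `<`. The name `escape_html` is made up for illustration; it is not from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: turn markup characters into entities so the
# browser shows them as text instead of interpreting them as tags.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first, or later entities get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;   # only matters if the text lands in an attribute value
    return $text;
}

print escape_html('<b>"x" & y</b>'), "\n";
```

Escaping `&` before the others is the one ordering detail that matters here.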
jeffa
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
| [reply] [d/l] |
Note that this does not fix everything. For example some older browsers actually saw \x8b and \x9b as < and > respectively, and rendered HTML tags if you used these characters. This is covered in CERT's XSS Vulnerability document.
Luckily it's hard to find a browser vulnerable to this any more. But it's still something to watch for when trying to catch XSS vulnerabilities (the important thing here is to send the character set (encoding) along with the content-type header).
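A rough sketch of both precautions, with assumptions stated up front: the function names are invented for illustration, ISO-8859-1 is only a placeholder default, and the byte replacement is only safe when the input is a single-byte encoding (in UTF-8 data those bytes can be part of legitimate multi-byte characters).

```perl
use strict;
use warnings;

# Hypothetical helper: build a CGI response header that states the
# character set explicitly, so the browser does not have to guess.
sub html_header {
    my ($charset) = @_;
    $charset ||= 'ISO-8859-1';    # assumed default; use whatever you really send
    return "Content-Type: text/html; charset=$charset\r\n\r\n";
}

# Hypothetical helper: neutralize the two bytes that some old browsers
# treated as angle brackets. Assumes a single-byte encoding.
sub neutralize_lookalikes {
    my ($text) = @_;
    $text =~ s/\x8b/&lt;/g;
    $text =~ s/\x9b/&gt;/g;
    return $text;
}

print html_header('ISO-8859-1');
print neutralize_lookalikes("\x8bscript\x9b"), "\n";
```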
Also, it's naive at best to suggest allowing only plain text. Most systems want to accept some form of HTML. The thing to do is check against a list of allowed tags rather than a list of disallowed ones. And never allow attributes. That's just asking for trouble.
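One way to sketch the "allowed tags, no attributes" rule is to escape everything first and then re-enable only bare tags from a short list. The list in `%ALLOWED` and the name `allow_listed_html` are illustrative assumptions, not a recommendation of which tags are safe for your site.

```perl
use strict;
use warnings;

# Assumed allow list, for illustration only.
my %ALLOWED = map { $_ => 1 } qw(b i em strong p br);

sub allow_listed_html {
    my ($text) = @_;

    # Step 1: disable all markup.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;

    # Step 2: re-enable only bare <tag> or </tag> forms whose name is allowed.
    # Anything carrying attributes never matches, so it stays escaped.
    $text =~ s{&lt;(/?)([A-Za-z][A-Za-z0-9]*)&gt;}
              { $ALLOWED{lc $2} ? "<$1" . lc($2) . ">" : "&lt;$1$2&gt;" }ge;

    return $text;
}

print allow_listed_html('<b>hi</b> <script>x</script> <a href="x">y</a>'), "\n";
```

Note that this sketch also escapes entities the user typed (so `&amp;` in the input shows up literally), and it does nothing to balance unclosed tags.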
| [reply] |
I've used that technique a lot when we didn't want message posters to be allowed to use any html tags at all.
Michael
| [reply] |
An HTML::Parser-based tag stripper (like the one I posted here) can handle this.
You just need to make sure you escape the '>' and '<' in the text handler (as that is what the nested tags will be treated as).
I haven't seen any non-HTML::Parser tag strippers that don't have one problem or another.
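For reference, a minimal sketch of that kind of stripper (not the one linked above). It assumes HTML::Parser from CPAN is installed and uses its version 3 handler API; `strip_tags` is a made-up name. Tags get no handler and so are dropped, and the text handler re-escapes everything it is given.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: drop every tag, keep the text, and escape any
# '<', '>' or '&' that the parser hands over as plain text.
sub strip_tags {
    my ($html) = @_;
    my $out = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [
            sub {
                my ($text) = @_;    # 'dtext' gives entity-decoded text
                $text =~ s/&/&amp;/g;
                $text =~ s/</&lt;/g;
                $text =~ s/>/&gt;/g;
                $out .= $text;
            },
            'dtext',
        ],
    );
    $p->ignore_elements(qw(script style));    # skip their content entirely
    $p->parse($html);
    $p->eof;

    return $out;
}

print strip_tags('<<b>b>hi<</b>/b>'), "\n";
```

Because every character of output passes through the text handler, nothing the parser decides to treat as text can come out as a live tag.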
gav^
| [reply] |
The chief problem is HTML written thus:
<<ILLEGAL_TAG>ILLEGAL_TAG>...<</ILLEGAL_TAG>/ILLEGAL_TAG>
I have code that converts a user-supplied chunk of text into renderable HTML. It allows a restricted subset of tags. Instead of converting the string in-place, I pick pieces off incrementally, doing something that looks like:
emit($1), next if m/\G([^<>&]+)/gc;
emit($1), next if m/\G(&\w+;)/gc;
emit("&lt;"), next if m/\G<(?!<)/gc;
# handle potentially valid REs here
emit("&lt;"), next if m/\G</gc;
emit("&gt;"), next if m/\G>/gc;
The first RE handles text. The second RE handles entities. The third RE prevents a sneak attack. In the case of the above fragment, it leaves
<ILLEGAL_TAG>...</ILLEGAL_TAG>
And yeah, yeah, my hands should be swatted for rolling my own HTML parsing code, but it was written several years ago before there were as many options, and it works well for what I use it for.
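For anyone who wants to try the incremental approach, here is a self-contained sketch in the same spirit. It is not the code described above: the allow list, the `render` name, and the single tag regex standing in for the "potentially valid" cases are all assumptions made for illustration. It builds a string instead of calling emit().

```perl
use strict;
use warnings;

# Assumed allow list, for illustration only.
my %ALLOWED = map { $_ => 1 } qw(b i em strong p);

sub render {
    my ($in) = @_;
    my $out = '';
    for ($in) {                          # alias $_ to the input string
        pos($_) = 0;
        while (pos($_) < length($_)) {
            # Plain text and well-formed named entities pass through.
            $out .= $1, next if m/\G([^<>&]+)/gc;
            $out .= $1, next if m/\G(&\w+;)/gc;

            # A bare tag: keep it if allowed, escape it otherwise.
            if (m/\G<(\/?)(\w+)>/gc) {
                my ($slash, $name) = ($1, $2);
                $out .= $ALLOWED{lc $name}
                      ? "<$slash" . lc($name) . ">"
                      : "&lt;$slash$name&gt;";
                next;
            }

            # Anything left is a stray special character: escape it.
            $out .= '&lt;',  next if m/\G</gc;
            $out .= '&gt;',  next if m/\G>/gc;
            $out .= '&amp;', next if m/\G&/gc;
        }
    }
    return $out;
}

print render('<<ILLEGAL_TAG>ILLEGAL_TAG> and <b>bold</b>'), "\n";
```

The /c flag keeps pos() where it was when a match fails, which is what lets each alternative be tried in turn from the same position. Every character falls into one of the cases, so the loop always advances.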
| [reply] [d/l] [select] |
iirc, something like
1 while {code to strip tags here} would do the trick.
But as you say, "duh".
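Spelled out, the loop might look like the sketch below. The regex is a deliberately simple stand-in for "code to strip tags here", and `strip_repeatedly` is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: keep stripping until a pass removes nothing, so
# tags that only appear after an inner tag is removed get stripped too.
sub strip_repeatedly {
    my ($text) = @_;
    1 while $text =~ s/<[^<>]*>//g;
    return $text;
}

print strip_repeatedly('<<ILLEGAL_TAG>ILLEGAL_TAG>text<</ILLEGAL_TAG>/ILLEGAL_TAG>'), "\n";
```

The first pass leaves `<ILLEGAL_TAG>text</ILLEGAL_TAG>` and the second pass removes that. A lone `<` or `>` that is not part of a pair survives, so the result still wants escaping before it is sent to a browser.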