The chief problem is HTML written thus:
<<ILLEGAL_TAG>ILLEGAL_TAG>...<</ILLEGAL_TAG>/ILLEGAL_TAG>
The tag-stripper only works one level deep, a weakness shared by about 100% of the tag-strippers I've ever seen.
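
To make the failure concrete, here is a minimal sketch. The `strip_once` sub is a hypothetical one-pass stripper written for illustration, not the site's actual code; it assumes the stripper deletes anything that looks like a complete tag in a single scan.

```perl
use strict;
use warnings;

# Hypothetical one-pass stripper: deletes every <...> that contains
# no nested angle brackets, scanning the string exactly once.
sub strip_once {
    my ($html) = @_;
    $html =~ s/<[^<>]*>//g;
    return $html;
}

my $input  = '<<script>script>alert(1)<</script>/script>';
my $output = strip_once($input);

# Deleting the inner tags splices the outer pieces into real tags.
print "$output\n";   # prints: <script>alert(1)</script>
```

The stripper removes the inner tag and thereby builds the outer one for the attacker.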

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

(jeffa) Re: Tag-Stripper is Insecure
by jeffa (Bishop) on Feb 01, 2002 at 15:40 UTC
    Hi - super naive question here. How about:
    $foo =~ s/</&lt;/g;
    Instead of stripping the tags, just 'disable' them.

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Not to belittle your idea, but "duh". The best course of action is not to strip the tags, but to merely render them as actual text. You do need to s/>/&gt;/g too, or else I'll find a way to terminate a comment early or something wacky like that.
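
A minimal sketch of the "render them as text" approach, escaping both brackets as suggested above. The sub name `escape_html` is made up for this example, and escaping `&` and `"` as well is an extra precaution beyond what the thread spells out.

```perl
use strict;
use warnings;

# Render markup as literal text by escaping the characters HTML treats
# as special. '&' must go first so the entities added below are not
# escaped a second time.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

print escape_html('<<b>b>bold<</b>/b>'), "\n";
# prints: &lt;&lt;b&gt;b&gt;bold&lt;&lt;/b&gt;/b&gt;
```

Nesting no longer matters here, because nothing is removed and so nothing can be spliced together.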

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      jeffa++, I think this is a good idea for lots of reasons beyond fixing this particular problem.

              - tye (but my friends call me "Tye")
      Note that this does not fix everything. For example some older browsers actually saw \x8b and \x9b as < and > respectively, and rendered HTML tags if you used these characters. This is covered in CERT's XSS Vulnerability document.

      Luckily it's hard to find a browser vulnerable to this any more. But it's still something to watch for when trying to catch XSS vulnerabilities (the important thing here is to send the character set (encoding) along with the content-type header).
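
As a rough sketch of sending the encoding with the content type, a CGI-style script could emit the header by hand. The sub name `html_header` and the choice of ISO-8859-1 are assumptions for illustration only; use whatever encoding the page is actually written in.

```perl
use strict;
use warnings;

# Build a CGI-style response header that names the encoding explicitly,
# so the browser does not have to guess how to decode the bytes.
sub html_header {
    my ($charset) = @_;
    return "Content-Type: text/html; charset=$charset\r\n\r\n";
}

print html_header('ISO-8859-1');
```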

      Also it's naive at best to suggest just allowing text. Most systems want to accept some form of HTML. The thing to do is to list the tags you allow, rather than the tags you disallow, and reject everything else. And never allow attributes. That's just asking for trouble.
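
One way to sketch "allowed tags only, no attributes" is to escape everything first and then restore only the bare tags on a list. The tag list and the sub name `allow_some_tags` are hypothetical, and this is an illustration rather than a vetted filter.

```perl
use strict;
use warnings;

# Hypothetical list of permitted tags; adjust to taste.
my @ALLOWED    = qw(b i em strong p);
my $allowed_re = join '|', @ALLOWED;

sub allow_some_tags {
    my ($text) = @_;

    # Step 1: escape everything, so no markup survives by default.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;

    # Step 2: restore only bare tags from the list. A tag carrying
    # attributes does not match this pattern, so it stays escaped.
    $text =~ s{&lt;(/?)($allowed_re)&gt;}{<$1$2>}gi;

    return $text;
}

print allow_some_tags('<b onclick="x()">hi</b> <i>ok</i>'), "\n";
# prints: &lt;b onclick="x()"&gt;hi</b> <i>ok</i>
```

Note that this does not balance tags: the opening `<b ...>` above is escaped while its closing `</b>` is restored.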

      I've used that technique a lot when we didn't want message posters to be allowed to use any HTML tags at all.

      Michael

Re (tilly) 1: Tag-Stripper is Insecure
by tilly (Archbishop) on Feb 01, 2002 at 18:16 UTC
Re: Tag-Stripper is Insecure
by gav^ (Curate) on Feb 01, 2002 at 18:27 UTC

    An HTML::Parser-based tag stripper (like the one I posted here) can handle this.

    You just need to make sure you escape the '>' and '<' in the text handler (the leftover pieces of the nested tags are handed to it as plain text).

    I haven't seen any non-HTML::Parser tag strippers that don't have one problem or another.

    gav^

Re: Tag-Stripper is Insecure
by dws (Chancellor) on Feb 01, 2002 at 22:08 UTC
    The chief problem is HTML written thus: <<ILLEGAL_TAG>ILLEGAL_TAG>...<</ILLEGAL_TAG>/ILLEGAL_TAG>
    I have code that converts a user-supplied chunk of text into renderable HTML. It allows a restricted subset of tags. Instead of converting the string in-place, I pick pieces off incrementally, doing something that looks like:
        emit($1),     next if m/\G([^<>&]+)/gc;   # plain text
        emit($1),     next if m/\G(&\w+;)/gc;     # entities
        emit("&lt;"), next if m/\G<(?=<)/gc;      # '<' directly followed by another '<'
        # handle potentially valid REs here
        emit("&lt;"), next if m/\G</gc;
        emit("&gt;"), next if m/\G>/gc;
    The first RE handles text. The second RE handles entities. The third RE prevents a sneak attack. In the case of the above fragment, it leaves
        &lt;ILLEGAL_TAG&gt;...&lt;/ILLEGAL_TAG&gt;
    And yeah, yeah, my hands should be swatted for rolling my own HTML parsing code, but it was written several years ago before there were as many options, and it works well for what I use it for.
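
A runnable sketch of this incremental style, under several assumptions: the `render` sub, the `%ALLOWED` list, and the tag-handling step are all invented here to fill the "potentially valid" slot, and the doubled-bracket check uses a positive lookahead `(?=<)` so that a single `<` can still reach the tag step. It is not the code described in the post.

```perl
use strict;
use warnings;

# Hypothetical set of bare tags that may pass through unescaped.
my %ALLOWED = map { $_ => 1 } qw(b i em strong p);

sub render {
    my ($text) = @_;
    my $out = '';
    pos($text) = 0;

    # Pick pieces off the front one at a time; \G anchors each match
    # where the previous one ended, and /c keeps pos() on failure.
    while (pos($text) < length $text) {
        $out .= $1,      next if $text =~ m/\G([^<>&]+)/gc;   # plain text
        $out .= $1,      next if $text =~ m/\G(&\w+;)/gc;     # named entity
        $out .= '&amp;', next if $text =~ m/\G&/gc;           # stray ampersand
        $out .= '&lt;',  next if $text =~ m/\G<(?=<)/gc;      # '<' before another '<'

        # Bare tag without attributes: emit it only if it is on the list.
        if ($text =~ m/\G<(\/?)(\w+)>/gc) {
            my $tag = "$1$2";
            $out .= $ALLOWED{lc $2} ? "<$tag>" : "&lt;$tag&gt;";
            next;
        }

        $out .= '&lt;',  next if $text =~ m/\G</gc;           # any other '<'
        $out .= '&gt;',  next if $text =~ m/\G>/gc;           # any other '>'
    }
    return $out;
}

print render('<<ILLEGAL_TAG>ILLEGAL_TAG>...<</ILLEGAL_TAG>/ILLEGAL_TAG>'), "\n";
print render('<b>bold</b> & <script>'), "\n";
```

Every input character is consumed by exactly one branch, so the loop always advances, and anything that is not explicitly allowed comes out escaped.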

Re: Tag-Stripper is Insecure (boo)
by boo_radley (Parson) on Feb 01, 2002 at 17:14 UTC
    iirc, something like 1 while {code to strip tags here} would do the trick, since it repeats the stripping until a pass finds nothing left to remove.
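
A minimal sketch of that loop, assuming the stripping code is a single substitution that removes innermost tags (the sub name `strip_until_stable` is made up for the example):

```perl
use strict;
use warnings;

sub strip_until_stable {
    my ($html) = @_;
    # s///g returns the number of substitutions it made, so the loop
    # repeats until a full pass removes nothing.
    1 while $html =~ s/<[^<>]*>//g;
    return $html;
}

print strip_until_stable('<<script>script>alert(1)<</script>/script>'), "\n";
# prints: alert(1)
```

This closes the nesting hole, but stray unmatched `<` or `>` characters are still left in the output, so escaping them afterwards remains a good idea.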
    But as you say, "duh".