skx has asked for the wisdom of the Perl Monks concerning the following question:

As part of an online forum I'm setting up, I need to display user-submitted content that is stored in a database.

Because the site uses cookies for authentication, and as a general preventative measure, I wish to strip out dangerous tags, JavaScript, images, etc.

I think that I would be safe leaving just a minimal subset of HTML, such as the tags P, B, I, and A (with only a subset of attributes, HREF and TITLE for example).

I realise that a regular-expression approach is unlikely to be workable, so my two choices seem to be HTML::Sanitizer and HTML::Scrubber. Both of these will do the job without too much effort. (I'm still surprised this isn't done here on the home nodes; maybe it's a hard thing to do efficiently? Either that, or it's not yet been considered important enough.)

As they do a real parse of the HTML, they rely upon the parsing modules HTML::Tree and HTML::Parser respectively.
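For reference, a minimal HTML::Scrubber sketch along the lines described above might look like this. The tag and attribute lists are the ones from the question; treat it as a starting point rather than a vetted policy:

```perl
use strict;
use warnings;
use HTML::Scrubber;

# Allow only a minimal tag set; everything else is stripped.
my $scrubber = HTML::Scrubber->new( allow => [qw( p b i a )] );

# Permit only href and title on <a>; all other attributes are dropped.
$scrubber->rules( a => { href => 1, title => 1 } );

my $dirty = '<p onclick="evil()">Hi <script>alert(1)</script>'
          . '<a href="/x" title="t">link</a></p>';
print $scrubber->scrub($dirty), "\n";
```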

Is there another approach I'm missing, with fewer dependencies? Or a simpler system I could use instead?

Whilst I can use either of the two packages above, I'm keen on using something that's less hungry, so that I can keep it up to date on my Debian Stable webhost.

Steve
---
steve.org.uk

Replies are listed 'Best First'.
Re: Sanitizing HTML
by simon.proctor (Vicar) on Sep 29, 2004 at 13:37 UTC
    Some forums don't allow any HTML but then use tags based on square brackets - much like the node linking mechanism used here. Perhaps this would provide an easier approach as you can strip out all HTML very easily by encoding everything you get.

    Picking what you want to allow is as simple as doing something like:
    [b][/b] [i][/i] [link target=""]link text[/link]
    Of course, that can add just as much complexity when it comes to testing for nesting, unclosed tags, etc. But I know of a few forums that still stick with this approach. Take a look at some of the open-source ones and see how they tackle this problem (phpBB comes to mind).
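    A minimal sketch of that encode-then-translate idea, using a hypothetical three-tag set and deliberately ignoring nesting and unclosed-tag checks:

```perl
use strict;
use warnings;

# Escape all raw HTML first, then translate a fixed set of bracket tags.
sub bbcode_to_html {
    my ($text) = @_;

    # Neutralise any HTML the user typed.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;

    # Translate the allowed bracket tags back into real HTML.
    $text =~ s{\[b\](.*?)\[/b\]}{<b>$1</b>}gs;
    $text =~ s{\[i\](.*?)\[/i\]}{<i>$1</i>}gs;
    $text =~ s{\[link target=&quot;([^&\]]*)&quot;\](.*?)\[/link\]}
              {<a href="$1">$2</a>}gs;

    return $text;
}

print bbcode_to_html('[b]bold[/b] and <script>bad</script>'), "\n";
```

    The big win is that nothing the user typed ever reaches the page as live HTML; the downsides (unbalanced or nested bracket tags) are exactly the ones mentioned above.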

    As always, YMMV. Parsing and stripping HTML is not a small task, but HTML::Parser does it very well, so using something based on it shouldn't present a problem.
Re: Sanitizing HTML
by gellyfish (Monsignor) on Sep 29, 2004 at 13:57 UTC
Re: Sanitizing HTML
by dragonchild (Archbishop) on Sep 29, 2004 at 13:31 UTC
    You want to do the following:
    1. Parse a complex data structure
    2. Strip out the vast majority of it
    3. Do this with a minimum of resource utilization

    In other words, you sound like my VP when we bring him options: "I want option A's features, option B's schedule, and option C's cost, using the development staff needed for option D." This isn't an à la carte menu. If you want to tow a boat, you're not going to use a Pinto.

    In other words, use the right tool for the job. Worry about optimization later, if ever.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      Don't you just love management? Your description of your VP just reminds me of this.

      May the Force be with you

      It's not so much the efficiency I care about as the need to use two modules outside the Debian package repository, as opposed to one module that I might not have seen.

      If this is the route I have to go, fine; if there is something more self-contained and easy to use, then I'd appreciate knowing about it.

      Steve
      ---
      steve.org.uk
Re: Sanitizing HTML
by pingo (Hermit) on Sep 29, 2004 at 15:54 UTC
    I would suggest you take a look at HTML::TreeBuilder. It does a fine job of fixing broken HTML, and it is quite easy to remove tags that you do not want to allow (using find()).

    If the HTML is just a snippet (not a complete document with html and body tags), it will add the necessary tags, but one can always use disembowel() to get rid of that. :-)
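    A rough sketch of that approach, under the assumption that anything outside a small allowed set should be spliced out while keeping its text (the tag list here is illustrative, and script/style get dropped wholesale):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my %allowed = map { $_ => 1 } qw( p b i a );

my $tree = HTML::TreeBuilder->new_from_content(
    '<p>Hi <script>alert(1)</script><b>there</b></p>'
);

for my $node ($tree->look_down( _tag => qr/./ )) {
    next if $allowed{ $node->tag };
    next if $node->tag =~ /^(?:html|head|body)$/;  # wrappers added by TreeBuilder

    if ($node->tag =~ /^(?:script|style)$/) {
        $node->delete;                         # drop the element and its contents
    }
    else {
        $node->replace_with_content->delete;   # keep inner text, drop the tag
    }
}

# disembowel() hands back the content without the html/body wrapper.
my @content = $tree->disembowel;
print map { ref($_) ? $_->as_HTML : $_ } @content;
```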
Re: Sanitizing HTML
by ccn (Vicar) on Sep 29, 2004 at 13:39 UTC

    Since HTML::Parser and HTML::Tree are core modules, dependencies are not a problem.
    Update: I forgot how I installed it :( They are not core, sorry

    imho, efficiency doesn't matter, because submitting a forum message doesn't happen very often

      Since HTML::Parser and HTML::Tree are core modules
      Whuh? Since when? Not in 5.8.5-to-be, and no release prior to that either. Maybe you're thinking of CGI.pm, which was made core many years ago. But nothing in Gisle Aas's realm is core.

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

        They are core in ActiveState Perl, which is perhaps what causes his confusion.

        Good for ActiveState!


        --
        Regards,
        Helgi Briem
        hbriem AT simnet DOT is
Re: Sanitizing HTML
by Anonymous Monk on Sep 30, 2004 at 09:24 UTC