FalseVinylShrub has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I'm writing a site using Catalyst, as a learning experience (though I hope something useful will come out of it eventually).

There is going to be user-submitted content and I want to allow formatting. I'm currently using Markdown but planning to allow a limited set of HTML to be entered too.

So rather than write my own HTML sanitising code, I looked through CPAN. I decided to use HTML::Defang as it looks the most thorough, but I can't find much information about whether it's being kept up to date. Does anyone know of anything better, and is there a project to keep such things up to date in Perl?

There seem to be quite comprehensive and busy projects for similar things in PHP (HTML Purifier), Java and .NET (AntiSamy).

What does PerlMonks use? What do you use? At what point in the process do you sanitise the HTML?

I'm still reading through various resources such as ha.ckers.org/xss.html and it's definitely a complex topic...

Thanks in advance,

FalseVinylShrub

Disclaimer: Please review and test code, and use at your own risk... If I answer a question, I would like to hear if and how you solved your problem.

Re: HTML cleanup of user submitted content
by zentara (Cardinal) on Jan 08, 2010 at 16:35 UTC
Re: HTML cleanup of user submitted content
by GrandFather (Saint) on Jan 08, 2010 at 21:43 UTC

    HTML::Normalize may be of interest. Although the module is new (at least in terms of development), I am sure the author will promptly follow up any bug reports or feature requests. ;) Note though that it is designed to clean up parseable HTML rather than try to impose an HTML interpretation on a random string of characters (as may be presented by some less HTML-savvy users).


    True laziness is hard work
Re: HTML cleanup of user submitted content
by Herkum (Parson) on Jan 08, 2010 at 20:14 UTC

    If you are going to allow some formatting, perhaps you should consider using BBCode rather than HTML. BBCode is used on a number of bulletin boards, so it is fairly well known.

    There is a module BBCode::Parser which can take care of the BBCode to HTML presentation for you as well.

Re: HTML cleanup of user submitted content
by FalseVinylShrub (Chaplain) on Jan 18, 2010 at 07:27 UTC

    Hi

    Belated thanks for the responses. I thought I'd update with what I did. I got sidetracked from this project a bit but I'm back into it now.

    HTML::StripScripts looks like what I need: I'm more concerned about XSS attacks than anything else. I'd not found that in the various searches that I did.

    I did some further testing of HTML::Defang and it's pretty impressive. Example:

    <IMG SRC=&#x6A&#x61&#x76&#x61&#x73&#x63&#x72&#x69&#x70&#x74&#x3A&#x61&#x6C&#x65&#x72&#x74&#x28&#x27&#x58&#x53&#x53&#x27&#x29>

    Becomes:

    <IMG defang_SRC=javascript:alert('XSS')>

    Note that the encoded character references have no trailing semicolons, in an attempt to confuse filters (example taken from ha.ckers.org/xss.html).
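
    To show what that attribute value hides, here is a minimal sketch in core Perl. It is only an illustration, not how HTML::Defang works internally, and the sub name decode_hex_refs is made up for this example. It decodes hexadecimal character references whether or not the trailing semicolon is present, then checks for the "javascript:" scheme before and after decoding.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (not part of HTML::Defang): decode hexadecimal
    # character references, treating the trailing semicolon as optional,
    # as lenient browsers do.
    sub decode_hex_refs {
        my ($text) = @_;
        $text =~ s/&#x([0-9a-fA-F]+);?/chr(hex($1))/ge;
        return $text;
    }

    # The SRC value from the example above, split only for readability.
    my $src = '&#x6A&#x61&#x76&#x61&#x73&#x63&#x72&#x69&#x70&#x74&#x3A'
            . '&#x61&#x6C&#x65&#x72&#x74&#x28&#x27&#x58&#x53&#x53&#x27&#x29';

    my $decoded = decode_hex_refs($src);
    print "$decoded\n";

    # A naive filter that looks for the literal string "javascript:" misses
    # the encoded form and only matches once the value has been decoded.
    print "raw matches:     ", ($src     =~ /javascript:/i ? "yes" : "no"), "\n";
    print "decoded matches: ", ($decoded =~ /javascript:/i ? "yes" : "no"), "\n";
    ```

    This only handles hex references; decimal references and named entities would need the same treatment, which is one more reason to rely on a tested module rather than a hand-rolled filter.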

    I will do some similar tests on HTML::StripScripts and post the results. This module has some options that I may find useful compared to HTML::Defang (escaping disallowed tags with &lt;/&gt; so they appear on the page, for example); these are still to be investigated and tested.

    Looking into the possibilities has made me think seriously about disallowing HTML entry altogether and using another markup language. That would still have to be tested to make sure it doesn't let scripts through, though ;-)

    Cheers

    FalseVinylShrub
