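Another option for tidying the HTML before sanitising it is XML::LibXML. Its HTML parsing methods (parse_html_string, parse_html_file and parse_html_fh, or the newer load_html) cope gracefully with mismatched tag nesting, broken quoting and other common offences, particularly with the recover option enabled. You can then call toStringHTML on the resulting document object to produce nice clean HTML.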
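
As a rough sketch of that approach (the input string here is made up for illustration, and the recover and suppress_* options are my assumption about how you would want the parser configured, not something you strictly need):

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical messy input: unquoted attribute, mis-nested <b>/<i>,
# and paragraphs that are never closed.
my $dirty = '<p class=intro><b>Hello <i>world</b></i><p>Second paragraph';

# recover => 1 asks libxml2 to carry on after parse errors instead of dying;
# the suppress_* options keep its complaints off STDERR.
my $doc = XML::LibXML->load_html(
    string            => $dirty,
    recover           => 1,
    suppress_errors   => 1,
    suppress_warnings => 1,
);

# Serialise the repaired tree back out as HTML.
print $doc->toStringHTML;
```

Note that the parser builds a whole document, so if you feed it a fragment you should expect the output to come back wrapped in html and body elements (and usually with a DOCTYPE); you may need to pull out just the body's children before passing the result on to your sanitiser.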