Which HTML parsing modules can decode HTML of any encoding properly?
Ideally, I'd like a parser that I can invoke in two ways. If I already know the encoding of the HTML for sure (e.g. from the HTTP header), I tell the module that encoding and it decodes the input (or I decode it myself and pass the decoded text; it doesn't matter which). If I don't know, I pass an undecoded byte stream, and it checks the HTML for a meta http-equiv content-type tag that gives the encoding (for which it will first have to check for byte order marks, to be able to find that tag in utf-16 (and utf-32) encoded text), and decodes the HTML using that automatically. (If the encoding is unknown and there's no byte order mark, it guesses some default, which could be cp1252 or possibly user-specified.)
It appears that HTML::Tree cannot do this. Does anyone know about the parsers of HTML::Tidy or XML::LibXML, or any other module? Obviously the parsers of most browsers would have some code like this. I could try to implement this myself and contribute to HTML::Tree, but I would like to know about any existing implementation first.
Update 2011-11-17: struck out the part about utf-32, since the HTML5 draft standard recommends against it. However, I believe utf-16 encoded HTML exists in the wild, so that part may still be important.
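To pin down the behaviour asked for above, here is a minimal sketch of the detection order: caller-supplied encoding first, then byte order mark, then a meta charset declaration, then a default. It is written in Python purely as a language-neutral illustration and is not the API of any existing Perl module. The function name `decode_html`, its parameters, the 1024-byte scan window and the regular expression are all assumptions made for this example; a real implementation would follow the HTML5 encoding sniffing rules much more closely. In line with the update, only the utf-8 and utf-16 byte order marks are checked.

```python
import codecs
import re

# Byte order marks to look for (utf-32 is deliberately left out).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Rough pattern for a charset declaration inside a <meta ...> tag.
# It covers both <meta http-equiv="Content-Type" content="...; charset=X">
# and the shorter <meta charset="X"> form. This is a simplification.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def decode_html(raw, known_encoding=None, default="cp1252"):
    """Decode HTML bytes to text (hypothetical helper for illustration)."""
    # 1. The caller knows the encoding for sure, e.g. from the HTTP header.
    if known_encoding is not None:
        return raw.decode(known_encoding, errors="replace")

    # 2. A byte order mark settles the question and is stripped.
    for bom, name in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(name, errors="replace")

    # 3. Otherwise scan the start of the document for a meta charset.
    match = META_CHARSET.search(raw[:1024])
    if match:
        try:
            return raw.decode(match.group(1).decode("ascii"), errors="replace")
        except LookupError:
            pass  # unknown charset name: fall through to the default

    # 4. Nothing found: guess the (possibly user-specified) default.
    return raw.decode(default, errors="replace")
```

Usage would look like `decode_html(body, known_encoding="utf-8")` when the HTTP header is trusted, and `decode_html(body)` when it is not. Note that without a byte order mark this sketch cannot find a meta tag in utf-16 text, since the tag's bytes are interleaved with zero bytes.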
In reply to HTML parsing module handles known and unknown encoding by ambrus