Which HTML parsing modules can decode HTML of any encoding properly?

Ideally, I'd like a parser that I can invoke in two ways. If I already know the encoding of the HTML for sure (e.g. from the HTTP header), I tell the module that encoding and it decodes the input (or I decode it myself and pass the decoded text, it doesn't matter). If I don't know, I pass an undecoded byte stream, and it checks the HTML for a meta http-equiv content-type tag that gives the encoding (for which it will first have to check for byte order marks, to be able to find that tag in utf-16 ~~(and utf-32)~~ encoded text), and decodes the HTML using that automatically. (If the encoding is unknown and there's no byte order mark, it guesses some default, which could be cp1252 or possibly user-specified.)
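To make the two calling modes concrete, here is a rough, language-neutral sketch in Python of the behaviour described above. It is not the API of any existing Perl module. The names `decode_html`, `known_encoding` and `default` are made up for illustration, as is the 1024-byte scan window. It also assumes that a byte order mark is decisive on its own, so the meta tag is only consulted when no BOM is present.

```python
import codecs
import re

# Byte order marks to check; UTF-32 is deliberately left out of this sketch.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Matches both <meta http-equiv="Content-Type" content="...; charset=X">
# and the shorter <meta charset="X"> form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def decode_html(raw, known_encoding=None, default="cp1252"):
    """Return (text, encoding_used) for a byte string holding HTML.

    Hypothetical helper: the name and parameters are assumptions.
    """
    # Mode 1: the caller knows the encoding (e.g. from the HTTP header).
    if known_encoding:
        return raw.decode(known_encoding, errors="replace"), known_encoding

    # Mode 2, step 1: a byte order mark settles the question.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(name, errors="replace"), name

    # Mode 2, step 2: look for a charset declaration near the start
    # (the 1024-byte window is an arbitrary choice for this sketch).
    match = _META_CHARSET.search(raw[:1024])
    if match:
        name = match.group(1).decode("ascii")
        try:
            codecs.lookup(name)
        except LookupError:
            name = default  # unknown label: fall back to the default
        return raw.decode(name, errors="replace"), name

    # Mode 2, step 3: no BOM and no declaration, so guess the default.
    return raw.decode(default, errors="replace"), default
```

A caller that trusts the HTTP header would use `decode_html(body, known_encoding="utf-8")`, and one that knows nothing would call `decode_html(body)`. A real parser would need more care than this, for example with a declaration that sits inside a comment or script.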

It appears that HTML::Tree cannot do this. Does anyone know whether the parsers of HTML::Tidy or XML::LibXML, or of any other module, can? Obviously the parsers of most browsers must have some code like this. I could try to implement this myself and contribute it to HTML::Tree, but I would like to know about any existing implementation first.

Update 2011-11-17: struck out the part about utf-32, because the HTML5 draft standard recommends against it. However, I believe utf-16 encoded HTML exists in the wild, so that part may still be important.

In reply to HTML parsing module handles known and unknown encoding by ambrus
