in reply to Parsing of undecoded UTF-8 will give garbage when decoding entities
> $page is the raw html document

Does $page consist of bytes (e.g. "\x{e2}\x{98}\x{ba}", which can be decoded as U+263A WHITE SMILING FACE from UTF-8, or "\x{fe}\x{ff}\x{26}\x{3a}", the same U+263A in UTF-16 big-endian with a BOM), or of characters (e.g. "\x{263A}", which is already a U+263A WHITE SMILING FACE character and should be encoded before being written anywhere)? HTML::TokeParser seems to ask for the latter: it wants HTML that has already been decoded from bytes, in whatever encoding they arrived in, to characters. See also: perlunitut.
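For example (a minimal sketch, assuming the encoding is already known to be UTF-8; the $bytes value and parser setup are illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::TokeParser;

# Raw octets as they might arrive over the wire: the UTF-8
# encoding of U+263A WHITE SMILING FACE.
my $bytes = "\x{e2}\x{98}\x{ba}";

# Decode bytes -> characters before handing them to the parser.
my $chars = decode('UTF-8', $bytes);    # now eq "\x{263A}"

# HTML::TokeParser accepts a reference to the document string;
# feeding it decoded characters avoids garbage when it expands
# entities mixed with non-ASCII text.
my $parser = HTML::TokeParser->new(\$chars);
```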
Of course, this brings us to another problem: correctly determining the encoding of a byte stream. Sometimes that is the HTML parser's job (when the charset is declared by a <meta> tag, as in HTML4/HTML5), sometimes the HTTP client's (when a proper Content-Type header is sent), and sometimes it just has to be guessed. And it is entirely possible to misconfigure a webserver to serve Content-Type: text/html; charset=utf-8 with <meta charset="koi8-r"> in the HTML while the real encoding is UTF-16LE with a BOM.
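When the page comes over HTTP via LWP, the client can often do this decoding itself (a sketch, not anyone's posted code; the URL is a placeholder):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://example.com/');   # placeholder URL
die $response->status_line unless $response->is_success;

# decoded_content() undoes any Content-Encoding and then decodes the
# bytes to characters, taking the charset from the Content-Type header
# and falling back to BOM/<meta> sniffing (content_charset()).
my $page = $response->decoded_content;

# $page now holds characters, suitable for HTML::TokeParser.
```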
Replies are listed 'Best First'.
Re^2: Parsing of undecoded UTF-8 will give garbage when decoding entities
  by itsscott (Sexton) on Aug 25, 2015 at 14:43 UTC
  by aitap (Curate) on Aug 26, 2015 at 11:44 UTC