in reply to HTML::Tree problems with UTF-8 Content.

I've had this sort of trouble with the HTML parsing modules too. For reasons that probably make sense for some applications, the parser seems to pull fixed-size chunks of bytes from whatever input it is given (even a scalar string, which seems odd). It then does operations on those chunks where perl 5.8's split between "byte semantics" and "character semantics" goes awry: it ends up trying to do utf8 operations on a character whose final byte or two fell on the wrong side of a chunk boundary.

The workaround is to give the parser raw bytes. If you pass it a file handle as input, do not open that file handle with ":utf8" or any other PerlIO layer (such as ":encoding(UTF-8)") that would cause the data to be decoded into perl's internal utf8 form on input. If you pass it a scalar string, make sure the string does not have the utf8 flag set. (See the Encode man page about the utf8 flag.)
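Here is a minimal sketch of the scalar-string case, using only the core Encode module. The helper name as_utf8_bytes is my own invention, and it assumes that a string without the utf8 flag already holds UTF-8 encoded bytes, which may not be true for your data. The call to the parser itself is left out; the point is only how to get a byte string to hand to it.

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

# Hypothetical helper (not part of any module): return a byte string with
# the utf8 flag off. Assumption: an unflagged string is already UTF-8 bytes.
sub as_utf8_bytes {
    my ($text) = @_;
    return is_utf8($text) ? encode('UTF-8', $text) : $text;
}

# A character string; the wide character U+263A forces the utf8 flag on.
my $chars = "caf\x{e9} \x{263a}";
my $bytes = as_utf8_bytes($chars);

printf "utf8 flag before: %d, after: %d\n",
    (is_utf8($chars) ? 1 : 0), (is_utf8($bytes) ? 1 : 0);
printf "characters: %d, bytes: %d\n", length($chars), length($bytes);

# For the file-handle case, the equivalent would be to open without any
# decoding layer, for example:
#   open my $fh, '<:raw', $path or die "Cannot open $path: $!";
```

$bytes is then what I would pass to the parser in place of $chars.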

After the parsing is done, use Encode::decode() with the document's encoding (for example, decode('UTF-8', $text)) on the various pieces of text content if you need to do utf8 character-based stuff with them.
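A rough sketch of that decoding step, again with core Encode only. The @pieces array is a stand-in I made up for whatever text the parser hands back (for instance, what as_text returns on an element), and it assumes the original document was UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Stand-in for text pulled out of the parsed tree: raw UTF-8 bytes with
# the utf8 flag off (assumption: the source document was UTF-8).
my @pieces = ("na\xc3\xafve", "r\xc3\xa9sum\xc3\xa9");

# Decode each piece into a character string before doing character work.
my @decoded = map { decode('UTF-8', $_) } @pieces;

for my $i (0 .. $#pieces) {
    printf "bytes: %d  characters: %d\n",
        length($pieces[$i]), length($decoded[$i]);
}
```

Once decoded, length, substr, regex character classes and so on work per character rather than per byte.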

I presume this is a difficult design issue for the HTML parsing modules (or maybe it's something that would be fairly easy to fix), but either way it certainly is a problem.
