in reply to HTML::Tree problems with UTF-8 Content.
If you pass the parser a file handle as input, do not open that file handle with ":utf8" or any other encoding layer that would cause the data to be converted to utf8 on input (via PerlIO). If you pass it a scalar string, make sure the string does not have the utf8 flag set, i.e. that it holds raw bytes rather than decoded characters. (See the Encode man page about the utf8 flag.)
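A minimal sketch of preparing the input that way, using only core Perl. The helper names open_raw and as_bytes are made up for illustration and are not part of HTML::Tree; the sketch also assumes that any string that does carry the utf8 flag should be handed to the parser as UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: open a file with no decoding layer, so that
# reads return raw bytes rather than decoded characters.
sub open_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    return $fh;
}

# Hypothetical helper: return a byte string without the utf8 flag.
# A flagged (character) string is encoded to UTF-8 bytes; an
# unflagged string is passed through unchanged.
sub as_bytes {
    my ($str) = @_;
    return utf8::is_utf8($str) ? encode('UTF-8', $str) : $str;
}
```

The handle from open_raw, or the result of as_bytes, is what you would then hand to the parser.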
After the parsing is done, use Encode::decode() on the various pieces of text content if you need to treat them as characters rather than bytes (length, regex matching, case conversion and so on).
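Here is a rough sketch of that parse-then-decode sequence with HTML::TreeBuilder. The sample markup is invented for illustration and deliberately contains no character entities; it also assumes the input really is UTF-8. Newer HTML::Parser versions may emit a warning about undecoded UTF-8 when fed bytes like this.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::TreeBuilder;

# Raw UTF-8 bytes (no utf8 flag): "café" with é as \xC3\xA9.
my $html_bytes = "<html><body><p>caf\xC3\xA9</p></body></html>";

my $tree = HTML::TreeBuilder->new;
$tree->parse($html_bytes);
$tree->eof;

# Text comes back from the tree as bytes, same as it went in.
my ($p)  = $tree->look_down(_tag => 'p');
my $raw  = $p->as_text;

# Decode only now, when character semantics are needed.
my $text = decode('UTF-8', $raw);

$tree->delete;
```

After the decode, length($text) counts characters, while length($raw) counts bytes.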
I presume this is a difficult design issue for the HTML parsing modules -- or maybe it's something that would be fairly easy to fix -- but it certainly is a problem.