in reply to Re: Writing unicode characters to file using open($fh, ">:utf8, $name) mangles unicode?

The only sane way to deal with unicode IO is to keep everything correctly flagged as being either in the "internal multibyte encoding" or binary/8-bit, use IO layers for input/output, use Encode::decode() to interpret binary strings directly if you have to, and never, ever use Encode::encode():

I have to download arbitrary web pages and process them as Unicode with HTML::Tree, though. In that case, how can I avoid using Encode::encode()? I need to pass the data on to the parser:
    $fsuccess = ($response = $ua->get($url))->is_success;
    die "Could not fetch URI '" . $url . "'\n" unless $fsuccess;
    $decoded = $response->decoded_content;
    die "Could not decode content" unless $decoded;
    $utf8 = Encode::encode_utf8($decoded);
    $tree = HTML::TreeBuilder->new();
    $tree->utf8_mode(1);
    die "Parse error" unless $tree->parse($utf8);
    $tree->eof();

... goes off to read PerlIO...

Thanks for replying!

Re^3: Writing unicode characters to file using open($fh, ">:utf8, $name) mangles unicode?
by Joost (Canon) on Aug 08, 2007 at 20:12 UTC
    encode_utf8 does not do what you think it does. You need to decode() to the internal text format. That also means you still need to know what the original encoding is and Encode::decode() needs to support that format.

    update: very compactly:

    Encode::encode() etc translate text strings in perl into binary strings in some external encoding.

    Encode::decode() and friends translate binary strings from some encoding into perl text strings.
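
To make the two directions concrete, here is a minimal sketch using only the core Encode module. The sample bytes and the variable names are made up for illustration; they are not from the code posted above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# A binary string: the two UTF-8 bytes that represent U+00E9.
my $bytes = "\xC3\xA9";

# decode(): bytes in a known encoding -> Perl text string (characters).
my $text = decode('UTF-8', $bytes);
printf "bytes: %d, characters: %d\n", length($bytes), length($text);

# encode(): Perl text string -> bytes in a chosen external encoding.
my $latin1 = encode('ISO-8859-1', $text);
my $utf8   = encode('UTF-8', $text);
printf "latin1 bytes: %d, utf8 bytes: %d\n", length($latin1), length($utf8);
```

The same character ends up as one byte in ISO-8859-1 and two bytes in UTF-8, which is why the result of encode() should be treated as binary data rather than text.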

      You need to decode() to the internal text format.
      I realize I should have commented on the code. That's what HTTP::Response->decoded_content() does. It determines the character set from the Content-Type header (though I think it does not take <meta http-equiv="content-type" content="text/html; charset=XXX"> into account, but that's another story). So I decode it to the internal text format and then convert that into UTF-8, because that's what HTML::Parser->parse() expects in utf8_mode(1), unless I still misunderstand the whole story :-(

      That also means you still need to know what the original encoding is and Encode::decode() needs to support that format.
      Yes, this is why I die() if $decoded is undef. HTTP::Response->decoded_content() is a wrapper around Encode::decode().
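
The "known and supported encoding" requirement can be checked with Encode alone. The following is only a rough sketch: decode_or_undef is a hypothetical helper invented for this example, not part of Encode or HTTP::Response, and it is not meant to describe how decoded_content() is implemented internally.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: decode $bytes using the charset name $charset.
# Returns undef if Encode does not know the charset or the bytes are malformed.
sub decode_or_undef {
    my ($charset, $bytes) = @_;

    # find_encoding() returns undef for names Encode does not support.
    my $enc = find_encoding($charset) or return undef;

    # FB_CROAK makes decode() die on malformed input instead of
    # silently substituting replacement characters; eval traps the die.
    my $text = eval { $enc->decode($bytes, Encode::FB_CROAK) };
    return $text;
}

for my $case (['ISO-8859-1', "\xE9"], ['no-such-charset', "abc"], ['UTF-8', "\xFF"]) {
    my ($charset, $bytes) = @$case;
    my $text = decode_or_undef($charset, $bytes);
    print "$charset: ", (defined $text ? "decoded" : "could not decode"), "\n";
}
```

Testing with defined() rather than plain truth avoids treating an empty or "0" page body as a decoding failure.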

      Thanks for the reply!!

      -- tel