in reply to Writing unicode characters to file using open($fh, ">:utf8, $name) mangles unicode?

Here's a simple tip: if you want to deal with text instead of encoding issues (and you almost always do), use Perl's IO layers for input and output.
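A minimal sketch of what that looks like in practice (the file names are placeholders; the sample word is the 'Österreich' used below, written here with byte escapes so the snippet doesn't depend on how it is saved):

```perl
use strict;
use warnings;

# Set up a sample Latin-1 file (0xD6 is 'Ö' in ISO-8859-1).
open my $raw, '>:raw', 'input.txt' or die "open: $!";
print $raw "\xd6sterreich";
close $raw;

# Read it through a decoding layer: bytes in, Perl text string out.
open my $in, '<:encoding(iso-8859-1)', 'input.txt' or die "open: $!";
my $text = do { local $/; <$in> };
close $in;

# Write it back through an encoding layer: Perl text in, UTF-8 bytes out.
open my $out, '>:encoding(UTF-8)', 'output.txt' or die "open: $!";
print $out $text;
close $out or die "close: $!";
```

Your code never touches raw bytes; the layers do the decoding and encoding at the file handle boundary.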

Note that my $iso_8859_1 = 'Österreich'; is only guaranteed to be ISO-8859-1 encoded if you know that the source file is saved as ISO-8859-1 (rather than UTF-8) and you have not switched on "use utf8" somewhere. That can cause all kinds of interesting issues.

Also note that this is exactly the kind of thing you do NOT want to have to deal with. I'm tempted to say: just make a habit of use()ing utf8 and switch all your scripts to UTF-8 encoding, or stick to 7-bit ASCII in source files.
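To illustrate why the source encoding matters (a sketch, spelled out with byte escapes so it doesn't depend on how this snippet itself is saved):

```perl
use strict;
use warnings;

# In a source file saved as UTF-8 *without* "use utf8", the literal
# 'Österreich' arrives as 11 raw bytes (0xC3 0x96 followed by
# "sterreich"), not as 10 characters of text.  With "use utf8" the same
# literal is decoded into a 10-character text string whose first
# character is U+00D6.
my $without_use_utf8 = "\xc3\x96sterreich";   # 11 bytes
my $with_use_utf8    = "\x{d6}sterreich";     # 10 characters

print length($without_use_utf8), "\n";   # 11
print length($with_use_utf8), "\n";      # 10
```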

The only sane way to deal with unicode IO is to keep everything correctly flagged as being either in the "internal multibyte encoding" or binary/8-bit, use IO layers for input/output, use Encode::decode() to interpret binary strings directly if you have to, and never, ever use Encode::encode():

my $string = Encode::decode("iso-8859-1", "\x{d6}sterreich");

# we want to write a utf-8 file
open my $fh, ">:utf8", "/some/path" or die $!;
print $fh $string;
close $fh or die $!;

Re^2: Writing unicode characters to file using open($fh, ">:utf8, $name) mangles unicode?
by telcontar (Beadle) on Aug 08, 2007 at 20:00 UTC
    The only sane way to deal with unicode IO is to keep everything correctly flagged as being either in the "internal multibyte encoding" or binary/8-bit, use IO layers for input/output, use Encode::decode() to interpret binary strings directly if you have to, and never, ever use Encode::encode():

    I have to download unspecified web pages and process them as Unicode with HTML::Tree, though. In this case, how can I avoid using Encode::encode()? I need to pass the data on to the parser:
    $fsuccess = ($response = $ua->get($url))->is_success;
    die "Could not fetch URI '" . $url . "'\n" unless $fsuccess;
    $decoded = $response->decoded_content;
    die "Could not decode content" unless $decoded;
    $utf8 = Encode::encode_utf8($decoded);
    $tree = HTML::TreeBuilder->new();
    $tree->utf8_mode(1);
    die "Parse error" unless $tree->parse($utf8);
    $tree->eof();

    ... goes off to read PerlIO...

    Thanks for replying!
      encode_utf8 does not do what you think it does. You need to decode() to the internal text format. That also means you still need to know what the original encoding is and Encode::decode() needs to support that format.

      update: very compactly:

      Encode::encode() etc translate text strings in perl into binary strings in some external encoding.

      Encode::decode() and friends translate binary strings from some encoding into perl text strings.
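Put concretely (a sketch; the two byte strings are the Latin-1 and UTF-8 encodings of the same word):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# decode(): binary string in some known encoding -> Perl text string.
my $text = decode('iso-8859-1', "\xd6sterreich");
print length($text), "\n";    # 10 characters, the first being U+00D6

# encode(): Perl text string -> binary string in the target encoding.
my $bytes = encode('UTF-8', $text);
print length($bytes), "\n";   # 11 bytes, 'Ö' having become 0xC3 0x96
```

Decode on the way in, work with text throughout, encode (or let an IO layer encode) only at the very edge on the way out.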

        You need to decode() to the internal text format.
        I realize I should have commented on the code. That's what HTTP::Response->decoded_content() does: it determines the character set from the Content-Type header (though I think it does not take <meta http-equiv="content-type" content="text/html; charset=XXX"> into account, but that's another story). So I decode it to the internal text format and then convert that into UTF-8, because that's what HTML::Parser->parse() expects in utf8_mode(1), unless I still misunderstand the whole story :-(

        That also means you still need to know what the original encoding is and Encode::decode() needs to support that format.
        Yes, this is why I die() if $decoded is undef. HTTP::Response->decoded_content() is a wrapper around Encode::decode().
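        For what it's worth, one way to sidestep the encode step entirely: since decoded_content() already returns a Perl text string, you can feed that to the parser as-is, with no encode_utf8() and no utf8_mode(). A sketch (using a canned HTTP::Response in place of the real $ua->get($url), so the example is self-contained):

```perl
use strict;
use warnings;
use HTTP::Response;
use HTML::TreeBuilder;

# A canned response standing in for $ua->get($url); the body is the
# ISO-8859-1 encoding of 'Österreich'.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=iso-8859-1' ],
    "<html><body><p>\xd6sterreich</p></body></html>",
);

# decoded_content() returns a Perl text string...
my $decoded = $response->decoded_content;
die "Could not decode content" unless defined $decoded;

# ...which the parser takes directly: no encode_utf8(), no utf8_mode().
my $tree = HTML::TreeBuilder->new;
$tree->parse($decoded);
$tree->eof;
```

        This keeps everything on the text side until output time, which is the discipline the parent node recommends.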

        Thanks for the reply!!

        -- tel