telcontar has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I am trying to write unicode, specifically utf-8, to a file; as the data exists in iso-8859-1 (or another character set) it must be converted first.
I then write a UTF-8 string to a file. After reading the docs I thought I had to open the file for writing with open($fh, '>:utf8', $filename), but when I do this and look at the file in any Unicode-capable editor I see garbage. If I write the file normally, using open($fh, '>', $filename), all seems well. This contradicts perluniintro, which clearly states that one should use the former open() method, or even the "use open ':utf8'" pragma, when dealing with files, so I am sure I must be doing something wrong.

The following code is meant to illustrate my problem. The files '_original' and '_decoded' are the same and I do not find this surprising. The file '_utf8' does not display the characters correctly, unless I change the code to write_to_file('>', '_utf8', $utf8);.

use Encode;

sub write_to_file {
    my ($mode, $filename, $what) = @_;
    # $! (not $@) holds the reason open() failed
    open(my $fh, $mode, $filename)
        or die "Couldn't open $filename for writing: $!";
    print $fh $what;
    close $fh;
}

my $iso_8859_1 = 'Österreich';
my $string     = Encode::decode('iso-8859-1', 'Österreich');
my $utf8       = Encode::encode_utf8($string);

write_to_file('>',      '_original', $iso_8859_1);
write_to_file('>',      '_decoded',  $string);
write_to_file('>:utf8', '_utf8',     $utf8);

I would appreciate any wisdom you could shed on the matter.

-- tel

Re: Writing unicode characters to file using open($fh, ">:utf8", $name) mangles unicode?
by Joost (Canon) on Aug 08, 2007 at 19:05 UTC
    Here's a simple tip: if you want to be dealing with text instead of encoding issues etc (and you almost always want to) use perl's IO layers to deal with input and output.

    Note that my $iso_8859_1 = 'Österreich'; is only guaranteed to be iso-8859-1 encoded if you know that the source file itself is saved as iso-8859-1 (instead of utf-8) and you've not switched on "use utf8" somewhere. That can cause all kinds of interesting issues.

    Also note that this is exactly the kind of thing you do NOT want to have to deal with. I'm tempted to say: just make a habit of use()ing utf8 and switch all your scripts to utf-8 encoding, or only use 7-bit ASCII in source files.

    The only sane way to deal with unicode IO is to keep everything correctly flagged as being either in the "internal multibyte encoding" or binary/8-bit, use IO layers for input/output, use Encode::decode() to interpret binary strings directly if you have to, and never, ever use Encode::encode():

    my $string = Encode::decode("iso-8859-1", "\x{d6}sterreich");

    # we want to write a utf-8 file
    open my $fh, ">:utf8", "/some/path" or die $!;
    print $fh $string;
    close $fh or die $!;

      The only sane way to deal with unicode IO is to keep everything correctly flagged as being either in the "internal multibyte encoding" or binary/8-bit, use IO layers for input/output, use Encode::decode() to interpret binary strings directly if you have to, and never, ever use Encode::encode():

      I have to download unspecified web pages and process them as unicode with HTML::Tree though. In this case, how can I avoid using Encode::encode()? I need to pass the data on to the parser:

      $fsuccess = ($response = $ua->get($url))->is_success;
      die "Could not fetch URI '" . $url . "'\n" unless $fsuccess;

      $decoded = $response->decoded_content;
      die "Could not decode content" unless $decoded;

      $utf8 = Encode::encode_utf8($decoded);
      $tree = HTML::TreeBuilder->new();
      $tree->utf8_mode(1);
      die "Parse error" unless $tree->parse($utf8);
      $tree->eof();

      ... goes off to read PerlIO...

      Thanks for replying!
        encode_utf8 does not do what you think it does. You need to decode() to the internal text format. That also means you still need to know what the original encoding is and Encode::decode() needs to support that format.

        update: very compactly:

        Encode::encode() etc translate text strings in perl into binary strings in some external encoding.

        Encode::decode() and friends translate binary strings from some encoding into perl text strings.
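
        The two directions described above can be seen in a round trip. This is only a minimal sketch using the core Encode module; the variable names are made up for illustration, and the Latin-1 input is written as an escape so the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw ISO-8859-1 octets for "Österreich" (0xD6 is Ö in Latin-1).
my $latin1_bytes = "\xD6sterreich";

# decode(): octets in a known encoding -> Perl text string (characters).
my $text = decode('iso-8859-1', $latin1_bytes);

# encode(): Perl text string -> octets in the target encoding.
my $utf8_bytes = encode('UTF-8', $text);

# The text has 10 characters; the UTF-8 form needs 11 octets,
# because Ö (U+00D6) becomes the two octets C3 96.
printf "characters in text: %d\n", length $text;
printf "octets in UTF-8:    %d\n", length $utf8_bytes;
printf "UTF-8 octets (hex): %s\n", join ' ', unpack('(H2)*', $utf8_bytes);
```

        The length difference (10 versus 11) is a quick way to tell which kind of string a variable currently holds.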

Re: Writing unicode characters to file using open($fh, ">:utf8", $name) mangles unicode?
by ikegami (Patriarch) on Aug 08, 2007 at 16:21 UTC

    First of all, writing a string of characters to a file without first encoding makes assumptions about Perl's internal format and can earn you some warnings. That means

    write_to_file('>', '_decoded', $string);

    is wrong. There are two ways of encoding a string.

    write_to_file('>',      '_explicit_utf8', encode_utf8($string));
    write_to_file('>:utf8', '_implicit_utf8', $string);

    The problem you are having is that you are encoding it using encode_utf8 and then again using :utf8.
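
    The double encoding can be made visible without touching the disk. The following is a sketch, not the code from the question: octets_written() is a hypothetical helper that prints into an in-memory file so the resulting octets can be inspected, and the input is given as an escape rather than a literal Ö.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Hypothetical helper: print $what through $mode into an in-memory file
# and return the octets that would have landed on disk.
sub octets_written {
    my ($mode, $what) = @_;
    my $buffer = '';
    open(my $fh, $mode, \$buffer)
        or die "Couldn't open in-memory file: $!";
    print {$fh} $what;
    close($fh) or die "Couldn't close in-memory file: $!";
    return $buffer;
}

my $text  = decode('iso-8859-1', "\xD6sterreich");  # text string, 10 characters
my $bytes = encode_utf8($text);                     # UTF-8 octets, 11 of them

my $explicit = octets_written('>',      $bytes);  # encoded by hand, plain handle
my $implicit = octets_written('>:utf8', $text);   # encoded by the IO layer
my $doubled  = octets_written('>:utf8', $bytes);  # encoded by hand AND by the layer

for my $case ([explicit => $explicit], [implicit => $implicit], [doubled => $doubled]) {
    my ($label, $octets) = @$case;
    printf "%-8s %2d octets: %s\n", $label, length $octets,
        join ' ', unpack('(H2)*', $octets);
}
```

    The first two results are identical (11 octets, starting C3 96). In the third, the layer treats each of the octets C3 and 96 as a character of its own and encodes it again, giving C3 83 C2 96 and 13 octets in total, which is the garbage seen in the editor.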

      Thank you, that makes perfect sense. It is supposed to be transparent and here I was doing it twice :-)

      But what if I download a web page, say LWP::UserAgent->get($url), and save it in a file in its native encoding, and this is not listed in Encode->encodings(':all') - am I stuck? :-)

      -- tel

        I'm not sure what you are asking.

        If you want to save the document in its original encoding: open(my $fh, '>', $filename); doesn't do any encoding. If you don't do any decoding, print $fh $raw; will save the content in its native encoding.

        If you want to save the document in UTF-8:

        Yeah, you're screwed. If Encode "doesn't speak the language", you won't be able to decode the content, so you're left with a bunch of meaningless octets.
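
        One way to handle both cases in a script is to ask Encode whether it knows the charset before trying to convert. This is only a sketch: to_utf8_or_raw() is a hypothetical helper, and it assumes you already know the charset name and have the undecoded octets in hand (how you obtain them is up to you).

```perl
use strict;
use warnings;
use Encode qw(find_encoding encode);

# Hypothetical helper. Returns (octets, converted_flag):
#   - UTF-8 octets and 1 if Encode knows $charset,
#   - the untouched raw octets and 0 if it does not.
sub to_utf8_or_raw {
    my ($charset, $raw) = @_;

    # find_encoding() returns an encoding object, or undef for unknown names.
    my $enc = find_encoding($charset);
    return ($raw, 0) unless $enc;

    my $text = $enc->decode($raw);           # octets -> text string
    return (encode('UTF-8', $text), 1);      # text string -> UTF-8 octets
}

my ($known_out,   $known_ok)   = to_utf8_or_raw('iso-8859-1',        "\xD6");
my ($unknown_out, $unknown_ok) = to_utf8_or_raw('x-no-such-charset', "\xD6");

printf "known charset:   converted=%d octets=%s\n",
    $known_ok,   join ' ', unpack('(H2)*', $known_out);
printf "unknown charset: converted=%d octets=%s\n",
    $unknown_ok, join ' ', unpack('(H2)*', $unknown_out);
```

        Either way the result is a string of octets, so it should be written through a plain '>' handle, not through '>:utf8'.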