in reply to Unicode problem

This is a peculiarity/bug which occurs when the Windows-specific crlf PerlIO layer is being used in combination with a multibyte encoding like UTF-16.  See this node for a more detailed explanation and a workaround (the node talks about UCS-2, but everything equally holds for UTF-16, so you can just substitute the latter).

In short, the solution is to specify the layers as

:raw:encoding(UTF-16LE):crlf:utf8

(Hopefully a fix will land in an upcoming release that makes this hack unnecessary.)
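A minimal round-trip sketch of that layer stack, in case it helps to see it used for both writing and reading. The file name `roundtrip.utf16` and the variable names are made up for illustration; only the layer string comes from the workaround above. It assumes the explicitly pushed `:crlf` layer is honoured on your platform, which is not limited to Windows.

```perl
use strict;
use warnings;

# Layer stack from the workaround above; the file name is a made-up example.
my $layers = ':raw:encoding(UTF-16LE):crlf:utf8';
my $file   = 'roundtrip.utf16';

# Write one line containing a non-ASCII character (U+00E4).
open my $out, ">$layers", $file or die "Cannot write $file: $!";
print {$out} "h\x{00e4}llo\n";
close $out or die "Cannot close $file: $!";

# Read the line back through the same stack: the encoding layer decodes
# UTF-16LE, and the crlf layer turns CR LF back into "\n".
open my $in, "<$layers", $file or die "Cannot read $file: $!";
my $line = <$in>;
close $in;

print $line eq "h\x{00e4}llo\n" ? "round trip ok\n" : "round trip FAILED\n";
```

The same string is used for `>` and `<`, so the newline translation and the encoding stay symmetric between writer and reader.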

Re^2: Unicode problem
by graff (Chancellor) on Aug 21, 2007 at 03:49 UTC
    As I confess in my other reply, I don't have a Windows box, so I don't know... Maybe the initial ":raw" is needed to defeat the intrinsic ":crlf" layer that is always imposed first by default on that OS.

    But I'd be really surprised if there was any real need or impact of final ":utf8" -- I think you can dispense with that. (It certainly looks nonsensical having it there.)

    In any case, an attentive reading of the PerlIO manual would be good medicine.

    update: Thanks for the following reply, almut. It seems I shouldn't have been so surprised after all!

      I'd be really surprised if there was any real need or impact of final ":utf8" -- I think you can dispense with that.

      The reason you need the final :utf8 is that the crlf layer is kinda turning off the UTF8-ness (or however you want to call it...). In other words, if you have a string containing non-ASCII characters (which was the reason for inventing Unicode in the first place, wasn't it :), you'd get nonsense, because the utf8 flag will either be ignored (on output), or not be set (on input). Of course, if you're only outputting an ASCII-only string like "hello", you won't see a difference...

      For example, when replacing the "e" in "hello" with an "ä" (a-umlaut, U+00E4), you'd get correct output with

      open my $fh, ">:raw:encoding(UTF-16LE):crlf:utf8", "ok.utf16" or die;
      print $fh "h\x{00e4}llo\n";

      $ od -tx1 -An ok.utf16
       68 00 e4 00 6c 00 6c 00 6f 00 0d 00 0a 00

      but not with

      open my $fh, ">:raw:encoding(UTF-16LE):crlf", "err.utf16" or die;
      print $fh "h\x{00e4}llo\n";

      $ od -tx1 -An err.utf16
       68 00 00 00 6c 00 6c 00 6f 00 0d 00 0a 00
             ^^ wrong

      accompanied by this warning when running the code:

      Malformed UTF-8 character (unexpected non-continuation byte 0x6c, immediately after start byte 0xe4) in null operation at ...
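For anyone without `od` at hand (e.g. on Windows), the comparison can be sketched in Perl itself. `dump_with_layers` and the scratch file name are hypothetical names introduced here, not part of any module. The helper writes a string through a given layer stack, then reports the resulting bytes as hex and whether any warning was raised. What the second stack produces may well depend on the Perl version, so treat this as a way to check your own installation rather than a statement of what you will see.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text through $layers, then return the bytes
# on disk as a hex string plus the number of warnings raised while writing.
sub dump_with_layers {
    my ($layers, $text) = @_;
    my $file = 'layer_test.tmp';    # made-up scratch file name
    my @warnings;
    {
        # Collect warnings (such as "Malformed UTF-8 character") instead
        # of letting them go to STDERR.
        local $SIG{__WARN__} = sub { push @warnings, @_ };
        open my $fh, ">$layers", $file or die "Cannot write $file: $!";
        print {$fh} $text;
        close $fh;
    }
    # Re-read the file with no translation at all.
    open my $raw, '<:raw', $file or die "Cannot read $file: $!";
    my $bytes = do { local $/; <$raw> };
    close $raw;
    unlink $file;
    return (join(' ', unpack '(H2)*', $bytes), scalar @warnings);
}

# Compare the stack with and without the trailing :utf8.
for my $layers (':raw:encoding(UTF-16LE):crlf:utf8', ':raw:encoding(UTF-16LE):crlf') {
    my ($hex, $warned) = dump_with_layers($layers, "h\x{00e4}llo\n");
    printf "%-36s %s%s\n", $layers, $hex, $warned ? '  (with warnings)' : '';
}
```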