adrodin has asked for the wisdom of the Perl Monks concerning the following question:

I'm new to Perl's unicode handling, and am bemused by this:
open(my $MASTER,'<:encoding(UTF-16LE)', 'master.cfg');
open(my $CONFIG,'>:encoding(UTF-16LE)', 'new.cfg');
while ($line=<$MASTER>) {
    print $CONFIG $line;
}

I'm running ActivePerl 5.8.7 on Windows. new.cfg is identical to master.cfg, except that each sequence 0D 00 0A 00 (cr/lf) in the input file is being turned into 0D 00 0D 0A 00 in the output file, i.e. an extra 0D is getting inserted. Any suggestions as to why, and how I can avoid this? Something obvious, I'm sure.

Re: Unicode problem
by clinton (Priest) on Aug 20, 2007 at 15:12 UTC
    Firstly, your sequence 0D 00 0A 00 is not a UTF-16LE encoded CR/LF. That would be 00 0D 00 0A.

    Secondly, when I try your code with a correctly UTF-16LE encoded master.cfg file, I get exactly the same data output to new.cfg. (I tried it on Linux with perl 5.8.8).

    I would suggest checking that your input file IS correctly encoded, for instance, this UTF-16LE file:

    test\r\n test2\r\n
    contains these hex characters:
    0074 0065 0073 0074 000d 000a 0074 0065 0073 0074 0032 000d 000a

    Clint
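A minimal sketch for checking byte-level claims like the ones above, assuming the core Encode module is available. The helper name hex_bytes is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode $text with the named encoding and return the resulting bytes
# as space-separated two-digit hex values.
sub hex_bytes {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my $le = hex_bytes('UTF-16LE', "\r\n");
my $be = hex_bytes('UTF-16BE', "\r\n");

print "UTF-16LE: $le\n";   # prints "UTF-16LE: 0d 00 0a 00"
print "UTF-16BE: $be\n";   # prints "UTF-16BE: 00 0d 00 0a"
```

Encode's UTF-16LE writes the low byte of each 16-bit unit first, so CR/LF comes out as 0d 00 0a 00. The order 00 0d 00 0a is what UTF-16BE produces.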

Re: Unicode problem
by Errto (Vicar) on Aug 20, 2007 at 18:14 UTC
    The issue may have to do with the way newlines are handled in Windows. First see what happens if you read and write the entire file at once rather than line by line:
    undef $/; my $content = <$MASTER>; print $CONFIG $content;
    If that works fine, my theory is probably right. If you do still want to use line-by-line transfer, I believe you can add :unix to the layers on your open calls.
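The whole-file test suggested above could be sketched roughly as follows. This is only an illustration: it builds its own small sample file in a temporary directory instead of touching a real master.cfg, uses local $/ rather than a global undef so that slurp mode ends with the block, and the helper read_raw is a made-up name.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir    = tempdir(CLEANUP => 1);
my $master = "$dir/master.cfg";
my $copy   = "$dir/new.cfg";

# Build a tiny UTF-16LE sample ("a", CR, LF, "b", CR, LF) byte for byte.
open(my $seed, '>:raw', $master) or die "Cannot write $master: $!";
print {$seed} "a\0\r\0\n\0b\0\r\0\n\0";
close($seed) or die "Cannot close $master: $!";

# Return the untranslated bytes of a file.
sub read_raw {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot read $path: $!";
    local $/;
    my $bytes = <$fh>;
    close($fh);
    return $bytes;
}

open(my $MASTER, '<:encoding(UTF-16LE)', $master) or die "Cannot open $master: $!";
open(my $CONFIG, '>:encoding(UTF-16LE)', $copy)   or die "Cannot open $copy: $!";
{
    local $/;                     # slurp mode, restored when the block ends
    my $content = <$MASTER>;      # whole file in one read
    print {$CONFIG} $content;
}
close($MASTER);
close($CONFIG) or die "Cannot close $copy: $!";

# Compare the two files at the byte level.
my $identical = read_raw($master) eq read_raw($copy);
print $identical ? "identical\n" : "different\n";
```

Comparing raw bytes rather than decoded text is deliberate, since the symptom in the question is only visible at the byte level.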
Re: Unicode problem
by almut (Canon) on Aug 21, 2007 at 03:23 UTC

    This is a peculiarity/bug which occurs when the Windows-specific crlf PerlIO layer is being used in combination with a multibyte encoding like UTF-16.  See this node for a more detailed explanation and a workaround (the node talks about UCS-2, but everything equally holds for UTF-16, so you can just substitute the latter).

    In short, the solution is to specify the layers as

    :raw:encoding(UTF-16LE):crlf:utf8

    (hopefully, some solution will be found for an upcoming release that makes this hack unnecessary)

      As I confess in my other reply, I don't have a windows box, so I don't know... Maybe the initial ":raw" is needed to defeat the intrinsic ":crlf" layer that is always imposed first by default on that OS.

      But I'd be really surprised if there was any real need or impact of final ":utf8" -- I think you can dispense with that. (It certainly looks nonsensical having it there.)

      In any case, an attentive reading of the PerlIO manual would be good medicine.

      update: Thanks for the following reply, almut. It seems I shouldn't have been so surprised after all!

        I'd be really surprised if there was any real need or impact of final ":utf8" -- I think you can dispense with that.

        The reason you need the final :utf8 is that the crlf layer is kinda turning off the UTF8-ness (or however you want to call it...). In other words, if you have a string containing non-ASCII characters (which was the reason for inventing Unicode in the first place, wasn't it :), you'd get nonsense, because the utf8 flag will either be ignored (on output), or not be set (on input). Of course, if you're only outputting an ASCII-only string like "hello", you won't see a difference...

        For example, when replacing the "e" in "hello" with an "ä" (a-umlaut, U+00E4), you'd get correct output with

        open my $fh, ">:raw:encoding(UTF-16LE):crlf:utf8", "ok.utf16" or die;
        print $fh "h\x{00e4}llo\n";

        $ od -tx1 -An ok.utf16
        68 00 e4 00 6c 00 6c 00 6f 00 0d 00 0a 00

        but not with

        open my $fh, ">:raw:encoding(UTF-16LE):crlf", "err.utf16" or die;
        print $fh "h\x{00e4}llo\n";

        $ od -tx1 -An err.utf16
        68 00 00 00 6c 00 6c 00 6f 00 0d 00 0a 00
              ^^ wrong

        accompanied by the warning when running the code:

        Malformed UTF-8 character (unexpected non-continuation byte 0x6c, immediately after start byte 0xe4) in null operation at ...
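A round trip through the full layer stack discussed above might look like the sketch below. It assumes a reasonably recent Perl and writes to a temporary directory. The variable names are illustrative only, and reading the file back through the same stack is an extra step that the replies above do not show.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir    = tempdir(CLEANUP => 1);
my $path   = "$dir/ok.utf16";
my $layers = ':raw:encoding(UTF-16LE):crlf:utf8';
my $text   = "h\x{00e4}llo\n";

# Write one line containing a non-ASCII character through the stack.
open(my $out, ">$layers", $path) or die "Cannot write $path: $!";
print {$out} $text;
close($out) or die "Cannot close $path: $!";

# Look at the bytes that actually reached the file.
open(my $raw, '<:raw', $path) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$raw> };
close($raw);
my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;

# Read the file back through the same stack.
open(my $in, "<$layers", $path) or die "Cannot read $path: $!";
my $round_trip = do { local $/; <$in> };
close($in);

print "$hex\n";
print $round_trip eq $text ? "round trip ok\n" : "round trip differs\n";
```

Keeping the layer string in one variable makes it harder for the read side and the write side to drift apart.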
Re: Unicode problem
by graff (Chancellor) on Aug 21, 2007 at 03:34 UTC
    I suspect you are using a windows system, and Errto is on the right track: you need to take proper control of the "native crlf" behavior for that OS. Note that ordering of the PerlIO layers can be significant. I don't have a windows system to test on myself, but my bsd-based macosx shows the following behaviors with the various permutations - YMMV, but I think you'll see something like this:
    $ perl -e 'open($fh,">:crlf:encoding(UTF-16BE)", "test.utf16"); print $fh "hello\n"'
    $ xxd test.utf16
    0000000: 0068 0065 006c 006c 006f 000d 0a         .h.e.l.l.o...
    # that was bad -- odd number of bytes

    $ perl -e 'open($fh,">:encoding(UTF-16BE):crlf", "test.utf16"); print $fh "hello\n"'
    $ xxd test.utf16
    0000000: 0068 0065 006c 006c 006f 000d 000a       .h.e.l.l.o....
    # that was good.

    $ perl -e 'open($fh,">:raw:encoding(UTF-16BE)", "test.utf16"); print $fh "hello\n"'
    $ xxd test.utf16
    0000000: 0068 0065 006c 006c 006f 000a            .h.e.l.l.o..
    # also good (no CR, but who needs that anyway? ;)

    $ perl -e 'open($fh,">:encoding(UTF-16BE):raw", "test.utf16"); print $fh "hello\n"'
    $ xxd test.utf16
    0000000: 6865 6c6c 6f0a                           hello.
    # not what you want
    Note that ":raw" is sort of a synonym for "unix" in this context (that is, the last two examples behave the same when using "unix" instead of "raw").

    So, if you want a "standard" CRLF discipline for output, interacting correctly with UTF-16, there's only the one way to do that, it seems; OTOH, if you want unix-like LF discipline with UTF-16, there's a couple ways to get that (that is, you can say "unix" or "raw", but you still have to get the layers in the right order).

    update: As almut wisely explains above, my examples are deficient -- each of the "working" cases should have the additional ":utf8" layer at the end, so that actual "wide characters" in the strings being printed will be interpreted and encoded correctly on output. Apologies for the confusion.
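To compare layer orderings without shelling out to xxd, something like the following sketch could be used. The helper name bytes_written_with is hypothetical, the trailing :utf8 follows the update above, and what the crlf-first stack produces may depend on platform and Perl version, so the sketch prints it without assuming a result.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Write "hello\n" through the given layer string and return the bytes
# that ended up in the file as space-separated hex values.
sub bytes_written_with {
    my ($layers) = @_;
    my $path = "$dir/test.utf16";

    open(my $fh, ">$layers", $path) or die "Cannot write $path: $!";
    print {$fh} "hello\n";
    close($fh) or die "Cannot close $path: $!";

    open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
    local $/;
    my $bytes = <$in>;
    close($in);

    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my @stacks = (
    ':crlf:encoding(UTF-16BE)',        # crlf below the encoding layer
    ':encoding(UTF-16BE):crlf:utf8',   # crlf above the encoding layer
    ':raw:encoding(UTF-16BE):utf8',    # no CR/LF translation at all
);

my %hex_for;
for my $stack (@stacks) {
    $hex_for{$stack} = bytes_written_with($stack);
    printf "%-32s %s\n", $stack, $hex_for{$stack};
}
```

Layers listed later in the string sit closer to the program, so text passes through :crlf before it reaches :encoding only when :crlf is named after the encoding.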