Unicode problem

adrodin has asked for the wisdom of the Perl Monks concerning the following question:

Re: Unicode problem by clinton (Priest) on Aug 20, 2007 at 15:12 UTC
Firstly, a note on byte order: your sequence `0D 00 0A 00` is what a UTF-16LE encoded CR/LF looks like byte by byte; `00 0D 00 0A` would be the UTF-16BE form. Secondly, when I try your code with a correctly UTF-16LE encoded `master.cfg` file, I get exactly the same data output to `new.cfg` (I tried it on Linux with perl 5.8.8). I would suggest checking that your input file IS correctly encoded. For instance, this UTF-16LE file:

`test\r\ntest2\r\n`

contains these 16-bit code units:

`0074 0065 0073 0074 000d 000a 0074 0065 0073 0074 0032 000d 000a`

Clint
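To double-check which byte order a given CR/LF sequence corresponds to, one option is to ask the core Encode module directly. This is only a minimal sketch; the helper name `hex_bytes` is made up for illustration and is not part of the original code.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# CR/LF encoded in both byte orders. No BOM is written when the
# endianness is named explicitly (UTF-16LE / UTF-16BE).
my $crlf = "\r\n";
print "UTF-16LE: ", hex_bytes(encode('UTF-16LE', $crlf)), "\n";
print "UTF-16BE: ", hex_bytes(encode('UTF-16BE', $crlf)), "\n";
```

This prints `0d 00 0a 00` for UTF-16LE and `00 0d 00 0a` for UTF-16BE. Note that some hex dump tools group bytes into 16-bit words in host byte order, so a byte-wise dump such as `od -tx1` is the safer way to inspect a file.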
Re: Unicode problem by almut (Canon) on Aug 21, 2007 at 03:23 UTC
This is a peculiarity/bug which occurs when the Windows-specific `crlf` PerlIO layer is being used in combination with a multibyte encoding like UTF-16. See this node for a more detailed explanation and a workaround (the node talks about UCS-2, but everything equally holds for UTF-16, so you can just substitute the latter). In short, the solution is to specify the layers as

`:raw:encoding(UTF-16LE):crlf:utf8`

(hopefully, some solution will be found for an upcoming release that makes this hack unnecessary)
Re^2: Unicode problem by graff (Chancellor) on Aug 21, 2007 at 03:49 UTC
As I confess in my other reply, I don't have a Windows box, so I don't know... Maybe the initial `:raw` is needed to defeat the intrinsic `:crlf` layer that is always imposed first by default on that OS. But I'd be really surprised if there was any real need or impact of final `:utf8` -- I think you can dispense with that. (It certainly looks nonsensical having it there.) In any case, an attentive reading of the PerlIO manual would be good medicine.

Update: Thanks for the following reply, almut. It seems I shouldn't have been so surprised after all!
Re^3: Unicode problem by almut (Canon) on Aug 21, 2007 at 04:52 UTC
"I'd be really surprised if there was any real need or impact of final `:utf8` -- I think you can dispense with that."

The reason you need the final `:utf8` is that the `crlf` layer is kinda turning off the UTF8-ness (or however you want to call it...). In other words, if you have a string containing non-ASCII characters (which was the reason for inventing Unicode in the first place, wasn't it :), you'd get nonsense, because the utf8 flag will either be ignored (on output), or not be set (on input). Of course, if you're only outputting an ASCII-only string like "hello", you won't see a difference...

For example, when replacing the "e" in "hello" with an "ä" (a-umlaut, U+00E4), you'd get correct output with

`open my $fh, ">:raw:encoding(UTF-16LE):crlf:utf8", "ok.utf16" or die;`
`print $fh "h\x{00e4}llo\n";`

`$ od -tx1 -An ok.utf16`
`68 00 e4 00 6c 00 6c 00 6f 00 0d 00 0a 00`

but not with

`open my $fh, ">:raw:encoding(UTF-16LE):crlf", "err.utf16" or die;`
`print $fh "h\x{00e4}llo\n";`

`$ od -tx1 -An err.utf16`
`68 00 00 00 6c 00 6c 00 6f 00 0d 00 0a 00`

(the third byte is wrong: `00` instead of `e4`), accompanied by this warning when running the code:

`Malformed UTF-8 character (unexpected non-continuation byte 0x6c, immediately after start byte 0xe4) in null operation at ...`
Re: Unicode problem by Errto (Vicar) on Aug 20, 2007 at 18:14 UTC
The issue may have to do with the way newlines are handled in Windows. First see what happens if you read and write the entire file at once rather than line by line:

`undef $/;`
`my $content = <$MASTER>;`
`print $CONFIG $content;`

If that works fine, my theory is probably right. If you do still want to use line-by-line transfer, I believe you can add `:unix` to the layers on your `open` calls.
Re: Unicode problem by graff (Chancellor) on Aug 21, 2007 at 03:34 UTC
I suspect you are using a Windows system, and Errto is on the right track: you need to take proper control of the "native crlf" behavior for that OS. Note that the ordering of the PerlIO layers can be significant. I don't have a Windows system to test on myself, but my BSD-based Mac OS X shows the following behaviors with the various permutations. YMMV, but I think you'll see something like this:

`$ perl -e 'open($fh,">:crlf:encoding(UTF-16BE)", "test.utf16"); print $fh "hello\n"'`
`$ xxd test.utf16`
`0000000: 0068 0065 006c 006c 006f 000d 0a .h.e.l.l.o...`
`# that was bad -- odd number of bytes`

`$ perl -e 'open($fh,">:encoding(UTF-16BE):crlf", "test.utf16"); print $fh "hello\n"'`
`$ xxd test.utf16`
`0000000: 0068 0065 006c 006c 006f 000d 000a .h.e.l.l.o....`
`# that was good.`

`$ perl -e 'open($fh,">:raw:encoding(UTF-16BE)", "test.utf16"); print $fh "hello\n"'`
`$ xxd test.utf16`
`0000000: 0068 0065 006c 006c 006f 000a .h.e.l.l.o..`
`# also good (no CR, but who needs that anyway? ;)`

`$ perl -e 'open($fh,">:encoding(UTF-16BE):raw", "test.utf16"); print $fh "hello\n"'`
`$ xxd test.utf16`
`0000000: 6865 6c6c 6f0a hello.`
`# not what you want`

Note that `:raw` is sort of a synonym for `:unix` in this context (that is, the last two examples behave the same when using `unix` instead of `raw`). So, if you want a "standard" CRLF discipline for output, interacting correctly with UTF-16, there's only the one way to do that, it seems; OTOH, if you want unix-like LF discipline with UTF-16, there are a couple of ways to get that (that is, you can say `unix` or `raw`, but you still have to get the layers in the right order).

Update: As almut wisely explains above, my examples are deficient -- each of the "working" cases should have the additional `:utf8` layer at the end, so that actual "wide characters" in the strings being printed will be interpreted and encoded correctly on output. Apologies for the confusion.
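The same comparison can be scripted so that it does not depend on `xxd` being installed. The following is a rough sketch, not a definitive test harness: the helper name `written_hex` is invented for this example, and the results depend on the platform's default layers and on the perl version in use.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: print $text through the given output layers into a
# temporary file, then return the bytes that landed on disk as hex pairs.
sub written_hex {
    my ($layers, $text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;

    open my $out, ">$layers", $path or die "open $path: $!";
    print {$out} $text;
    close $out or die "close $path: $!";

    # Read the file back without any translation at all.
    open my $in, '<:raw', $path or die "open $path: $!";
    local $/;                       # slurp mode
    my $bytes = <$in>;
    close $in;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# The four layer orderings discussed above.
for my $layers (':crlf:encoding(UTF-16BE)', ':encoding(UTF-16BE):crlf',
                ':raw:encoding(UTF-16BE)',  ':encoding(UTF-16BE):raw') {
    printf "%-28s %s\n", $layers, written_hex($layers, "hello\n");
}
```

On a Unix-like system this should show the same pattern as the dumps above. On Windows the default `:crlf` layer sits underneath everything unless `:raw` comes first, so the second ordering may well behave differently there.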
Re^2: Unicode problem by adrodin (Initiate) on Aug 21, 2007 at 06:33 UTC
`<:raw:encoding(UTF-16LE):crlf:utf8` and `>:raw:encoding(UTF-16LE):crlf:utf8` do the job fine. Thanks, guys. The following page is also rather helpful: http://blogs.msdn.com/brettsh/archive/2006/06/07/620986.aspx
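Put together, a line-by-line copy using these layers might look roughly like the sketch below. The original script is not shown in this thread, so the helper name `copy_utf16le` and its structure are assumptions; only the layer strings and the file names `master.cfg` and `new.cfg` come from the discussion above.

```perl
use strict;
use warnings;

# Hypothetical helper: copy a UTF-16LE text file line by line, using the
# layer stack reported to work in this thread.
sub copy_utf16le {
    my ($src, $dst) = @_;
    my $layers = ':raw:encoding(UTF-16LE):crlf:utf8';

    open my $in,  "<$layers", $src or die "Cannot read $src: $!";
    open my $out, ">$layers", $dst or die "Cannot write $dst: $!";

    while (my $line = <$in>) {
        # $line holds decoded characters and ends in a plain "\n" here;
        # the :crlf layer restores "\r\n" on the way out.
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $dst: $!";
    return;
}

# File names taken from the thread; skipped if the input is absent.
copy_utf16le('master.cfg', 'new.cfg') if -e 'master.cfg';
```

Any per-line editing of the configuration would go inside the loop, where the data is ordinary Perl text rather than raw UTF-16 bytes.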