Firstly, a point on byte order: 0D 00 0A 00 is what CR/LF looks like in UTF-16LE (low byte first). 00 0D 00 0A would be the UTF-16BE form.
Secondly, when I try your code with a correctly UTF-16LE encoded master.cfg file, I get exactly the same data output to new.cfg. (I tried it on Linux with perl 5.8.8).
I would suggest checking that your input file IS correctly encoded. For instance, this UTF-16LE file:
test\r\n
test2\r\n
contains these 16-bit code units, shown in hex (on disk, UTF-16LE stores the low byte of each unit first, e.g. 74 00 for "t"):
0074 0065 0073 0074 000d 000a 0074 0065
0073 0074 0032 000d 000a
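To check which byte order a given dump corresponds to, the same text can be encoded in memory with the core Encode module and printed byte by byte. This is only a sketch; the helper name hex_bytes is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

my $text = "test\r\n";
my $le   = hex_bytes(encode('UTF-16LE', $text));
my $be   = hex_bytes(encode('UTF-16BE', $text));

print "UTF-16LE: $le\n";    # low byte of each code unit comes first
print "UTF-16BE: $be\n";    # high byte of each code unit comes first
```

A byte-wise view such as od -tx1 is the least ambiguous way to compare a real file against this; a word-oriented dump may show each pair of bytes swapped, depending on the tool and the machine.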
Clint
The issue may have to do with the way newlines are handled in Windows. First see what happens if you read and write the entire file at once rather than line by line:
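undef $/;                  # no record separator: the next read returns the whole file
my $content = <$MASTER>;   # slurp everything in one go
print $CONFIG $content;    # write it back out unchanged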
If that works fine, my theory is probably right. If you do still want to use line-by-line transfer, I believe you can add :unix to the layers on your open calls.
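Another way to take the newline layers out of the picture is to read and write raw bytes and do the UTF-16 conversion by hand with Encode. The following is a minimal sketch under that assumption; the file names and the sample content are placeholders, and the sample file is created by the script itself so that it runs anywhere.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my ($src, $dst) = ("$dir/master.cfg", "$dir/new.cfg");

# Create a small UTF-16LE sample with CRLF line endings (placeholder data).
open my $sample, '>:raw', $src or die "open $src: $!";
print {$sample} encode('UTF-16LE', "test\r\ntest2\r\n");
close $sample or die "close $src: $!";

# Read the bytes untouched, then decode them to Perl characters.
open my $in, '<:raw', $src or die "open $src: $!";
my $bytes = do { local $/; <$in> };
close $in;
my $text = decode('UTF-16LE', $bytes);

# ... any editing of $text would happen here, on characters ...

# Encode back to UTF-16LE and write the bytes untouched.
open my $out, '>:raw', $dst or die "open $dst: $!";
print {$out} encode('UTF-16LE', $text);
close $out or die "close $dst: $!";
```

Because no :crlf layer is involved, the "\r\n" pairs stay in $text as two characters and are written back exactly as they were read.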
This is a peculiarity/bug which occurs when the Windows-specific
crlf PerlIO layer is being used in combination with a multibyte
encoding like UTF-16. See this node for a more
detailed explanation and a workaround (the node talks about UCS-2, but
everything equally holds for UTF-16, so you can just substitute the latter).
In short, the solution is to specify the layers as
:raw:encoding(UTF-16LE):crlf:utf8
(hopefully, some solution will be found for an upcoming release that
makes this hack unnecessary)
As I confess in my other reply, I don't have a Windows box, so I don't know... Maybe the initial ":raw" is needed to defeat the intrinsic ":crlf" layer that is always imposed first by default on that OS.
But I'd be really surprised if there was any real need or impact of final ":utf8" -- I think you can dispense with that. (It certainly looks nonsensical having it there.)
In any case, an attentive reading of the PerlIO manual would be good medicine.
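Rather than guessing which layers are in effect, they can be listed with PerlIO::get_layers, which is described in the PerlIO manual. A small sketch, using a throwaway temporary file; the exact list printed depends on the platform and the perl build, so no particular output is assumed here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Throwaway file just to have something to open.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Layers applied by default on this platform.
open my $plain, '>', $path or die "open: $!";
my @default_layers = PerlIO::get_layers($plain, output => 1);
close $plain;

# Layers after asking for the explicit stack discussed in this thread.
open my $fh, '>:raw:encoding(UTF-16LE):crlf:utf8', $path or die "open: $!";
my @explicit_layers = PerlIO::get_layers($fh, output => 1);
close $fh;

print "default:  @default_layers\n";
print "explicit: @explicit_layers\n";
```

Running this on the affected Windows machine should show whether a crlf layer sits underneath the encoding layer in the default case.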
update: Thanks for the following reply, almut. It seems I shouldn't have been so surprised after all!
I'd be really surprised if there was any real need or impact of
final ":utf8" -- I think you can dispense with that.
The reason you need the final :utf8 is that the crlf
layer is kinda turning off the UTF8-ness (or however you want to call
it...). In other words, if you have a string containing non-ASCII
characters (which was the reason for inventing Unicode in the first
place, wasn't it :), you'd get nonsense, because the utf8 flag will
either be ignored (on output), or not be set (on input). Of course, if
you're only outputting an ASCII-only string like "hello", you won't see
a difference...
For example, when replacing the "e" in "hello" with an "ä"
(a-umlaut, U+00E4), you'd get correct output with
open my $fh, ">:raw:encoding(UTF-16LE):crlf:utf8", "ok.utf16" or die;
print $fh "h\x{00e4}llo\n";
$ od -tx1 -An ok.utf16
68 00 e4 00 6c 00 6c 00 6f 00 0d 00 0a 00
but not with
open my $fh, ">:raw:encoding(UTF-16LE):crlf", "err.utf16" or die;
print $fh "h\x{00e4}llo\n";
$ od -tx1 -An err.utf16
68 00 00 00 6c 00 6c 00 6f 00 0d 00 0a 00
      ^^
      wrong
accompanied by the warning when running the code:
Malformed UTF-8 character (unexpected non-continuation byte 0x6c,
immediately after start byte 0xe4) in null operation at ...
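The same layer string should also work on the reading side. Here is a hedged round-trip sketch: it writes two lines containing non-ASCII characters, reads them back through the same layers, and dumps the raw bytes for inspection. The file name and the second sample line are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir    = tempdir(CLEANUP => 1);
my $path   = "$dir/sample.utf16";
my $layers = ':raw:encoding(UTF-16LE):crlf:utf8';

my @lines = ("h\x{00e4}llo\n", "w\x{00f6}rld\n");

# Write: "\n" becomes CR LF, then everything is encoded as UTF-16LE.
open my $out, ">$layers", $path or die "open for write: $!";
print {$out} @lines;
close $out or die "close: $!";

# Read back through the same layers: decode first, then CR LF becomes "\n".
open my $in, "<$layers", $path or die "open for read: $!";
my @read_back = <$in>;
close $in;

# Dump the raw bytes on disk to see what was actually written.
open my $raw, '<:raw', $path or die "open raw: $!";
my $bytes = do { local $/; <$raw> };
close $raw;
my $hex = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
print "$hex\n";
```

If the layers behave as described above, @read_back holds the same two strings that were written, with plain "\n" line endings.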
I suspect you are using a Windows system, and Errto is on the right track: you need to take proper control of the "native crlf" behavior for that OS. Note that the ordering of the PerlIO layers can be significant. I don't have a Windows system to test on myself, but my BSD-based Mac OS X shows the following behaviors with the various permutations. YMMV, but I think you'll see something like this:
$ perl -e 'open($fh,">:crlf:encoding(UTF-16BE)", "test.utf16"); print $fh "hello\n"'
$ xxd test.utf16
0000000: 0068 0065 006c 006c 006f 000d 0a .h.e.l.l.o...
# that was bad -- odd number of bytes
$ perl -e 'open($fh,">:encoding(UTF-16BE):crlf", "test.utf16"); print $fh "hello\n"'
$ xxd test.utf16
0000000: 0068 0065 006c 006c 006f 000d 000a .h.e.l.l.o....
# that was good.
$ perl -e 'open($fh,">:raw:encoding(UTF-16BE)", "test.utf16"); print $fh "hello\n"'
$ xxd test.utf16
0000000: 0068 0065 006c 006c 006f 000a .h.e.l.l.o..
# also good (no CR, but who needs that anyway? ;)
$ perl -e 'open($fh,">:encoding(UTF-16BE):raw", "test.utf16"); print $fh "hello\n"'
$ xxd test.utf16
0000000: 6865 6c6c 6f0a hello.
# not what you want
Note that ":raw" is sort of a synonym for "unix" in this context (that is, the last two examples behave the same when using "unix" instead of "raw").
So, if you want a "standard" CRLF discipline for output, interacting correctly with UTF-16, there's only the one way to do that, it seems; OTOH, if you want unix-like LF discipline with UTF-16, there's a couple ways to get that (that is, you can say "unix" or "raw", but you still have to get the layers in the right order).
update: As almut wisely explains above, my examples are deficient -- each of the "working" cases should have the additional ":utf8" layer at the end, so that actual "wide characters" in the strings being printed will be interpreted and encoded correctly on output. Apologies for the confusion.
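For anyone without xxd or od at hand (on Windows, say), the same comparison can be run from a single Perl script that writes "hello\n" through each layer combination and then dumps the resulting bytes. This is a sketch only: the list of layer strings is my own selection from this thread, and what each one produces may differ between platforms and perl versions, which is the point of running it.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Layer combinations to compare (selection taken from this thread).
my @specs = (
    ':crlf:encoding(UTF-16BE)',
    ':encoding(UTF-16BE):crlf',
    ':raw:encoding(UTF-16BE)',
    ':raw:encoding(UTF-16BE):crlf:utf8',
);

my %hex_for;
for my $i (0 .. $#specs) {
    my $spec = $specs[$i];
    my $path = "$dir/test$i.utf16";

    # Write one line through the layers under test.
    open my $out, ">$spec", $path or die "open with $spec: $!";
    print {$out} "hello\n";
    close $out;

    # Re-read the file as raw bytes and format them as hex.
    open my $in, '<:raw', $path or die "reopen $path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;

    $hex_for{$spec} = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
    printf "%-36s %s\n", $spec, $hex_for{$spec};
}
```

An odd number of bytes in any row, or a lone 0a without a preceding 00, points to a newline translation happening on the wrong side of the encoding layer.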