ucs-2be <-> utf8 ascii

germanuser has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I hope you can help me. I have a problem with converting between ucs-2be and utf8 ascii.

I'm using ActivePerl 5.8 and tried:

use Encode qw/encode decode/; $ucs2 = encode("UCS-2BE", $content);

or

open(FILE, ">:encoding(ucs-2be)", "output.txt");

The problem is that a carriage return and line feed don't come out as 4 bytes. My source file (ascii/utf8) has a \r\n (x0D0A) and should be converted to ucs-2be. The result is x000D0A, but it has to be x000D000A.

I also have the problem the other way around. When my ucs-2be source file with x000D000A is converted to ascii/utf8, I get x0D0D0A instead of just x0D0A.

Does anyone have an idea what's wrong, or what I forgot to take care of?

Replies are listed 'Best First'.
Re: ucs-2be <-> utf8 ascii by graff (Chancellor) on Jun 17, 2004 at 03:09 UTC
The following works for me (on 5.8.1, macosx) -- it's a simple stdin->stdout filter:

#!/usr/bin/perl
use Encode;
binmode STDIN, ":utf8";
binmode STDOUT, ":raw";   # raw bytes out; encode() below does the conversion
while (<>) { print encode( "ucs-2be", $_ ); }

If the input happens to be straight ASCII (which is really just a subset of utf8 now), the resulting output is exactly twice as many bytes as the input (and every even-numbered byte offset starting at offset 0 is a null byte). Both unix and dos style line terminations are treated consistently: every byte gets converted.

For input that actually has some wide characters in it, the difference in size between input and output will vary, and each wide character will of course have a non-null high byte in the output.

It's not clear to me what's wrong with your code. (Maybe that's because I saw it before anyone added "<code>" tags, or maybe it's just that you didn't show all the relevant stuff.) Or maybe you're using 5.8.0, and that version might have had some trouble with handling line termination? (I'm not sure about that...)

update: I forgot about the "return trip"... this works for me too, going the other direction:

#!/usr/bin/perl
use Encode;
binmode STDOUT, ":utf8";
binmode STDIN, ":raw";   # raw bytes in; decode() below does the conversion
while (<>) { print decode( "ucs-2be", $_ ); }

I checked a dos-style ASCII file on the round-trip -- the ucs-2be version was valid, and the return from that to utf8 came out identical to the original data.
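To check the "twice as many bytes" claim without any filehandles involved, a minimal sketch along these lines can help. The sample string is made up for illustration; only encode() and decode() from the Encode module are used.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Made-up sample: two ASCII letters plus a DOS-style line ending.
my $text = "ab\r\n";

# Convert the character string to UCS-2BE bytes in memory.
my $ucs2 = encode("UCS-2BE", $text);

# Each ASCII character should become two bytes: a null byte, then the ASCII byte.
printf "%d chars -> %d bytes\n", length($text), length($ucs2);
print join(" ", map { sprintf "%02X", ord($_) } split //, $ucs2), "\n";

# Decode the bytes again and compare with the original string.
my $back = decode("UCS-2BE", $ucs2);
print $back eq $text ? "round trip ok\n" : "round trip differs\n";
```

If this prints 00 0D 00 0A for the line ending, the conversion itself is fine, and any difference seen in a file is more likely to come from the I/O layers on the filehandle than from Encode.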
Re: ucs-2be <-> utf8 ascii by iburrell (Chaplain) on Jun 17, 2004 at 16:13 UTC
There is a bug with the CR-LF filter and multibyte encodings. It sounds like the crlf layer is byte-oriented and puts on the x0A after the encoding. The bug report, http://rt.perl.org/rt3/index.html?q=24077, has a couple of workarounds: either turning off the crlf layer, or changing the order of the layers.
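As a rough sketch of the first workaround (turning off the crlf layer), the following opens the file with :raw in front of the encoding layer and writes the CR LF explicitly. The file name is the one from the question and the sample text is made up; this is an illustration under those assumptions, not the exact code from the bug report.

```perl
use strict;
use warnings;

# File name taken from the question; the content written is a made-up sample.
my $file = "output.txt";

# :raw comes first so that no byte-oriented :crlf layer sits underneath
# the encoding layer and rewrites single bytes after encoding.
open(my $out, ">:raw:encoding(UCS-2BE)", $file) or die "Cannot open $file: $!";
print {$out} "a\r\nb";   # CR and LF are written explicitly as characters
close($out) or die "Cannot close $file: $!";

# Read the raw bytes back and print them in hex to inspect the result.
open(my $in, "<:raw", $file) or die "Cannot open $file: $!";
my $bytes = do { local $/; <$in> };
close($in);
print join(" ", map { sprintf "%02X", ord($_) } split //, $bytes), "\n";
```

For the second workaround (changing the order), a layer string such as ">:raw:encoding(UCS-2BE):crlf" is often suggested, so that the newline translation happens on characters before they are encoded; I have not verified that on ActivePerl 5.8.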
Re: ucs-2be <-> utf8 ascii by eserte (Deacon) on Jun 16, 2004 at 20:01 UTC
I just tried the following on a Linux system and I think the result is correct:

perl -MEncode=encode -e '$x="a\015\012b"; $ucs2 = encode("UCS-2BE", $x); print $ucs2' | hexdump
0000000 \0 a \0 \r \0 \n \0 b
0000008

Maybe a Windows or ActivePerl problem?