Encode a Utf-8 file to Unicode (TRICKY)

JustMe79 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm using Perl 5.8.7 and I'm trying to encode a UTF-8 file to Unicode. I've already managed to encode from Unicode to UTF-8 with the following code:

#!C:\perl\perl.exe -w

use open IN => ':encoding(UCS-2LE)', OUT => ':encoding(UTF-8)';
use open ':std';

while (<STDIN>) {
    print STDOUT;
}

But I can't reverse the process! Is there anyone who knows how to solve my problem? I need an answer quickly. Thanks!

20050714 Cleaned up by Corion: Added code tags


Replies are listed 'Best First'.
Re: Encode a Utf-8 file to Unicode (TRICKY) by zby (Vicar) on Jul 14, 2005 at 13:57 UTC
First, you need to know that `UCS-2LE != Unicode`. Then you might read the documentation for the Encode module and give us exact examples of what does not work and why you think it does not.
Re: Encode a Utf-8 file to Unicode (TRICKY) by dave_the_m (Monsignor) on Jul 14, 2005 at 13:04 UTC
You don't say whether simply reversing the encoding arguments works for you and, if it doesn't, why not; i.e. `use open IN => ':encoding(UTF-8)', OUT => ':encoding(UCS-2LE)';` Dave.
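A minimal sketch of the reversed direction, written as a file-to-file conversion rather than a STDIN/STDOUT filter. It is not taken from the thread: the sub name `convert_utf8_to_ucs2le` and the explicit `:raw` layer are assumptions added for illustration. `:raw` is there so that platform newline translation (such as `:crlf` on Windows) is not applied to the already encoded bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: read $in_path as UTF-8, write $out_path as UCS-2LE.
sub convert_utf8_to_ucs2le {
    my ($in_path, $out_path) = @_;

    # :raw turns off newline translation; :encoding(...) on top of it
    # decodes on input and encodes on output.
    open my $in, '<:raw:encoding(UTF-8)', $in_path
        or die "Cannot read $in_path: $!";
    open my $out, '>:raw:encoding(UCS-2LE)', $out_path
        or die "Cannot write $out_path: $!";

    # Lines are character strings here; the layers handle the bytes.
    while (defined(my $line = <$in>)) {
        print {$out} $line;
    }

    close $out or die "Cannot close $out_path: $!";
    close $in;
    return;
}

# Run as: perl convert.pl input-utf8.txt output-ucs2le.txt
convert_utf8_to_ucs2le(@ARGV) if @ARGV == 2;
```

Note that UCS-2LE output written this way carries no byte order mark; whether the consuming application needs one depends on that application.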
Re: Encode a Utf-8 file to Unicode (TRICKY) by revdiablo (Prior) on Jul 14, 2005 at 16:55 UTC
For the sake of those who don't quite follow zby's comment, I'd like to point out that Unicode is a character set, whereas `UTF-8` is a character encoding. A character encoding is used to encode the characters from a character set into bits and bytes. A character set is an abstract thing, so converting from an encoding to a character set doesn't make a whole lot of sense (unless you are then going to encode the characters into another encoding). Moreover, UTF-8 already uses the Unicode character set. Your question should properly be, "How do I encode a UTF-8 file as UCS-2LE?" UCS-2LE is just another encoding; it is not Unicode.
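To make the character set versus encoding distinction concrete, here is a small sketch using the core Encode module. The helper `hex_bytes` and the choice of U+00E9 as the sample character are assumptions for illustration only. It shows one abstract Unicode character serialized as two different byte sequences, and a conversion done by decoding to characters and then re-encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# One abstract Unicode character: U+00E9, LATIN SMALL LETTER E WITH ACUTE.
my $char = "\x{E9}";

# The same character under two encodings gives two different byte sequences.
my $utf8_bytes   = encode('UTF-8',   $char);
my $ucs2le_bytes = encode('UCS-2LE', $char);

printf "UTF-8:   %s\n", hex_bytes($utf8_bytes);    # prints "UTF-8:   C3 A9"
printf "UCS-2LE: %s\n", hex_bytes($ucs2le_bytes);  # prints "UCS-2LE: E9 00"

# Converting between encodings: bytes -> characters -> bytes.
my $converted = encode('UCS-2LE', decode('UTF-8', $utf8_bytes));
printf "Converted: %s\n", hex_bytes($converted);   # prints "Converted: E9 00"
```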
Re: Encode a Utf-8 file to Unicode (TRICKY) by graff (Chancellor) on Jul 15, 2005 at 02:32 UTC
Apart from the points of confusion mentioned in the earlier replies, you also need to be a little more specific about what you mean when you say you "can't reverse the process". I know you mean that when you try to convert back from UTF-8 to UCS-2LE (essentially UTF-16LE without surrogate pairs), the resulting data file is different somehow from the original UCS-2LE data. But how is it different, exactly? Are characters missing, or added, or altered? Is the data corrupted in some way that makes it impossible for UCS-2LE-based applications to read it or display it correctly? Can you pinpoint where the difference first shows up, and which particular characters are involved? If you can look carefully at what the differences are, and update your post to include details about the differences you find, perhaps we'll be able to help you better. (update: ... and it wouldn't hurt if you explicitly show what you tried in order to "reverse the process" -- it could be that if you tried some other method, it might work, but we won't know what to suggest if we don't know what you've tried)
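One way to pinpoint where two files first diverge is to compare them byte by byte. The following is only a sketch under assumptions: the sub names `slurp_raw` and `first_difference` are made up for illustration, and both files are assumed small enough to read into memory.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as raw bytes, with no layers applied.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot read $path: $!";
    local $/;                      # slurp mode
    my $data = <$fh>;
    close $fh;
    return defined $data ? $data : '';
}

# Returns the byte offset of the first difference, or -1 if identical.
# If one string is a prefix of the other, the offset is the shorter length.
sub first_difference {
    my ($x, $y) = @_;
    my $limit = length($x) < length($y) ? length($x) : length($y);
    for my $i (0 .. $limit - 1) {
        return $i if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    return length($x) == length($y) ? -1 : $limit;
}

# Run as: perl firstdiff.pl original.txt roundtrip.txt
if (@ARGV == 2) {
    my $offset  = first_difference(map { slurp_raw($_) } @ARGV);
    my $message = $offset < 0
        ? "Files are identical\n"
        : "First difference at byte offset $offset\n";
    print $message;
}
```

With the offset in hand, a hex dump of the bytes around that position in both files should show which characters are involved.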