in reply to Problems handling UTF8 ! And removing accents.
We need to separate the problem into three parts:
In your first program, you have a source file that is plain ASCII. In the program, you hand Perl two octets that represent the UTF-8 encoding of é. Perl has no way of knowing that, so it treats the string as a "Latin-1" string of length 2, one character per byte. When printing your data, you don't tell Perl that anything special should be done, so Perl assumes you want Latin-1 as the output format. For this string, Latin-1 output means the bytes are written unmodified. Your console expects UTF-8, and the two bytes that Perl outputs happen to be the UTF-8 encoding of Eacute, so the output only looks right by accident.
Here, adding binmode STDOUT, ':encoding(UTF-8)'; tells Perl that you want UTF-8 on output, and using my $string= decode('UTF-8', "\x{c3}\x{a9}"); tells Perl to interpret the two octets as UTF-8. Together, these two changes should make the program do what you want.
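As a minimal sketch of the fix described above (the variable name $bytes is only for illustration and is not from your program), the two changes together might look like this:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two octets that form the UTF-8 encoding of U+00E9 (e with acute accent).
# $bytes is a hypothetical name used only in this sketch.
my $bytes = "\x{c3}\x{a9}";

# Decode the octets into a Unicode string of one character.
my $string = decode('UTF-8', $bytes);

# Ask Perl to encode characters as UTF-8 when writing to STDOUT.
binmode STDOUT, ':encoding(UTF-8)';

printf "octets: %d, characters: %d\n", length($bytes), length($string);
print "$string\n";
```

The undecoded string has length 2 and the decoded one has length 1, which is the difference between "two bytes" and "one character" discussed above.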
In your second program, you have a source file that is UTF-8. In the program, you hand Perl two octets that represent the UTF-8 encoding, and tell Perl that the program source is UTF-8. So Perl decodes the two bytes into a single character and the string has length 1. When printing your data, you don't tell Perl that anything special should be done, so Perl assumes you want Latin-1 as the output format and converts your Unicode string to Latin-1 when printing it. Your console expects UTF-8, and the single byte that Perl outputs happens to be an invalid UTF-8 sequence.
Here, you only need to tell Perl that you want UTF-8 on output by using binmode on STDOUT.
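Your second program is not quoted here, so the following is only an assumed reconstruction: a source file saved as UTF-8, with use utf8 and a literal é. The variables $latin1 and $utf8 are hypothetical and exist only to show which bytes end up on the wire with and without the output layer:

```perl
use strict;
use warnings;
use utf8;                  # the source file itself is saved as UTF-8
use Encode qw(encode);

my $string = "é";          # one character, U+00E9

# What print sends without an output layer: the single Latin-1 byte 0xE9,
# which on its own is not a valid UTF-8 sequence.
my $latin1 = encode('ISO-8859-1', $string);

# What print sends with the :encoding(UTF-8) layer: the two bytes 0xC3 0xA9.
my $utf8 = encode('UTF-8', $string);

printf "latin1 bytes: %d, utf8 bytes: %d\n", length($latin1), length($utf8);

# The only change the second program needs:
binmode STDOUT, ':encoding(UTF-8)';
print "$string\n";
```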
The two modules you use expect Unicode input, but you hand them byte sequences. You want to use Encode::decode to decode them to real Unicode strings:
    use Encode 'decode';
    my $string= decode 'UTF-8', "\x{c3}\x{a9}";
    ...
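To illustrate why decoding first matters for accent removal, here is a hedged sketch that does not use your two modules at all. It swaps in a different technique: canonical decomposition with the core module Unicode::Normalize, followed by stripping the combining marks. The sample word and the name $plain are assumptions made for this example:

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Decode the UTF-8 octets for the French word "été" into a Unicode string.
my $string = decode('UTF-8', "\x{c3}\x{a9}t\x{c3}\x{a9}");

# NFD splits each accented letter into base letter + combining mark;
# the substitution then removes the combining marks (general category Mn).
(my $plain = NFD($string)) =~ s/\p{Mn}//g;

binmode STDOUT, ':encoding(UTF-8)';
print "$string -> $plain\n";
```

If the same steps are applied to the undecoded byte string, NFD sees the two Latin-1 characters "Ã" and "©" per accented letter instead of "é", and the result is not what you want.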
Re^2: Problems handling UTF8 ! And removing accents.
by prunkdump (Initiate) on Oct 28, 2014 at 09:25 UTC |