Re: Problems handling UTF8 ! And removing accents.

We need to separate the problem into three parts:

1. The source of your data, and the encoding there
2. The Perl program, and how the string is marked there
3. The output of your data, and the encoding there

In your first program, you have a source file that is plain ASCII. In the program, you hand Perl two octets, \x{c3} and \x{a9}, that are the UTF-8 encoding of é. You never tell Perl that they are UTF-8, so Perl treats the string as "Latin-1" and reports a length of 2, one character per byte. When printing your data, you don't ask for any output encoding either, so Perl assumes you want Latin-1 as output format, which means the string is written without modification. Your console expects UTF-8, and the two bytes that Perl outputs happen to be the UTF-8 sequence for é, so the output looks right only by accident.

Here, adding binmode STDOUT, ':encoding(UTF-8)'; should tell Perl that you want UTF-8 on output, and using my $string= decode('UTF-8', "\x{c3}\x{a9}"); tells Perl to interpret the two octets as UTF-8. Together, these two changes should make the program do what you want.
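As a minimal sketch of those two changes (I don't have your original program in front of me, so the variable names and the print line are my own), the fixed first program could look like this:

```perl
use strict;
use warnings;
use Encode 'decode';

# Two octets: the UTF-8 encoding of U+00E9 (e with acute accent).
my $bytes = "\x{c3}\x{a9}";

# Undecoded, Perl sees two Latin-1 characters.
my $byte_length = length $bytes;

# Decoded, Perl sees one Unicode character.
my $string      = decode('UTF-8', $bytes);
my $char_length = length $string;

# Encode characters back to UTF-8 octets on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print "bytes: $byte_length, characters: $char_length, string: $string\n";
```

On a UTF-8 console this should print a byte count of 2, a character count of 1, and a single é.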

In your second program, you have a source file that is UTF-8. In the program, you again hand Perl the two octets of the UTF-8 encoding, but this time you also tell Perl that the program source is UTF-8. So Perl reports a length of 1, because the two bytes are decoded as UTF-8 into a single character. When printing your data, you still don't tell Perl about any output encoding, so Perl assumes you want Latin-1 as output format and converts the character to a single Latin-1 byte. Your console expects UTF-8, and that single byte happens to be an invalid UTF-8 sequence.
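To see that single byte for yourself, here is a small sketch. It is not your program: I build the character with \x{e9} plus utf8::upgrade so the example does not depend on how the file is saved, and I print to an in-memory handle (a stand-in for a STDOUT without any encoding layer) so the bytes can be inspected.

```perl
use strict;
use warnings;
use Encode 'decode';

# One character, U+00E9, stored the way Perl stores decoded text.
my $string = "\x{e9}";
utf8::upgrade($string);

# Print to an in-memory handle that has no encoding layer, like a plain STDOUT.
my $output = '';
open my $fh, '>', \$output or die "Cannot open in-memory handle: $!";
print {$fh} $string;
close $fh;

# Show the bytes that were actually written.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $output;

# A strict decode tells us whether those bytes are valid UTF-8.
my $is_valid_utf8 = eval {
    decode('UTF-8', $output, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
} ? 1 : 0;

print "bytes written: $hex, valid UTF-8: $is_valid_utf8\n";
# prints: bytes written: E9, valid UTF-8: 0
```

The lone E9 byte is what a UTF-8 console cannot display.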

Here, you only need to tell Perl that you want UTF-8 on output by using binmode on STDOUT.
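A sketch of the second program with that one change, assuming you marked the source as UTF-8 with use utf8 and wrote the character literally (the file must really be saved as UTF-8 for this to work):

```perl
use strict;
use warnings;
use utf8;                              # the source file itself is UTF-8

my $string = "é";                      # one character, U+00E9
my $length = length $string;

binmode STDOUT, ':encoding(UTF-8)';    # encode to UTF-8 when printing
print "$string has length $length\n";
```

No decode is needed here, because use utf8 already makes Perl decode the string literal.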

The two modules you use expect Unicode input, but you hand them byte sequences. You want to use Encode::decode to decode them to real Unicode strings:

use Encode 'decode';
my $string= decode 'UTF-8', "\x{c3}\x{a9}";
...
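To illustrate why decoding has to come first, here is a hedged sketch of accent removal on a decoded string. I am using Unicode::Normalize here purely as an example of a module that expects Unicode input; it may not be one of the two modules you use, and the sample word is my own.

```perl
use strict;
use warnings;
use Encode 'decode';
use Unicode::Normalize 'NFD';

# Octets for the UTF-8 encoding of "été", decoded into three characters.
my $string = decode('UTF-8', "\x{c3}\x{a9}t\x{c3}\x{a9}");

# Decompose each accented letter into base letter + combining mark,
# then drop the combining marks.
(my $plain = NFD($string)) =~ s/\p{Mn}+//g;

print "$plain\n";    # prints: ete
```

If you skip the decode step, NFD sees the four Latin-1 characters Ã, ©, t, Ã, © style input instead of é, and the accents are not found.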

Re^2: Problems handling UTF8 ! And removing accents. by prunkdump (Initiate) on Oct 28, 2014 at 09:25 UTC
Thank you very, very much! With your help and some research I have solved all my problems! Baptiste.