prunkdump has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have many problems handling UTF-8 in Perl and I can't understand where they come from. My whole system uses UTF-8, and I have Perl 5.14, which fully supports it.

~# perl --version
This is perl 5, version 14, subversion 2 (v5.14.2)
...
~# locale
LANG=en_US.utf8
...

But I get very strange output from Perl. In French (and in UTF-8), the "é" character is encoded as "\x{c3}\x{a9}". But Perl says that the "é" string has length 2:

#! /usr/bin/perl -w
use 5.014;
use feature 'unicode_strings';
use strict;
use warnings;
my $string = "\x{c3}\x{a9}";
print length($string)."\n";
print "$string\n";
------------------
~# ./test.pl
2
é

If I try to use UTF-8 directly in the code with "use utf8", Perl converts my "\x{c3}\x{a9}" character to "\x{e9}"!? The length is now 1, but I can't display the string! I have checked that the "é" character is encoded as "\x{c3}\x{a9}" in the source code and output as "\x{e9}" by Perl.

#! /usr/bin/perl -w
use 5.014;
use feature 'unicode_strings';
use strict;
use warnings;
use utf8;
my $string = "é";
print length($string)."\n";
print "$string\n";
------------------
~# ./test.pl
1
?

If I read the "é" character from a file, I can display the string but I still get length 2!

#! /usr/bin/perl -w
use 5.014;
use feature 'unicode_strings';
use strict;
use warnings;
open(HDL,"./test.txt");
my $string = <HDL>;
chomp($string);
print length($string)."\n";
print "$string\n";
------------------
~# cat test.txt
é
~# ./test.pl
2
é

I want to remove the accents from my strings, but Unicode::Normalize and Text::Unidecode seem to return inconsistent results!

#! /usr/bin/perl -w
use 5.014;
use feature 'unicode_strings';
use strict;
use warnings;
use Unicode::Normalize;
my $string = "\x{c3}\x{a9}";
my $decomposed = NFD( $string );
$decomposed =~ s/\pM//g;
$decomposed = NFC( $decomposed );
print length($decomposed)."\n";
print "$decomposed\n";
------------------
~# ./test.pl
2
A?
Or
#! /usr/bin/perl -w
use 5.014;
use feature 'unicode_strings';
use strict;
use warnings;
use Text::Unidecode;
my $string = "\x{c3}\x{a9}";
my $ascii = unidecode($string);
print length($ascii)."\n";
print "$ascii\n";
------------------
~# ./test.pl
4
A(c)

I'm completely lost! Can anyone help me? Thanks. Baptiste.

Re: Problems handling UTF8 ! And removing accents.
by Corion (Patriarch) on Oct 26, 2014 at 09:34 UTC

    We need to separate the problem into three parts:

    1. The source of your data, and the encoding there
    2. The Perl program and how the string is marked there
    3. The output of your data, and the encoding there
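
    To make the three parts concrete, here is a minimal sketch of the file-reading case (the third program in the question). To keep it self-contained, an in-memory filehandle stands in for test.txt; that substitution and the variable names are my own assumptions, not part of the original program.

```perl
use strict;
use warnings;

# Part 1 (source): these bytes stand in for the contents of test.txt.
my $file_bytes = "\xc3\xa9\n";

# The :encoding(UTF-8) layer decodes bytes into characters while reading.
open my $in, '<:encoding(UTF-8)', \$file_bytes or die "open: $!";
my $string = <$in>;
close $in;
chomp $string;

# Part 2 (program): the string now holds one character, U+00E9.
my $char_count = length $string;

# Part 3 (output): encode characters back to UTF-8 bytes when printing.
binmode STDOUT, ':encoding(UTF-8)';
print "$char_count\n";    # 1
print "$string\n";        # é
```

    With a real file, the open line would presumably become open my $in, '<:encoding(UTF-8)', './test.txt' with the same error check.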

    In your first program, you have a source file that is plain ASCII. In the program, you hand Perl two octets that represent the UTF-8 encoding of "é". Perl thinks this string has length 2, because it consists of two bytes and is treated as a "Latin-1" string. When printing your data, you don't tell Perl that anything special should be done, so Perl assumes you want Latin-1 as the output format. Latin-1 means no modification is made to your string. Your console expects UTF-8, and the two bytes that Perl outputs happen to be the UTF-8 encoding of é (Eacute), so it displays correctly.

    Here, adding binmode STDOUT, ':encoding(UTF-8)'; should tell Perl that you want UTF-8 on output, and using my $string = decode('UTF-8', "\x{c3}\x{a9}"); (from the Encode module) tells Perl to interpret those bytes as UTF-8. Together, these should change the program to do what you want.
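
    Put together, the first program might look like the sketch below. The $bytes variable is only there to show both lengths side by side; it is not in the original program.

```perl
use strict;
use warnings;
use Encode 'decode';

# The literal holds two bytes: the UTF-8 encoding of "é".
my $bytes = "\x{c3}\x{a9}";

# decode turns the two bytes into one character, U+00E9.
my $string = decode('UTF-8', $bytes);

# Encode characters as UTF-8 on the way out to the terminal.
binmode STDOUT, ':encoding(UTF-8)';

print length($bytes), "\n";     # 2
print length($string), "\n";    # 1
print "$string\n";              # é
```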

    In your second program, you have a source file that is UTF-8. In the program, you hand Perl two octets that represent the UTF-8 encoding, and tell Perl that the program source is UTF-8. So Perl decodes those two bytes into a single character (U+00E9), and the string has length 1. When printing your data, you don't tell Perl that anything special should be done, so Perl assumes you want Latin-1 as the output format and converts your string to Latin-1 when printing it. Your console expects UTF-8, and the single byte that Perl outputs (0xE9) happens to be an invalid UTF-8 sequence, hence the "?".

    Here, you only need to tell Perl that you want UTF-8 on output by using binmode on STDOUT.
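
    As a sketch, the second program with that one change would be something like the following. It assumes the script file itself is saved as UTF-8, as described in the question.

```perl
use strict;
use warnings;
use utf8;    # the source file itself is saved as UTF-8

# Thanks to "use utf8", this literal is one character, U+00E9.
my $string = "é";

# The only addition: encode characters as UTF-8 on output.
binmode STDOUT, ':encoding(UTF-8)';

print length($string), "\n";    # 1
print "$string\n";              # é
```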

    The two modules you use expect Unicode input, but you hand them byte sequences. You want to use Encode::decode to decode them to real Unicode strings:

    use Encode 'decode';
    my $string = decode 'UTF-8', "\x{c3}\x{a9}";
    ...
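
    For the accent removal itself, a possible sketch using Unicode::Normalize on a decoded string follows. The helper name strip_accents and the sample word are made up for illustration; the NFD / s/\pM//g / NFC steps are the ones from the question.

```perl
use strict;
use warnings;
use Encode 'decode';
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper, not from the thread: strips combining marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);    # "é" becomes "e" followed by U+0301
    $decomposed =~ s/\pM//g;        # drop the combining marks
    return NFC($decomposed);
}

# Decode first, so the module sees characters rather than bytes.
my $string = decode('UTF-8', "\x{c3}\x{a9}t\x{c3}\x{a9}");    # "été"
my $plain  = strip_accents($string);

binmode STDOUT, ':encoding(UTF-8)';
print length($plain), "\n";    # 3
print "$plain\n";              # ete
```

    Text::Unidecode should likewise be handed the decoded string rather than the raw bytes.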

      Thank you very, very much! With your help and some research I have solved all my problems! Baptiste.

Re: Problems handling UTF8 ! And removing accents.
by Laurent_R (Canon) on Oct 26, 2014 at 09:57 UTC