prunkdump has asked for the wisdom of the Perl Monks concerning the following question:
Hello, I have many problems handling UTF-8 with Perl and I can't understand where they come from. My whole system uses UTF-8, and I have Perl 5.14, which fully supports it.
    ~# perl --version
    This is perl 5, version 14, subversion 2 (v5.14.2)
    ...
    ~# locale
    LANG=en_US.utf8
    ...
But I get very strange output from Perl. In UTF-8, the French "é" character is encoded as the two bytes "\x{c3}\x{a9}". But Perl says that the "é" string has length 2:
    #! /usr/bin/perl -w
    use 5.014;
    use feature 'unicode_strings';
    use strict;
    use warnings;

    my $string = "\x{c3}\x{a9}";
    print length($string)."\n";
    print "$string\n";
    ------------------
    ~# ./test.pl
    2
    é
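A length of 2 is what Perl reports when the string holds the two raw UTF-8 bytes rather than one decoded character. A minimal sketch of the difference, assuming the input really is UTF-8 and using the core Encode module (the variable names are only illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two bytes: the UTF-8 encoding of U+00E9 (é).
my $bytes = "\x{c3}\x{a9}";
print length($bytes), "\n";    # prints 2 (bytes)

# Decode the bytes into a string holding one character.
my $chars = decode('UTF-8', $bytes);
print length($chars), "\n";    # prints 1 (character)

# STDOUT has no encoding layer here, so encode back to bytes before printing.
print encode('UTF-8', $chars), "\n";
```

The general pattern is to decode bytes on input, work with characters inside the program, and encode again on output.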
If I try to use UTF-8 directly in the code with "use utf8", Perl converts my "\x{c3}\x{a9}" character to "\x{e9}"!? The length is now 1, but I can't display the string! I have checked that the "é" character is encoded as "\x{c3}\x{a9}" in the source code and is output as "\x{e9}" by Perl.
    #! /usr/bin/perl -w
    use 5.014;
    use feature 'unicode_strings';
    use strict;
    use warnings;
    use utf8;

    my $string = "é";
    print length($string)."\n";
    print "$string\n";
    ------------------
    ~# ./test.pl
    1
    ?
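The "?" is consistent with Perl printing the single character U+00E9 as the lone byte 0xE9, which is not valid UTF-8 for the terminal. A minimal sketch, assuming the script file is saved as UTF-8 and the terminal expects UTF-8, that adds an encoding layer to STDOUT:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # the source file itself is UTF-8

# Encode characters as UTF-8 on the way out to the terminal.
binmode(STDOUT, ':encoding(UTF-8)');

my $string = "é";                       # one character, U+00E9
print length($string), "\n";
print "$string\n";
```

The pragma `use open qw(:std :encoding(UTF-8));` is another way to set the same layer on the standard handles.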
If I read the "é" character from a file, I can display the string, but the length is still 2!
    #! /usr/bin/perl -w
    use 5.014;
    use feature 'unicode_strings';
    use strict;
    use warnings;

    open(HDL,"./test.txt");
    my $string = <HDL>;
    chomp($string);
    print length($string)."\n";
    print "$string\n";
    ------------------
    ~# cat test.txt
    é
    ~# ./test.pl
    2
    é
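Here the file is read as raw bytes and written back as the same bytes, which is why it displays correctly while the length counts bytes. A minimal sketch of reading through a decoding layer instead, assuming the file is UTF-8; a temporary file stands in for test.txt so the example is self-contained:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a sample file holding the UTF-8 bytes for "é" (stand-in for test.txt).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "\x{c3}\x{a9}\n";
close($tmp) or die "Cannot close $path: $!";

# Read it back through a decoding layer so each character counts once.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
my $string = <$in>;
close($in);
chomp($string);

# Encode on output so the terminal receives UTF-8 again.
binmode(STDOUT, ':encoding(UTF-8)');
print length($string), "\n";
print "$string\n";
```

The three-argument `open` with a lexical handle also lets the failure case be checked, which the bareword form above does not do.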
I want to remove the accents from my strings, but Unicode::Normalize and Text::Unidecode seem to return inconsistent results!
    #! /usr/bin/perl -w
    use 5.014;
    use feature 'unicode_strings';
    use strict;
    use warnings;
    use Unicode::Normalize;

    my $string = "\x{c3}\x{a9}";
    my $decomposed = NFD( $string );
    $decomposed =~ s/\pM//g;
    $decomposed = NFC( $decomposed );
    print length($decomposed)."\n";
    print "$decomposed\n";
    ------------------
    ~# ./test.pl
    2
    A?
    #! /usr/bin/perl -w
    use 5.014;
    use feature 'unicode_strings';
    use strict;
    use warnings;
    use Text::Unidecode;

    my $string = "\x{c3}\x{a9}";
    my $ascii = unidecode($string);
    print length($ascii)."\n";
    print "$ascii\n";
    ------------------
    ~# ./test.pl
    4
    A(c)
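Both results are consistent with the two modules seeing the characters U+00C3 ("Ã") and U+00A9 ("©") rather than a single "é": stripping the tilde from "Ã" leaves "A", and "©" transliterates to "(c)". A minimal sketch of the Unicode::Normalize approach applied to a decoded string, assuming UTF-8 input; `strip_accents` is a hypothetical helper name, not part of either module:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: remove combining marks from a character string.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);    # split "é" into "e" + combining acute
    $decomposed =~ s/\pM//g;        # drop every combining mark
    return NFC($decomposed);        # recompose whatever is left
}

# Decode the UTF-8 bytes for "été" into characters first.
my $string = decode('UTF-8', "\x{c3}\x{a9}t\x{c3}\x{a9}");
my $plain  = strip_accents($string);

print length($plain), "\n";    # prints 3
print "$plain\n";              # prints ete
```

Text::Unidecode likewise expects decoded characters, so the same decoding step would apply before calling `unidecode`.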
I'm completely lost! Can anyone help me? Thanks. Baptiste.
Re: Problems handling UTF8 ! And removing accents.
by Corion (Patriarch) on Oct 26, 2014 at 09:34 UTC
by prunkdump (Initiate) on Oct 28, 2014 at 09:25 UTC
Re: Problems handling UTF8 ! And removing accents.
by Laurent_R (Canon) on Oct 26, 2014 at 09:57 UTC