bcrowell2 has asked for the wisdom of the Perl Monks concerning the following question:
O Monks,
I have a program that reads utf8 from a file, and writes utf8 to stdout. It's internationalized in a bunch of languages. The relevant section of the code seems to be the following:
binmode STDOUT, ":utf8"; # eliminates "Wide character in print" error in Czech
use open ":encoding(utf8)"; # otherwise utf8 in input files is read as if 1 character==1 byte
# The combination of two lines above is needed in order to get the following to work:
#   - Czech characters coded into the source print without the "Wide character in print" error.
#   - Accented characters and Greek characters in the input file are read properly and printed back out properly.
# When testing this, make sure to use a terminal such as mlterm that can handle accented characters,
# and make sure that the --nofilter_accents_on_output has not been set automatically based on the
# value of the $TERM variable. (Using mlterm prevents this.)
# See "man perlunicode".
# An example of the confusing way all of this works:
#   perl -e 'binmode STDOUT,":utf8"; print "\x{11b}\x{e9}"'
#   perl -e 'binmode STDOUT,":utf8"; print "\x{11b}\x{e9}"' >a.a
#   perl -e 'binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print $x'
#   perl -e 'binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print length $x'
#   perl -e 'use open ":encoding(utf8)"; binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print $x'
#   perl -e 'use open ":encoding(utf8)"; binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print length $x'
use utf8; # Indicates that source can contain utf8, which we use for the Greek translation.
use locale;
As you can see from the length of the comments, it hasn't been as straightforward as I would have liked to make this Just Work for my users.
The latest problem has to do with the line 'binmode STDOUT, ":utf8";'. This was needed in order to avoid a "Wide character in print" error in Czech. However, adding that line seems to have broken the program for a Danish-speaking user. If he uses a utf8-encoded input file with a ligatured ae character (c3a6), he gets errors like 'utf8 "\xF8" does not map to Unicode at ./when line 1389, <FILE> line 29.' I do not get the same error on the same input file on my own machine. He's running Debian Etch with LANG=en_US.ISO-8859-15 LC_CTYPE=C, and a US keyboard layout. I'm running Ubuntu Gutsy with a US setup. I need to check back with him, but it sounds as though the utf8 codes that perl is complaining about are different than the ones that are actually in his input file -- they all have F and E in the LSB. (I'm checking back with him on this, since there's some confusion in the emails.)
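For anyone who wants to see how a message of this form can arise, here is a minimal sketch. It assumes, purely as a guess that the thread does not confirm, that the file being read contains a raw byte f8 rather than valid UTF-8. The temporary file and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical input: a file holding the single raw byte f8 and a newline.
# This is NOT valid UTF-8 (in ISO-8859-1 that byte would be o-with-stroke).
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "\xf8\n";
close $fh;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Read the file through the same kind of layer that 'use open' installs.
open(my $in, '<:encoding(utf8)', $path) or die "open: $!";
my $line = <$in>;
close $in;

print "warning: $_" for @warnings;
```

If the warning appears here, the decoding layer was handed bytes that are not UTF-8, so the next thing to check is the actual bytes in the file (for example with a hex dump).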
The Wikipedia article on the ae character, http://en.wikipedia.org/wiki/%C3%86 , says it's unicode e6. Maybe this is a character that can be encoded in unicode in two different ways? If I display c3a6 in a unicode-aware terminal like mlterm, it does display as a ligatured ae. Maybe perl is trying to convert it to the single-character version, or something??
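The relationship between e6 and c3a6 can be checked directly with the core Encode module. The following is a small sketch rather than a diagnosis of the Danish user's problem; the variable names are only illustrative. It shows that e6 is the code point, c3 a6 is that code point's UTF-8 byte encoding, and a lone byte f8 is accepted as ISO-8859-1 but rejected as strict UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00E6 (ligatured ae) is a code point; encode it to UTF-8 bytes.
my $ae  = "\x{e6}";
my $hex = unpack('H*', encode('UTF-8', $ae));   # hex dump of the bytes
print "U+00E6 as UTF-8 bytes: $hex\n";           # c3a6

# Decoding the two bytes c3 a6 gives back one character.
my $utf8_bytes = "\xc3\xa6";
my $chars      = decode('UTF-8', $utf8_bytes);
printf "decoded length: %d, code point: U+%04X\n", length($chars), ord($chars);

# The single byte f8 is a valid ISO-8859-1 character (U+00F8) ...
my $raw    = "\xf8";
my $latin1 = decode('ISO-8859-1', $raw);
printf "f8 as ISO-8859-1: U+%04X\n", ord($latin1);

# ... but it is not valid UTF-8, so a strict decode dies.
my $copy = $raw;   # decode with a CHECK value may modify its argument
my $ok   = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
print "f8 as strict UTF-8: ", ($ok ? "accepted" : "rejected"), "\n";
```

So c3a6 and e6 are not two competing encodings of the character: one is the byte sequence on disk, the other is the code point number.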
Does anyone have any clue what might be happening here?
TIA!
Re: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'
by Juerd (Abbot) on Feb 25, 2008 at 02:06 UTC
by bcrowell2 (Friar) on Feb 25, 2008 at 02:29 UTC
by Juerd (Abbot) on Feb 25, 2008 at 10:23 UTC
by bcrowell2 (Friar) on Feb 25, 2008 at 04:43 UTC
by Juerd (Abbot) on Feb 25, 2008 at 15:04 UTC
by ikegami (Patriarch) on Feb 25, 2008 at 17:40 UTC
by shagbark (Acolyte) on Oct 22, 2014 at 01:31 UTC
Re: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'
by bcrowell2 (Friar) on Feb 25, 2008 at 23:13 UTC