i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode'

bcrowell2 has asked for the wisdom of the Perl Monks concerning the following question:

O Monks,

I have a program that reads utf8 from a file, and writes utf8 to stdout. It's internationalized in a bunch of languages. The relevant section of the code seems to be the following:

binmode STDOUT, ":utf8"; # eliminates "Wide character in print" error in Czech
use open ":encoding(utf8)"; # otherwise utf8 in input files is read as if 1 character==1 byte
# The combination of two lines above is needed in order to get the following to work:
#    - Czech characters coded into the source print without the "Wide character in print" error.
#    - Accented characters and Greek characters in the input file are read properly and printed back out properly.
# When testing this, make sure to use a terminal such as mlterm that can handle accented characters,
# and make sure that the --nofilter_accents_on_output has not been set automatically based on the
# value of the $TERM variable. (Using mlterm prevents this.)
# See "man perlunicode".
# An example of the confusing way all of this works:
#    perl -e 'binmode STDOUT,":utf8"; print "\x{11b}\x{e9}"'
#    perl -e 'binmode STDOUT,":utf8"; print "\x{11b}\x{e9}"' >a.a
#    perl -e 'binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print $x'
#    perl -e 'binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print length $x'
#    perl -e 'use open ":encoding(utf8)"; binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print $x'
#    perl -e 'use open ":encoding(utf8)"; binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print length $x'

use utf8; # Indicates that source can contain utf8, which we use for the Greek translation.
use locale;

As you can see from the length of the comments, it hasn't been as straightforward as I would have liked to make this Just Work for my users.

The latest problem has to do with the line 'binmode STDOUT, ":utf8";'. This was needed in order to avoid a "Wide character in print" error in Czech. However, adding that line seems to have broken the program for a Danish-speaking user. If he uses a utf8-encoded input file with a ligatured ae character (c3a6), he gets errors like 'utf8 "\xF8" does not map to Unicode at ./when line 1389, <FILE> line 29.' I do not get the same error on the same input file on my own machine. He's running Debian Etch with LANG=en_US.ISO-8859-15 LC_CTYPE=C, and a US keyboard layout. I'm running Ubuntu Gutsy with a US setup. I need to check back with him, but it sounds as though the utf8 codes that perl is complaining about are different than the ones that are actually in his input file -- they all have F and E in the LSB. (I'm checking back with him on this, since there's some confusion in the emails.)

The Wikipedia article on the ae character, http://en.wikipedia.org/wiki/%C3%86 , says it's unicode e6. Maybe this is a character that can be encoded in unicode in two different ways? If I display c3a6 in a unicode-aware terminal like mlterm, it does display as a ligatured ae. Maybe perl is trying to convert it to the single-character version, or something??

Does anyone have any clue what might be happening here?

TIA!

Re: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by Juerd (Abbot) on Feb 25, 2008 at 02:06 UTC
How are you READING the UTF-8 data?

Outputting is hard to do wrong. Indeed you just set an :encoding or :utf8 layer on the output handle. However, if you use :utf8 for input, you're in for trouble (malfunction and security bugs). Always use :encoding for text input.

The error message about 0xF8 (which is the Danish ø character, not æ, which is indeed 0xE6) suggests to me that the input is *NOT* UTF-8, but instead ISO-8859-1 or ISO-8859-15, and the :utf8 was used. Update: I meant :encoding(utf8) here. ":utf8" should of course not be used for input.

If the input is ISO-8859, and the input layer is :utf8, you get lots of errors and you should be happy if any part of your program works correctly. Probably not the case here.

If the input is ISO-8859, and the input layer is :encoding(utf8), you get substitution characters for practically all non-ASCII characters.

The only correct way to read an ISO-8859-15 text file or stream is to use :encoding(ISO-8859-15). This can be done automatically based on the locale, with "use open", see its documentation. Note that using that is likely to introduce problems for other users, especially those who don't have any locale, but do have a UTF-8 capable terminal. This, however, is not a Perl problem.

If you haven't already done so, please forget everything you've ever read and learned about Perl unicode support, and read perlunitut.

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
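To make the diagnosis above concrete, here is a minimal sketch using the core Encode module. The sample string and variable names are made up for illustration; they are not from the original program. It feeds ISO-8859-1 bytes to a strict UTF-8 decoder, which should fail with the same kind of "does not map to Unicode" complaint, and then decodes the same bytes with the encoding they are really in.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: "rød grød" as ISO-8859-1 bytes (0xF8 is ø in that encoding).
my $latin1_bytes = "r\xF8d gr\xF8d";

# Strict UTF-8 decoding: a lone 0xF8 byte is not valid UTF-8, so this croaks.
# LEAVE_SRC keeps decode() from modifying $latin1_bytes in place.
my $as_utf8 = eval {
    decode('UTF-8', $latin1_bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
print defined $as_utf8 ? "decoded as UTF-8\n" : "not UTF-8: $@";

# Decoding with the encoding the data is actually in succeeds.
my $text = decode('ISO-8859-1', $latin1_bytes);
printf "%d characters, second one is U+%04X\n",
    length($text), ord(substr($text, 1, 1));

# Re-encoded as UTF-8, each ø takes two bytes (C3 B8) instead of one.
my $utf8_bytes = encode('UTF-8', $text);
printf "%d bytes when encoded as UTF-8\n", length($utf8_bytes);
```

The same distinction applies to æ: the code point is U+00E6, its ISO-8859-1 encoding is the single byte E6, and its UTF-8 encoding is the two bytes C3 A6. These are two encodings of one character, not two different characters.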
Re^2: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by bcrowell2 (Friar) on Feb 25, 2008 at 02:29 UTC
Thanks, Juerd, for your reply.

>How are you READING the UTF-8 data? ... However, if you use :utf8 for input, you're in for trouble (malfunction and security bugs). Always use :encoding for text input.

I think that's what I did -- see the second line of the code: `use open ":encoding(utf8)";`

>The error message about 0xF8 (which is the Danish ø character, not æ, which is indeed 0xE6) suggests to me that the input is NOT UTF-8, but instead ISO-8859-1 or ISO-8859-15, and the :utf8 was used.

Aha. I checked the file the user sent me:

$ file a.a
a.a: ISO-8859 text

My program requires utf8 input, but the user was giving it iso-8859. I think when I cut and pasted it in a utf8-aware editor, it got changed into utf8.

Although my documentation states that the input file has to be utf8, is there any way I can make an explicit check for a bogus encoding? I suppose the crudest thing I could do would be to look at the output of the unix "file" command, but I wonder if there's something more elegant.
Re^3: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by Juerd (Abbot) on Feb 25, 2008 at 10:23 UTC
>use open ":encoding(utf8)";

Good.

>My program requires utf8 input, but the user was giving it iso-8859.

There's no easy way to fix the user. :)
Re^2: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by bcrowell2 (Friar) on Feb 25, 2008 at 04:43 UTC
Okay, here's what I came up with to test whether a file is valid utf8. I'm sure there's also some way to do this using a cpan module.

sub file_is_valid_utf8 {
  my $f = shift;
  open(F,"<:raw",$f) or return 0;
  local $/;
  my $x=<F>;
  close F;
  return is_valid_utf8($x);
}

# What's passed to this routine has to be a stream of bytes, not a utf8 string in which the characters are complete utf8 characters.
# That's why you typically want to call file_is_valid_utf8 rather than calling this directly.
sub is_valid_utf8 {
  my $x = shift;
  my $leading0 = '[\x{0}-\x{7f}]';
  my $leading10 = '[\x{80}-\x{bf}]';
  my $leading110 = '[\x{c0}-\x{df}]';
  my $leading1110 = '[\x{e0}-\x{ef}]';
  my $leading11110 = '[\x{f0}-\x{f7}]';
  my $utf8 = "($leading0\|($leading110$leading10)\|($leading1110$leading10$leading10)\|($leading11110$leading10$leading10$leading10))*";
  return ($x=~/^$utf8$/);
}
Re^3: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by Juerd (Abbot) on Feb 25, 2008 at 15:04 UTC
If you have the raw bytestring, the easiest way to see if it's valid UTF-8 is to decode it to a unicode string. If that fails, it wasn't utf8 enough :)

`utf8::decode($string) or die "Input is not valid UTF-8";`

or

`utf8::decode(my $text = $binary) or die "Input is not valid UTF-8";`

If you leave out the "or die" clause, any invalid UTF-8 will just be seen as ISO-8859-1.

Update: changed the examples as per ikegami's sound response.
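Putting the utf8::decode suggestion together with the earlier file-reading wrapper might look roughly like the sketch below. The sub name is borrowed from the earlier post, but the body, the sample data, and the temp-file demo are illustrative assumptions, not code from the original program. One caveat: utf8::decode accepts Perl's extended, lax UTF-8, so if strict checking matters, Encode's decode('UTF-8', ...) with FB_CROAK is the stricter alternative.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: true if the file's raw bytes are well-formed UTF-8.
sub file_is_valid_utf8 {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or return 0;   # read bytes, no decoding layer
    local $/;                                   # slurp the whole file
    my $bytes = <$fh>;
    close $fh;
    return 1 unless defined $bytes && length $bytes;   # treat an empty file as valid
    # utf8::decode converts in place and returns false on malformed input.
    return utf8::decode($bytes) ? 1 : 0;
}

# Demo with two made-up samples: "grød" as UTF-8 bytes and as ISO-8859-1 bytes.
for my $sample (["utf8", "gr\xC3\xB8d\n"], ["latin1", "gr\xF8d\n"]) {
    my ($label, $bytes) = @$sample;
    my ($fh, $path) = tempfile(UNLINK => 1);    # removed automatically at exit
    binmode $fh;
    print {$fh} $bytes;
    close $fh;
    printf "%s sample: %s\n", $label,
        file_is_valid_utf8($path) ? "valid UTF-8" : "NOT valid UTF-8";
}
```

A check like this could run before the main read loop so the user gets one clear message about the wrong encoding instead of a decode error at some later input line.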
Re^4: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by ikegami (Patriarch) on Feb 25, 2008 at 17:40 UTC
Re^2: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by shagbark (Acolyte) on Oct 22, 2014 at 01:31 UTC
~~question about why not to use :utf8 that was answered above~~ ... anybody know how I can delete this comment?
Re: i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by bcrowell2 (Friar) on Feb 25, 2008 at 23:13 UTC
Thanks, Juerd and ikegami, for your help! The utf8::decode solution is obviously cleaner (and probably faster) than my hand-coded version.