O Monks,

I have a program that reads utf8 from a file, and writes utf8 to stdout. It's internationalized in a bunch of languages. The relevant section of the code seems to be the following:

    binmode STDOUT, ":utf8"; # eliminates "Wide character in print" error in Czech
    use open ":encoding(utf8)"; # otherwise utf8 in input files is read as if 1 character==1 byte
    # The combination of two lines above is needed in order to get the following to work:
    #   - Czech characters coded into the source print without the "Wide character in print" error.
    #   - Accented characters and Greek characters in the input file are read properly and printed back out properly.
    # When testing this, make sure to use a terminal such as mlterm that can handle accented characters,
    # and make sure that the --nofilter_accents_on_output has not been set automatically based on the
    # value of the $TERM variable. (Using mlterm prevents this.)
    # See "man perlunicode".
    # An example of the confusing way all of this works:
    #   perl -e 'binmode STDOUT,":utf8"; print "\x{11b}\x{e9}"'
    #   perl -e 'binmode STDOUT,":utf8"; print "\x{11b}\x{e9}"' >a.a
    #   perl -e 'binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print $x'
    #   perl -e 'binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print length $x'
    #   perl -e 'use open ":encoding(utf8)"; binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print $x'
    #   perl -e 'use open ":encoding(utf8)"; binmode STDOUT,":utf8"; open(F,"<a.a"); $x=<F>; close F; print length $x'
    use utf8; # Indicates that source can contain utf8, which we use for the Greek translation.
    use locale;

As you can see from the length of the comments, it hasn't been as straightforward as I would have liked to make this Just Work for my users.

The latest problem has to do with the line 'binmode STDOUT, ":utf8";'. This was needed in order to avoid a "Wide character in print" error in Czech. However, adding that line seems to have broken the program for a Danish-speaking user. If he uses a utf8-encoded input file with a ligatured ae character (c3a6), he gets errors like 'utf8 "\xF8" does not map to Unicode at ./when line 1389, <FILE> line 29.' I do not get the same error on the same input file on my own machine. He's running Debian Etch with LANG=en_US.ISO-8859-15 LC_CTYPE=C, and a US keyboard layout. I'm running Ubuntu Gutsy with a US setup. It sounds as though the utf8 codes that perl is complaining about are different from the ones that are actually in his input file -- they all have F and E in the LSB. (I'm checking back with him on this, since there's some confusion in the emails.)
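One hypothesis I can test locally is that the bytes perl actually sees are not utf8 at all. The sketch below is only an assumption about what might be happening, not a diagnosis: it feeds a lone byte 0xF8 (which is how ISO-8859-15 would store an o-with-stroke) to a strict UTF-8 decoder, and then decodes the same byte as ISO-8859-15. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: a lone byte 0xF8, as a Latin-9 (ISO-8859-15) file would store it.
my $bytes = "\xF8";

# Strict UTF-8 decoding should reject this byte; FB_CROAK turns the failure into a die.
my $copy = $bytes;    # decode() may modify its argument when a CHECK flag is given
my $decoded = eval { decode("UTF-8", $copy, Encode::FB_CROAK()) };
my $err = $@;
print "UTF-8 decode failed: $err" unless defined $decoded;

# The same byte is a perfectly valid character in ISO-8859-15.
my $char = decode("iso-8859-15", $bytes);
printf "As ISO-8859-15: U+%04X\n", ord($char);    # U+00F8
```

If the user's file produces the same kind of message, running something like "od -c" on a few of the offending lines would show whether the accented characters are stored as one byte or two.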

The Wikipedia article on the ae character, http://en.wikipedia.org/wiki/%C3%86 , says it's unicode e6. Maybe this is a character that can be encoded in unicode in two different ways? If I display c3a6 in a unicode-aware terminal like mlterm, it does display as a ligatured ae. Maybe perl is trying to convert it to the single-character version, or something??
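To check whether e6 and c3a6 are really two different things, here is a minimal sketch using the core Encode module. It encodes the code point U+00E6 to UTF-8 and prints the resulting bytes, then decodes them back. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00E6 is the code point Wikipedia lists for the ae character.
my $ae = "\x{e6}";

# Encode the single character to UTF-8 and show the bytes in hex.
my $bytes = encode("UTF-8", $ae);
my $hex = join " ", map { sprintf "%02x", ord } split //, $bytes;
print "UTF-8 bytes: $hex\n";    # c3 a6

# Decoding the two bytes gives back one character with the same code point.
my $back = decode("UTF-8", $bytes);
printf "code point: U+%04X, length %d\n", ord($back), length($back);
```

So e6 is the code point and c3 a6 is that same code point written out as utf8 bytes, rather than two competing encodings of the character.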

Does anyone have any clue what might be happening here?

TIA!


In reply to i18n/utf8 problem, 'utf8 "\xF8" does not map to Unicode' by bcrowell2
