OlegG has asked for the wisdom of the Perl Monks concerning the following question:

I ran into a problem when I tried to read Unicode strings from a file. The code was something like:
open FH, '<:utf8', 'input.txt' or die $!;
my $str = <FH>;
close FH;
This file contains one letter. In the hex editor it looks like:
C3 A0
And in the text editor it is "à". I think it is Italian.
When I tried to
print $str;
I expected a "Wide character in print" warning followed by the letter, but all I get is a question mark inside a black box: "�".
Here are quick examples:
echo -e "\xC3\xA0" | perl -pne 'BEGIN{binmode STDIN, ":utf8"}'
echo -e "\xD0\xBC" | perl -pne 'BEGIN{binmode STDIN, ":utf8"}'
The output of the first example is as described above, but the second works as expected (with "Wide character in print" and the expected letter, "м", from the Cyrillic alphabet).
Please tell me where the problem is.

Replies are listed 'Best First'.
Re: Unicode problem with some letters
by moritz (Cardinal) on Aug 21, 2011 at 18:13 UTC

    Perl can store Unicode strings internally in Latin-1 if no character in the string has a codepoint above 255.

    That's what happens here, and it's why you don't get the "wide character" warning -- none of your characters is "wider" than 255.
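
    A minimal sketch of that difference, assuming the two byte sequences from the question are decoded with Encode and printed to an in-memory filehandle that has no encoding layer. The helper name warnings_when_printed is made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode the two UTF-8 byte sequences from the question into character strings.
my $a_grave = decode('UTF-8', "\xC3\xA0");   # U+00E0, fits in Latin-1
my $em      = decode('UTF-8', "\xD0\xBC");   # U+043C, above 255

# Hypothetical helper: print a string to an in-memory handle without any
# encoding layer and return the warnings that the print triggered.
sub warnings_when_printed {
    my ($str) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    open my $fh, '>', \my $buffer or die $!;
    print {$fh} $str;
    close $fh;
    return @warnings;
}

printf "U+%04X: %d warning(s)\n", ord($a_grave), scalar warnings_when_printed($a_grave);
printf "U+%04X: %d warning(s)\n", ord($em),      scalar warnings_when_printed($em);
```

    Only the second string should produce a "Wide character in print" warning, because only its codepoint is above 255.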

    Note that you can still treat $str (or $_) as a character string, and print it if you set up an :encoding(UTF-8) IO layer on STDOUT, or use -CS as here:

    $ echo -e "\xC3\xA0" | perl -CS -pne 'BEGIN{binmode STDIN, ":utf8"}; $_ = uc'

    Update: on my perl (5.14.1) it seems that $_ is always stored in UTF8 internally, but still the point applies that no codepoint is > 255 in that string, so none is "wide".

      Ok, thanks.
      But can you tell me why output without setting output layer to utf8 looks like "�"? Perl eats my data?

        When you don't specify :utf8 or :encoding(UTF-8), Perl assumes Latin-1 (aka ISO-8859-1):

        $ echo -e "\xC3\xA0" | perl -pne 'BEGIN{binmode STDIN, ":utf8"}' | hexdump -C
        e0

        Latin-1 0xE0 encodes the codepoint U+00E0 LATIN SMALL LETTER A WITH GRAVE, which is the character that the UTF-8 string C3 A0 encodes.

        Since your terminal is configured to receive UTF-8 output (I suppose), it doesn't know what to do with perl's non-UTF-8 output, and shows the general "I'm confused" replacement character.
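
        The same effect can be reproduced without a terminal. This is a rough sketch, assuming an in-memory filehandle stands in for STDOUT; the helper name bytes_written is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "à" as a character string, as it would arrive through a ":utf8" input layer.
my $char = decode('UTF-8', "\xC3\xA0");

# Hypothetical helper: print $char to an in-memory handle opened with the
# given mode and return the bytes that were written, formatted as hex.
sub bytes_written {
    my ($mode) = @_;
    open my $fh, $mode, \my $buffer or die $!;
    print {$fh} $char;
    close $fh;
    return join ' ', map { sprintf '%02X', ord } split //, $buffer;
}

print "no layer:         ", bytes_written('>'), "\n";
print ":encoding(UTF-8): ", bytes_written('>:encoding(UTF-8)'), "\n";
```

        Without a layer the single Latin-1 byte E0 is written, which is not valid UTF-8 on its own; with the :encoding(UTF-8) layer the original bytes C3 A0 come back out.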

Re: Unicode problem with some letters
by Khen1950fx (Canon) on Aug 22, 2011 at 09:22 UTC
    I think that your problem boils down to three questions:

    What is my locale?
    Do I need to use binmode?
    How do I get the appropriate results?

    Knowing when to use binmode can prove to be a real challenge at times, so I put this script together to answer those questions.
    #!/usr/bin/perl -l
    use strict;
    use warnings;
    use Encode;
    use Encode::Locale qw( $ENCODING_LOCALE );
    use File::Util qw( needs_binmode );

    if ( needs_binmode ne 0 ) {
        print "# ENCODING_LOCALE is $ENCODING_LOCALE";
        print "# Needs binmode";
    }

    if ( $ENCODING_LOCALE eq 'UTF-8' ) {
        my $str1 = "\xC3\xA0";
        my $str2 = "\xD0\xBC";
        print "$str1";
        print "$str2";
    }

    if( $ENCODING_LOCALE ne 'UTF-8' ) {
        binmode STDOUT, ':encoding(utf8)';
        my $str1 = "\xC3\xA0";
        my $str2 = "\xD0\xBC";
        print "$str1";
        print "$str2";
    }
    Encode::Locale and File::Util are required.
      Looks portable. Thanks.
Re: Unicode problem with some letters
by zentara (Cardinal) on Aug 22, 2011 at 16:22 UTC