Unicode problem with some letters

OlegG has asked for the wisdom of the Perl Monks concerning the following question:

I ran into a problem when trying to read Unicode strings from a file. The code was something like:

open FH, '<:utf8', 'input.txt'
    or die $!;
my $str = <FH>;
close FH;

This file contains one letter. In the hex editor it looks like:

C3 A0

In a text editor it shows as "à", which I think is an Italian letter.
When I tried to

print $str;

I expected a "Wide character in print" warning followed by the letter, but all I get is a question mark inside a black box: "�".
Here are quick examples:

echo -e "\xC3\xA0" | perl -pne 'BEGIN{binmode STDIN, ":utf8"}'
echo -e "\xD0\xBC" | perl -pne 'BEGIN{binmode STDIN, ":utf8"}'

The first example produces the output described above, but the second works as expected (a "Wide character in print" warning followed by the expected letter, "м", from the Cyrillic alphabet).
Please tell me where the problem is.


Replies are listed 'Best First'.
Re: Unicode problem with some letters by moritz (Cardinal) on Aug 21, 2011 at 18:13 UTC
Perl can store Unicode strings internally in Latin-1 if no character in the string has a codepoint above 255. That's what happens here, and it's why you don't get the "wide character" warning -- none of your characters is "wider" than 255.

Note that you can still treat $str (or $_) as a character string, and print it if you set up an `:encoding(UTF-8)` IO layer on STDOUT:

$ echo -e "\xC3\xA0" | perl -CS -pne 'BEGIN{binmode STDIN, ":utf8"}; $_= uc'

Update: on my perl (5.14.1) it seems that $_ is always stored in UTF8 internally, but still the point applies that no codepoint is > 255 in that string, so none is "wide".

Perl 6 - second systems done right
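A minimal sketch of the point about code points and the "wide character" warning, assuming the core Encode module. The helper is_wide and the variable names are made up for illustration and are not from the thread:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode the two byte sequences from the question into character strings.
my $a_grave = decode('UTF-8', "\xC3\xA0");   # "à"
my $cyr_em  = decode('UTF-8', "\xD0\xBC");   # "м"

# Hypothetical helper: a character counts as "wide" for print
# only when its code point is above 255.
sub is_wide {
    my ($char) = @_;
    return ord($char) > 255 ? 1 : 0;
}

printf "U+%04X wide=%d\n", ord($a_grave), is_wide($a_grave);   # U+00E0 wide=0
printf "U+%04X wide=%d\n", ord($cyr_em),  is_wide($cyr_em);    # U+043C wide=1
```

Both strings hold exactly one character after decoding; only the Cyrillic one is above 255, which matches the difference in warnings seen in the question.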
Re^2: Unicode problem with some letters by OlegG (Monk) on Aug 21, 2011 at 18:25 UTC
Ok, thanks. But can you tell me why the output looks like "�" when the output layer is not set to utf8? Is Perl eating my data?
Re^3: Unicode problem with some letters by moritz (Cardinal) on Aug 21, 2011 at 19:54 UTC
When you don't specify :utf8 or :encoding(UTF-8), Perl assumes Latin-1 (aka ISO-8859-1):

$ echo -e "\xC3\xA0" | perl -pne 'BEGIN{binmode STDIN, ":utf8"}'|hexdump -C
e0

Latin-1 0xE0 encodes the codepoint U+00E0 LATIN SMALL LETTER A WITH GRAVE, which is the character that the UTF-8 string C3 A0 encodes. Since your terminal is configured to receive UTF-8 output (I suppose), it doesn't know what to do with perl's non-UTF-8 output, and shows the general "I'm confused" replacement character.
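To see what a handle without an encoding layer actually writes, without involving a terminal, here is a small sketch that prints to an in-memory filehandle. The helper bytes_written is hypothetical, and the bytes shown assume no default layers have been set through the open pragma or PERL_UNICODE:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $text to an in-memory handle opened with
# the given mode and return the bytes written as a hex string.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open(my $out, $mode, \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$out} $text;
    close $out;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $buffer;
}

my $a_grave = decode('UTF-8', "\xC3\xA0");   # one character, U+00E0

# No encoding layer: the character goes out as the single Latin-1 byte.
print bytes_written('>', $a_grave), "\n";                   # E0

# With a UTF-8 layer: the character is encoded back to two bytes.
print bytes_written('>:encoding(UTF-8)', $a_grave), "\n";   # C3 A0
```

A lone E0 byte is not valid UTF-8, which is why a UTF-8 terminal falls back to the replacement character for the first case.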
Re^4: Unicode problem with some letters by OlegG (Monk) on Aug 22, 2011 at 15:03 UTC
Re: Unicode problem with some letters by Khen1950fx (Canon) on Aug 22, 2011 at 09:22 UTC
I think that your problem boils down to three questions:

What is my locale?
Do I need to use binmode?
How do I get the appropriate results?

Knowing when to use binmode can prove to be a real challenge at times, so I put this script together to answer those questions.

#!/usr/bin/perl -l
use strict;
use warnings;
use Encode;
use Encode::Locale qw( $ENCODING_LOCALE );
use File::Util qw( needs_binmode );

# Report the locale encoding when File::Util says binmode is needed
if ( needs_binmode ne 0 ) {
    print "# ENCODING_LOCALE is $ENCODING_LOCALE";
    print "# Needs binmode";
}

# UTF-8 locale: print the raw UTF-8 bytes unchanged
if ( $ENCODING_LOCALE eq 'UTF-8' ) {
    my $str1 = "\xC3\xA0";
    my $str2 = "\xD0\xBC";
    print "$str1";
    print "$str2";
}

# Other locales: add a UTF-8 encoding layer to STDOUT first
if ( $ENCODING_LOCALE ne 'UTF-8' ) {
    binmode STDOUT, ':encoding(utf8)';
    my $str1 = "\xC3\xA0";
    my $str2 = "\xD0\xBC";
    print "$str1";
    print "$str2";
}

Encode::Locale and File::Util are required.
Re^2: Unicode problem with some letters by OlegG (Monk) on Aug 22, 2011 at 15:06 UTC
Looks portable. Thanks.
Re: Unicode problem with some letters by zentara (Cardinal) on Aug 22, 2011 at 16:22 UTC
A very nice explanation of Unicode strings starts on page 17 of Modern Perl, the free book, in the section called "Unicode and Strings".

I'm not really a human, but I play one on earth.