in reply to Unicode problem with some letters

Perl can store Unicode strings internally in Latin-1 if no character in the string has a codepoint above 255.

That's what happens here, and it's why you don't get the "wide character" warning -- none of your characters is "wider" than 255.

Note that you can still treat $str (or $_) as a character string, and print it if you set up an :encoding(UTF-8) IO layer on STDOUT:

$ echo -e "\xC3\xA0" | perl -CS -pne 'BEGIN{binmode STDIN, ":utf8"}; $_ = uc'

Update: on my perl (5.14.1) it seems that $_ is always stored in UTF-8 internally, but the point still applies: no codepoint in that string is > 255, so none is "wide".

Re^2: Unicode problem with some letters
by OlegG (Monk) on Aug 21, 2011 at 18:25 UTC
    Ok, thanks.
    But can you tell me why the output looks like "�" when I don't set the output layer to utf8? Is Perl eating my data?

      When you don't specify :utf8 or :encoding(UTF-8), Perl assumes Latin-1 (aka ISO-8859-1):

      $ echo -e "\xC3\xA0" | perl -pne 'BEGIN{binmode STDIN, ":utf8"}' | hexdump -C
      e0

      Latin-1 0xE0 encodes the codepoint U+00E0 LATIN SMALL LETTER A WITH GRAVE, which is the character that the UTF-8 string C3 A0 encodes.

      Since your terminal is configured to receive UTF-8 output (I suppose), it doesn't know what to do with perl's non-UTF-8 output, and shows the general "I'm confused" replacement character.

        Thank you. Now I totally understand.