Re: Unicode problem with some letters

Perl can store Unicode strings internally in Latin-1 if no character in the string has a codepoint above 255.

That's what happens here, and it's why you don't get the "wide character" warning -- none of your characters is "wider" than 255.
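To see that threshold for yourself, here is a minimal sketch. The helper name warnings_when_printed is made up for illustration. It prints one string containing U+00E0 and one containing U+263A to an in-memory filehandle that has no encoding layer, and collects whatever warnings perl emits. Only the second string should trigger "Wide character in print".

```perl
use strict;
use warnings;

# Hypothetical helper: print $str to an in-memory filehandle without any
# encoding layer and return the warnings perl emitted while doing so.
sub warnings_when_printed {
    my ($str) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    open my $fh, '>', \my $buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $str;
    close $fh;
    return @warnings;
}

my @narrow = warnings_when_printed("\x{E0}");    # U+00E0, codepoint <= 255
my @wide   = warnings_when_printed("\x{263A}");  # U+263A, codepoint > 255

printf "U+00E0: %d warning(s)\n", scalar @narrow;
printf "U+263A: %d warning(s)\n", scalar @wide;
print  "warning text: $wide[0]" if @wide;
```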

Note that you can still treat $str (or $_) as a character string, and print it if you set up an :encoding(UTF-8) IO layer on STDOUT:

$ echo -e "\xC3\xA0" | perl -CS -pne 'BEGIN{binmode STDIN, ":utf8"}; $_= uc'
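The same steps can be sketched as a script, which makes each stage easier to inspect. This is only an illustration under a few assumptions: the input bytes are hard-coded instead of read from STDIN, the Encode module stands in for the IO layers, and hex_bytes is a made-up helper.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $input_bytes = "\xC3\xA0";                     # the bytes echo sends, minus the newline
my $chars       = decode('UTF-8', $input_bytes);  # one character: U+00E0
my $upper       = uc $chars;                      # upper-cased character string
my $output      = encode('UTF-8', $upper);        # bytes a UTF-8 output layer would write

printf "decoded:    U+%04X\n", ord $chars;
printf "uppercased: U+%04X\n", ord $upper;
printf "output:     %s\n",     hex_bytes($output);
```

Both the decoded and the upper-cased character have codepoints below 256, so nothing in this pipeline is "wide".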

Update: on my perl (5.14.1) it seems that $_ is always stored in UTF8 internally, but still the point applies that no codepoint is > 255 in that string, so none is "wide".
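If you want to check the internal storage yourself, utf8::is_utf8 reports whether a scalar is currently held in UTF-8 internally, and utf8::upgrade switches it to that representation. A small sketch (the variable names are just for illustration) showing that the representation changes while the character string stays the same:

```perl
use strict;
use warnings;

my $native   = "\x{E0}";      # one character, U+00E0, stored as a single byte
my $upgraded = $native;
utf8::upgrade($upgraded);     # same character, now stored as UTF-8 internally

printf "native:   utf8 flag=%d length=%d\n",
    (utf8::is_utf8($native)   ? 1 : 0), length $native;
printf "upgraded: utf8 flag=%d length=%d\n",
    (utf8::is_utf8($upgraded) ? 1 : 0), length $upgraded;
print "both compare equal as character strings\n" if $native eq $upgraded;
```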

Perl 6 - second systems done right


Re^2: Unicode problem with some letters by OlegG (Monk) on Aug 21, 2011 at 18:25 UTC
Ok, thanks. But can you tell me why the output looks like "�" when the output layer isn't set to utf8? Is Perl eating my data?
Re^3: Unicode problem with some letters by moritz (Cardinal) on Aug 21, 2011 at 19:54 UTC
When you don't specify :utf8 or :encoding(UTF-8), Perl assumes Latin-1 (aka ISO-8859-1):

$ echo -e "\xC3\xA0" | perl -pne 'BEGIN{binmode STDIN, ":utf8"}' | hexdump -C
e0

Latin-1 0xE0 encodes the codepoint U+00E0 LATIN SMALL LETTER A WITH GRAVE, which is the character that the UTF-8 string C3 A0 encodes. Since your terminal is configured to receive UTF-8 output (I suppose), it doesn't know what to do with perl's non-UTF-8 output, and shows the general "I'm confused" replacement character.
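To compare the two encodings side by side, here is a rough sketch using the Encode module. The helper hex_bytes is made up for illustration, and the last step only approximates what a UTF-8 terminal does when it receives a lone E0 byte.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $char   = decode('UTF-8', "\xC3\xA0");     # U+00E0
my $latin1 = encode('ISO-8859-1', $char);     # what perl writes without an output layer
my $utf8   = encode('UTF-8', $char);          # what a UTF-8 terminal expects

printf "Latin-1 bytes: %s\n", hex_bytes($latin1);
printf "UTF-8 bytes:   %s\n", hex_bytes($utf8);

# Decoding the lone Latin-1 byte as UTF-8 fails; with its default settings
# Encode substitutes the replacement character for the malformed input.
my $misread = decode('UTF-8', "\xE0");
printf "E0 read as UTF-8: U+%04X\n", ord $misread;
```

So no data is lost inside perl. The byte that comes out is a valid Latin-1 encoding of the character, it just isn't valid UTF-8.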
Re^4: Unicode problem with some letters by OlegG (Monk) on Aug 22, 2011 at 15:03 UTC
Thank you. Now I totally understand.