in reply to UTF-8 representation question

First, the code you posted and the code you ran are actually different. There's only one byte in the string assigned to $test in the code you posted. There were two in the code you ran.

Second, you'll find Devel::Peek's Dump more useful for debugging this kind of problem.

    >perl -MDevel::Peek -e"Dump(qq{\x{C2AE}});"
    SV = PV(0x23734c) at 0x235fc0
      REFCNT = 1
      FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
      PV = 0x182f574 "\354\212\256"\0 [UTF8 "\x{c2ae}"]
      CUR = 3
      LEN = 4

The UTF8 flag means the string is internally encoded as UTF-8. In the PV line, "\354\212\256" is the internal encoding (bytes) and "\x{c2ae}" is the string.

    >perl -MDevel::Peek -e"Dump('abc');"
    SV = PV(0x236e00) at 0x182f778
      REFCNT = 1
      FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK)
      PV = 0x182f56c "abc"\0
      CUR = 3
      LEN = 4

Without the UTF8 flag, the bytes "abc" are treated as iso-latin-1 by perl functions.

Can anyone tell me why the following code:

It all boils down to the following: Until you tell it otherwise by using "use utf8;", perl treats source code as iso-latin-1. You created a UTF-8 source file and failed to notify perl.

$test is assigned two bytes or two iso-latin-1 chars. (Same thing, the difference is in how it's used.) Adding "use utf8;" before the constant will cause the constant to be decoded from UTF-8.

When you call encode('UTF-8'), you're encoding characters that you've never decoded, producing junk.
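A minimal sketch of that effect, assuming the two bytes in question are the UTF-8 encoding of U+00AE (the registered trademark sign mentioned later in the thread). The variable names are illustrative only, and the bytes are written with escapes so the example does not depend on the source file's encoding:

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Assumed input: the two bytes of the UTF-8 encoding of U+00AE.
my $bytes = "\xC2\xAE";

# Never decoded, so perl sees two iso-latin-1 characters.
my $undecoded_length = length($bytes);

# Encoding the undecoded string encodes each of those two characters again.
my $double_encoded = encode('UTF-8', $bytes);

# Decoding first yields the single intended character.
my $chars          = decode('UTF-8', $bytes);
my $decoded_length = length($chars);

# Encoding the decoded character gives back the original two bytes.
my $round_trip = encode('UTF-8', $chars);

# Helper (hypothetical name) to show a byte string as hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }

printf "undecoded length: %d\n", $undecoded_length;
printf "double-encoded:   %s\n", hex_bytes($double_encoded);
printf "decoded length:   %d\n", $decoded_length;
printf "round trip:       %s\n", hex_bytes($round_trip);
```

The double-encoded result is four bytes rather than two, which is the junk that shows up when it is printed to a UTF-8 terminal.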

On a related note, what is the difference between:
$test = "\x{05D0}\x{20AC}";
and
$test = "\x05\xD0\x20\xAC";

The first is a string of two UNICODE characters (internally encoded as UTF-8).
The second is a string of four bytes or four iso-latin-1 characters.

    >perl -MDevel::Peek -e"Dump(qq{\x{05D0}\x{20AC}});"
    SV = PV(0x237354) at 0x235fc8
      REFCNT = 1
      FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
      PV = 0x18307bc "\327\220\342\202\254"\0 [UTF8 "\x{5d0}\x{20ac}"]
      CUR = 5
      LEN = 8

    >perl -MDevel::Peek -e"Dump(qq{\x05\xD0\x20\xAC});"
    SV = PV(0x237354) at 0x235fc8
      REFCNT = 1
      FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK)
      PV = 0x18307bc "\5\320 \254"\0
      CUR = 4
      LEN = 8
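The same difference can be checked without Devel::Peek. The following is only a sketch with illustrative variable names; it compares the lengths of the two strings and encodes the first one to show the bytes that appear in the PV field of the first dump:

```perl
use strict;
use warnings;
use Encode qw( encode );

my $chars = "\x{05D0}\x{20AC}";    # two characters: U+05D0 and U+20AC
my $bytes = "\x05\xD0\x20\xAC";    # four characters, each below 0x100

my $chars_length = length($chars);
my $bytes_length = length($bytes);

# Encoding the two-character string as UTF-8 yields five bytes.
my $encoded     = encode('UTF-8', $chars);
my $encoded_hex = join ' ', map { sprintf '%02X', ord } split //, $encoded;

print "length of first string:  $chars_length\n";
print "length of second string: $bytes_length\n";
print "first string as UTF-8:   $encoded_hex\n";
```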

Re^2: UTF-8 representation question
by bpa (Novice) on Sep 04, 2008 at 03:22 UTC
    Thank you, ikegami, for your extremely informative reply. I see your point about not notifying perl I was using UTF-8. When I include the 'use utf8' (without the semicolon) in front of my $test assignment line, the length() function still tells me that the length of $test is 2 chars, but it does print out correctly to the shell as the registered trademark symbol, instead of junk. If I insert the semicolon, it prints junk. Is it possible that my shell is interpreting the output differently than perl? I suppose I should read up on Perl's inner workings. One minor note however, the code I posted was in fact the code that I ran, so I am still confused on that point. I appreciate your help.

      Is it possible that my shell is interpreting the output differently than perl?

      It's all bytes until you pass them to something that cares, such as lc or your shell. Each of those "somethings" will interpret the bytes as it sees fit.
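As a rough illustration of how lc interprets what it is given, the sketch below passes it the same data twice: once as undecoded bytes and once decoded into a character. It assumes the input is the UTF-8 encoding of U+00C9 (capital E with acute), and that neither "use locale" nor the unicode_strings feature is in effect; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Assumed input: the UTF-8 encoding of U+00C9, as two raw bytes.
my $bytes = "\xC3\x89";

# The same data decoded into a single character.
my $chars = decode('UTF-8', $bytes);

# lc on the undecoded bytes does not see a capital E with acute,
# so with the assumptions above the bytes come back unchanged.
my $lc_bytes = lc($bytes);

# lc on the decoded character lowercases it to U+00E9.
my $lc_chars = lc($chars);

printf "lc on bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $lc_bytes;
printf "lc on chars: U+%04X\n", ord($lc_chars);
```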

      One minor note however, the code I posted was in fact the code that I ran, so I am still confused on that point. I appreciate your help.

      There are a number of steps between the file and PerlMonks where the bytes could have been substituted, plus PerlMonks itself and my browser.

      If I insert the semicolon, it prints junk.

      So far, we've only covered decoding the source. Sounds like you didn't properly encode the data while printing it. One way:

          #!/usr/bin/perl

          use strict;
          use warnings;

          # Provides Dump.
          use Devel::Peek;

          # Decode source from UTF-8.
          use utf8;

          # Decode STDIN as per locale.
          # Encode STDOUT & STDERR as per locale.
          use open qw( :std :locale );

          my $test = '...';
          print($test);
          Dump($test);

      Or if you want to decode/encode your input/output using a specific encoding, you can do it as follows:

          #!/usr/bin/perl

          use strict;
          use warnings;

          # Provides Dump.
          use Devel::Peek;

          # Decode source from UTF-8.
          use utf8;

          # Expect UTF-8 from STDIN.
          # Send UTF-8 to STDOUT & STDERR.
          BEGIN {
              binmode STDIN,  ':encoding(UTF-8)' or die;
              binmode STDOUT, ':encoding(UTF-8)' or die;
              binmode STDERR, ':encoding(UTF-8)' or die;
          }

          my $test = '...';
          print($test);
          Dump($test);

      Feel free to replace "..." with characters of your choice. If you're still having problems, please provide the Dump output, a description of what you see from the print (primarily the number of characters you see), how many characters you are expecting to see, and which of the two programs produced the output.