in reply to How to interpret characters in Devel::Peek CUR
So given this string: Triple “S” Industrial Corp (note funky quotes)
More precisely, you have this text encoded using UTF-8.
What are the characters \342\200\234 (the left funky quote)
They are octal escape sequences that produce the three bytes forming the UTF-8 encoding of «“» (U+201C).
use feature qw( say );
use Encode qw( encode );
say encode("UTF-8", "\N{LEFT DOUBLE QUOTATION MARK}") eq "\342\200\234";  # Output: 1
How would I manually decode them if I wanted to?
You could use
utf8::decode($s);
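As a rough sketch of what that call does, the snippet below decodes the string from the question in place and then works out the code point of the left quote by hand from its three bytes. The variable names are illustrative only, and the bytes are written as octal escapes so the result does not depend on how the source file is encoded.

```perl
use strict;
use warnings;

# The 30 UTF-8 bytes from the question, spelled with octal escapes.
my $s = "Triple \342\200\234S\342\200\235 Industrial Corp";

my $before = length($s);        # 30: one character per byte
my $ok     = utf8::decode($s);  # decodes in place; false if the bytes are not valid UTF-8
my $after  = length($s);        # 26: one character per code point

# Decoding a three-byte sequence by hand: 1110xxxx 10yyyyyy 10zzzzzz
my @b  = (0342, 0200, 0234);
my $cp = (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);

printf "%d -> %d (ok=%d), U+%04X\n", $before, $after, $ok, $cp;
# prints: 30 -> 26 (ok=1), U+201C
```

Checking the return value of utf8::decode is worthwhile, since it leaves the string unchanged when the bytes are not well-formed UTF-8.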
If this string was constructed from a string literal, then you should have used the following to tell Perl the source was encoded using UTF-8 instead of ASCII:
use utf8;
If this is read from a file, an encoding layer would do this automatically for you. You can set this up using
use open ':std', ':encoding(UTF-8)';
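To illustrate the effect of an encoding layer without needing a real file, here is a small sketch that attaches the layer to a single handle opened on an in-memory buffer. This is a per-handle alternative to the pragma above, and the buffer and variable names are made up for the example.

```perl
use strict;
use warnings;

# Stand-in for a file on disk: the UTF-8 bytes plus a newline.
my $bytes = "Triple \342\200\234S\342\200\235 Industrial Corp\n";

# The :encoding(UTF-8) layer decodes the data as it is read.
open(my $fh, '<:encoding(UTF-8)', \$bytes) or die "Can't open: $!";
my $line = <$fh>;
close($fh);
chomp($line);

my $chars = length($line);  # counts decoded characters, not bytes
print "$chars\n";           # prints: 26
```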
Is this why CUR reports 30 "perl characters" instead of 26 actual characters?
The string has 30 characters, not 26. You can verify this using length. If you were to decode those 30 bytes, you would get 26 Unicode Code Points, but that would be a different string, and length would return 26.
use feature qw( say );
use Encode qw( decode );
no utf8;
my $utf8 = "Triple “S” Industrial Corp";
say length($utf8);  # 30 chars
my $ucp = decode("UTF-8", $utf8);
say length($ucp);   # 26 chars
That said, CUR indicates the number of bytes of the string buffer that are being used, not the number of characters in the string. They just happen to be the same for your string.
use feature qw( say );
use Encode qw( decode );
use Devel::Peek qw( Dump );
no utf8;
my $utf8 = "Triple “S” Industrial Corp";
say length($utf8);  # 30 chars
Dump($utf8);        # CUR = 30
my $ucp = decode("UTF-8", $utf8);
say length($ucp);   # 26 chars
Dump($ucp);         # CUR = 30
Because we called length before Dump, you'll see the PERL_MAGIC_utf8 (w) magic was added to cache the length (MG_LEN = 26).
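A case where the two numbers differ may make the point about CUR clearer. The sketch below takes the undecoded 30-character string and changes only its internal storage format with utf8::upgrade. It uses the bytes pragma purely to peek at the buffer size that CUR would report; that pragma is generally discouraged outside of this kind of debugging, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $s = "Triple \342\200\234S\342\200\235 Industrial Corp";

my $len_before   = length($s);                    # 30 characters
my $bytes_before = do { use bytes; length($s) };  # 30 bytes in the buffer

utf8::upgrade($s);  # same string, different internal storage format

my $len_after   = length($s);                     # still 30 characters
my $bytes_after = do { use bytes; length($s) };   # 36: each of the six chars >= 0x80 now takes two bytes

print "$len_before/$bytes_before $len_after/$bytes_after\n";
# prints: 30/30 30/36
```

The string compares equal before and after the upgrade; only the buffer grows, which is why the buffer size cannot be read as a character count.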
Re^2: How to interpret characters in Devel::Peek CUR
by ait (Hermit) on Jun 16, 2020 at 15:18 UTC