How to interpret characters in Devel::Peek CUR

ait has asked for the wisdom of the Perl Monks concerning the following question:

Re: How to interpret characters in Devel::Peek CUR by haukex (Archbishop) on Jun 09, 2020 at 07:58 UTC
Please see "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

The Unicode character `U+201C LEFT DOUBLE QUOTATION MARK` (“) is encoded in UTF-8 as the bytes `e2 80 9c` (`\342\200\234`), and the Unicode character `U+201D RIGHT DOUBLE QUOTATION MARK` (”) is encoded in UTF-8 as the bytes `e2 80 9d` (`\342\200\235`).

One way to think about Perl strings is that they store either a sequence of bytes or a sequence of Unicode characters. In your case, the Devel::Peek output does not include the "`UTF8`" flag, which means that this string is bytes, and yes, that's why you're getting a length of 30. (Update: It is important to note, however, that testing a string's `UTF8` flag for anything other than debugging is a code smell - your code should normally rely on the fact that you're getting strings in the correct format.)

You can decode bytes to characters or encode characters to bytes using the Encode module, or, in the case of UTF-8, use the "built-in" utf8 module (note that you don't have to put `use utf8;` in your code to load it; `use utf8` means "this Perl source file is encoded in UTF-8", which may or may not be what you want). You can use `utf8::decode($string);` to decode the string you have, and then you'll see this output:

```
SV = PV(0x5584829062e0) at 0x558482ad2ee0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x558482b75b30 "Triple \342\200\234S\342\200\235 Industrial Corp"\0 [UTF8 "Triple \x{201c}S\x{201d} Industrial Corp"]
  CUR = 30
  LEN = 32
```

And `length` will now report 26. The `UTF8` flag means that the Perl string is storing Unicode characters (the fact that they're stored internally as UTF-8 should be considered an implementation detail). Almost all Perl operators (depending on the Perl version) and many Perl modules should handle Unicode correctly.

Note that it's usually best to decode data as it's coming into Perl (e.g. by specifying an open mode of `'<:encoding(UTF-8)'`) and to encode it as it leaves. Having to do this manually in your code sometimes means that the source where you're getting the data may be buggy with regard to Unicode. I don't know enough about PHP::Serialization to say if that's the case here, and the PHP `serialize` docs don't make any mention of Unicode either. Interestingly, the PHP String docs say "PHP only supports a 256-character set, and hence does not offer native Unicode support." So my guess is that the encoding to bytes happens somewhere before the data hits the PHP string, and then `serialize` and PHP::Serialization simply pass those bytes through. This means that, if the data is not always UTF-8, you'd have to know which encoding was used to store the Unicode data into the PHP string in order to decode it correctly.

As a general note, if you're working with Unicode it's best to be on the latest version of Perl and to put a `use 5.030;` (or whichever version you are running) at the top of the file to enable that version's feature bundle, which includes `unicode_strings`.
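The "decode on the way in, encode on the way out" advice above can be sketched roughly as follows. This is only an illustration: the variable names are made up, and the byte string is the one from the Devel::Peek output, written with octal escapes so that the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# The 30 bytes shown in the Devel::Peek output, spelled with octal escapes.
my $bytes = "Triple \342\200\234S\342\200\235 Industrial Corp";

# Decode once, at the point where the data enters the program: bytes -> characters.
my $chars = decode('UTF-8', $bytes);

# ... work with $chars here: length, regexes, substr etc. now see 26 characters ...

# Encode once, at the point where the data leaves the program: characters -> bytes.
my $out = encode('UTF-8', $chars);

printf "bytes: %d, characters: %d, round trip ok: %s\n",
    length($bytes), length($chars), ($out eq $bytes ? 'yes' : 'no');
# prints: bytes: 30, characters: 26, round trip ok: yes
```

When the data comes from a file or a socket, an `:encoding(UTF-8)` layer on the handle does the same two steps for you, which is usually preferable to calling `decode` and `encode` by hand.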
Re^2: How to interpret characters in Devel::Peek CUR by ait (Hermit) on Jun 11, 2020 at 04:55 UTC
Thanks a lot for this detailed answer! I read the doc you recommended and although I knew some of the stuff in there, it is definitely a great read and clarifies part of the untold story. It also helped me understand your answer better, for example:

> the fact that they're stored internally as UTF-8 should be considered an implementation detail

I think I found what seems to be the root cause of the issue: We are pulling data from an SQL Server database that is encoded in CP-1252 and we are using the DBI with the MS ODBC Driver for Linux version 13. It seems they are inserting UTF-8 data into that SQL Server, so when we get the data back in Perl the UTF-8 flag is not set (even though some records actually contain UTF-8 characters). When we insert that data into our UTF-8 PostgreSQL database, it seems to get double encoded. Also, some of these flawed records have a null terminator at the end too, which doesn't seem to affect the utf8 flag but it does mess up our trimming (the SQL Server char strings are padded with whitespace).

Data from SQL Server:

```
SV = PV(0x560c8bacdf90) at 0x560c8b9b7998
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x560c8bbfad00 "Triple \342\200\234S\342\200\235 Industrial Corp \0"\0
  CUR = 51
  LEN = 53
  COW_REFCNT = 1
```

Data after being stored in Postgres (and retrieved):

```
SV = PV(0x560c8bacdec0) at 0x560c8bb7c7b0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x560c8bbf38f0 "RAW: Triple \303\242\302\200\302\234S\303\242\302\200\302\235 Industrial Corp "\0 [UTF8 "RAW: Triple \x{e2}\x{80}\x{9c}S\x{e2}\x{80}\x{9d} Industrial Corp "]
  CUR = 61
  LEN = 63
```

Using `utf8::decode` on the string before storing into Postgres actually solves the issue.

So knowing that the SQL Server end (which is out of our control) has UTF-8 data in a CP-1252 database, would it be so wrong to force `utf8::decode` on all strings? Or is there a better way to deal with this?
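One plausible way to read the two dumps above: the undecoded UTF-8 bytes are treated as if each byte were a character in the range U+0000 to U+00FF, and those "characters" are encoded to UTF-8 a second time on the way into Postgres. The sketch below reproduces that byte pattern in plain Perl, without any database. It is an assumption about the mechanism, not a trace of what DBI or the drivers actually do, and `trim_field` is a hypothetical helper for the padding and trailing NUL mentioned above.

```perl
use strict;
use warnings;

# UTF-8 bytes as they arrive from SQL Server (no UTF8 flag set).
my $from_sqlserver = "\342\200\234S\342\200\235";

# Without decoding: every byte is taken as one character and encoded again.
my $double = $from_sqlserver;
utf8::encode($double);

# With decoding first: the 7 bytes become 3 characters, which encode back to the same 7 bytes.
my $chars = $from_sqlserver;
utf8::decode($chars) or die "not valid UTF-8";
my $single = $chars;
utf8::encode($single);

# Hypothetical helper: strip trailing padding whitespace and NUL bytes.
sub trim_field {
    my ($s) = @_;
    $s =~ s/[\s\0]+\z//;
    return $s;
}

printf "double encoded: %vo\n", $double;
printf "encoded once:   %vo\n", $single;
printf "trimmed:        '%s'\n", trim_field("Industrial Corp   \0");
```

The double-encoded string starts with the bytes `\303\242\302\200\302\234`, the same sequence that shows up in the dump of the data retrieved from Postgres.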
Re^3: How to interpret characters in Devel::Peek CUR by Tux (Canon) on Jun 11, 2020 at 12:52 UTC
As I previously commented here, much depends on the way you set up the connection, and there is still room to play with server-side encodings.

I recently worked from perl on Linux with a MS SQL server database, and got the best results with FreeTDS:

```perl
my $dbh = DBI->connect ("dbi:ODBC:mssql_freetds", $username, $password, \%dbi_attributes);
```

```
$ cat ~/.odbc.ini
[mssql_freetds]
Description = My MS SQL database
Driver = FreeTDS
TDS version = 7.2
Trace = No
Server = mysql.server.local
Port = 1433
Database = DatabaseName
User = UserName
Password = PassWord
Client Charset = UTF-8
```

The biggest difference between FreeTDS and the MS ODBC driver is the return type of UUID fields. The MS ODBC driver does not allow nested queries, whereas the FreeTDS driver does. So I used the ODBC driver to make a CSV dump of the database and the FreeTDS driver to actually work with the database.

For ODBC I did

```perl
my $dbh = DBI->connect ("dbi:ODBC:mssql_odbc", $username, $password, \%dbi_attributes);
```

```
$ cat ~/.odbc.ini
[mssql_odbc]
Description = My MS SQL database
Driver = ODBC Driver 17 for SQL Server
Server = mysql.server.local
Database = DatabaseName
User = UserName
Password = PassWord
```

Also make sure you put the fully qualified hostname in the server name. `localhost` will not work.

Enjoy, Have FUN! H.Merijn
Re^4: How to interpret characters in Devel::Peek CUR by ait (Hermit) on Jun 16, 2020 at 15:13 UTC
Re^3: How to interpret characters in Devel::Peek CUR by haukex (Archbishop) on Jun 11, 2020 at 10:03 UTC
> So knowing that the SQLServer end (which is out of our control) has UTF-8 data in a CP-1252 database would it be so wrong to force utf8::decode on all strings? Or is there a better way to deal with this?

It's too bad that the server is out of your control, since that seems to be the source of the problem. But anyway, yes, I think fixing the issue as early as possible - as you pull the data off the server - is the "best" (relatively) way to go about it.

Two things to keep in mind: Make sure that all the data really is UTF-8, and check the return value of `utf8::decode()`, because if that fails, then there's definitely something wrong with the encoding. But keep in mind that this check can produce false negatives, i.e. it reports no problem for data that is actually CP-1252 but also happens to decode as valid UTF-8. That is possible, though somewhat unlikely.
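A minimal sketch of checking the return value of `utf8::decode()` as data is pulled off the server. The helper name `decode_field` is hypothetical, and the fallback to CP-1252 is an assumption made for illustration (the thread only says the database is declared as CP-1252); logging and rejecting the record would be an equally reasonable reaction to a failed decode.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper (not from the thread): turn one fetched field into a character string.
sub decode_field {
    my ($bytes) = @_;
    return $bytes unless defined $bytes;      # pass NULL columns through unchanged

    my $chars = $bytes;                       # utf8::decode works in place, so copy first
    return $chars if utf8::decode($chars);    # true: the bytes were well-formed UTF-8

    # Assumption: anything that is not well-formed UTF-8 is CP-1252.
    return decode('cp1252', $bytes);
}

# UTF-8 encoded quotes and CP-1252 encoded quotes both end up as U+201C / U+201D.
for my $raw ("\xe2\x80\x9cS\xe2\x80\x9d", "\x93S\x94") {
    printf "%d byte(s) in, %d character(s) out\n", length($raw), length(decode_field($raw));
}
```

As noted above, a short CP-1252 string can by coincidence also be well-formed UTF-8, so this check catches most mismatches but cannot catch all of them.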
Re: How to interpret characters in Devel::Peek CUR by kcott (Archbishop) on Jun 09, 2020 at 05:37 UTC
G'day ait,

The characters, “ and ”, are U+201C and U+201D. The numbers \342\200\234 and \342\200\235 are the octal values of the bytes that make up those characters.

You can break those characters into their constituent bytes and check the octal values like this:

```
$ perl -C -E '
    my $x = "\x{201c}S\x{201d}";
    say $x;
    {
        use bytes;
        printf "%vo\n", $x;
    }
'
“S”
342.200.234.123.342.200.235
```

See also: bytes, noting the emboldened warning; and the vector flag information in sprintf.

-- Ken
Re^2: How to interpret characters in Devel::Peek CUR by haukex (Archbishop) on Jun 09, 2020 at 07:59 UTC
> The characters, “ and ”, are U+201C and U+201D. The numbers \342\200\234 and \342\200\235 are the octal values of the bytes that make up those characters.

Sorry, but this leaves out a very important bit: these are the bytes that make up the characters when encoded as UTF-8.
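To make the point about "when encoded as UTF-8" concrete, here is a small sketch that encodes the same character with three different encodings and prints the resulting bytes in hex. The choice of UTF-16BE and cp1252 as comparison encodings is mine, picked only for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $char = "\x{201c}";   # U+201C LEFT DOUBLE QUOTATION MARK, one character

# Same character, three encodings, three different byte sequences.
my %hex;
for my $encoding ('UTF-8', 'UTF-16BE', 'cp1252') {
    $hex{$encoding} = unpack 'H*', encode($encoding, $char);
    printf "%-8s %s\n", $encoding, $hex{$encoding};
}
```

This should show `e2809c` for UTF-8 (the `\342\200\234` from the dump), `201c` for UTF-16BE and `93` for cp1252.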
Re^3: How to interpret characters in Devel::Peek CUR by Tux (Canon) on Jun 09, 2020 at 11:11 UTC
What he said :). In EBCDIC land you'd get something completely different:

```
$ perl -MData::Peek -wE'say $^O;DPeek ("\x{201c}"); DPeek ("\x{201d}")'
os390
PV("\312\101\160"\0) [UTF8 "\x{201c}"]
PV("\312\101\161"\0) [UTF8 "\x{201d}"]
```

Enjoy, Have FUN! H.Merijn
Re^2: How to interpret characters in Devel::Peek CUR by ait (Hermit) on Jun 11, 2020 at 17:00 UTC
Wow, thanks for the use bytes trick! Curiously I use Perl for another project where I translate REST into Modbus and I use a lot of pack and unpack, but I never used bytes before. Thanks!!
Re^2: How to interpret characters in Devel::Peek CUR by ait (Hermit) on Jun 16, 2020 at 15:07 UTC
Thank you, kcott! The bytes nugget was a great tip!
Re: How to interpret characters in Devel::Peek CUR by ikegami (Patriarch) on Jun 09, 2020 at 17:07 UTC
> So given this string: Triple “S” Industrial Corp (note funky quotes)

More precisely, you have this text encoded using UTF-8.

> What are the characters \342\200\234 (the left funky quote)

They are octal escape sequences that produce the bytes that form the encoding of «`“`» using UTF-8.

```perl
use feature qw( say );
use Encode qw( encode );
say encode("UTF-8", "\N{LEFT DOUBLE QUOTATION MARK}") eq "\342\200\234";
# Output: 1
```

> How would I manually decode them if I wanted to ?

You could use

```perl
utf8::decode($s);
```

If this string was constructed from a string literal, then you should have used the following to tell Perl the source was encoded using UTF-8 instead of ASCII:

```perl
use utf8;
```

If this is read from a file, an encoding layer would do this automatically for you. You can set this up using

```perl
use open ':std', ':encoding(UTF-8)';
```

> Is this why CUR reports 30 "perl characters" instead of 26 actual characters?

The string has 30 characters, not 26. You can verify this using `length`.

If you were to decode those 30 bytes, you would get 26 Unicode Code Points, but that would be a different string, and `length` would return 26.

```perl
use feature qw( say );
use Encode qw( decode );
no utf8;   # the literal below is read as the raw UTF-8 bytes of this source file
my $utf8 = "Triple “S” Industrial Corp";
say length($utf8);   # 30 chars
my $ucp = decode("UTF-8", $utf8);
say length($ucp);    # 26 chars
```

That said, `CUR` indicates the number of bytes of the string buffer that are being used, not the number of characters in the string. They just happen to be the same for your string.

```perl
use feature qw( say );
use Encode qw( decode );
use Devel::Peek qw( Dump );
no utf8;   # the literal below is read as the raw UTF-8 bytes of this source file
my $utf8 = "Triple “S” Industrial Corp";
say length($utf8);   # 30 chars
Dump($utf8);         # CUR = 30
my $ucp = decode("UTF-8", $utf8);
say length($ucp);    # 26 chars
Dump($ucp);          # CUR = 30
```

Because we called `length` before `Dump`, you'll see the `PERL_MAGIC_utf8` (`w`) magic was added to cache the length (`MG_LEN = 26`).
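The difference between the number of characters (`length`) and the number of buffer bytes (`CUR`) can also be seen without Devel::Peek. The sketch below is my own illustration, not ikegami's code: it uses `bytes::length` purely as a debugging aid to look at the buffer size, and `utf8::upgrade` to switch the internal storage format of a string without changing its contents.

```perl
use strict;
use warnings;
use bytes ();   # load the module without enabling the pragma; only bytes::length is used

my $s = "caf\xe9";                          # 4 characters, stored as 4 bytes

my $chars_before = length($s);
my $buf_before   = bytes::length($s);

utf8::upgrade($s);                          # same string, now stored internally as UTF-8

my $chars_after = length($s);
my $buf_after   = bytes::length($s);        # \xe9 now takes two bytes in the buffer

printf "characters: %d -> %d, buffer bytes: %d -> %d\n",
    $chars_before, $chars_after, $buf_before, $buf_after;
# prints: characters: 4 -> 4, buffer bytes: 4 -> 5
```

The string compares equal before and after the upgrade; only the internal representation, and with it the value Devel::Peek would report as `CUR`, has changed.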
Re^2: How to interpret characters in Devel::Peek CUR by ait (Hermit) on Jun 16, 2020 at 15:18 UTC
Thanks ikegami for taking the time to show TMTOWTDI with the built-in utf8 and with Encode!