in reply to How to interpret characters in Devel::Peek CUR
The Unicode character U+201C LEFT DOUBLE QUOTATION MARK (“) is encoded in UTF-8 as the bytes e2 80 9c (\342\200\234), and the Unicode character U+201D RIGHT DOUBLE QUOTATION MARK (”) is encoded in UTF-8 as the bytes e2 80 9d (\342\200\235).
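If you want to check those byte sequences yourself, here is a minimal sketch using the core Encode module. The helper name utf8_hex is just something I made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 encoding of a code point
# as space-separated hex bytes.
sub utf8_hex {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr $codepoint);
    return join ' ', unpack '(H2)*', $bytes;
}

printf "U+%04X => %s\n", $_, utf8_hex($_) for 0x201C, 0x201D;
# U+201C => e2 80 9c
# U+201D => e2 80 9d
```

The octal escapes in the Devel::Peek output are the same bytes: \342 is 0xe2, \200 is 0x80, and \234 is 0x9c.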
One way to think about Perl strings is that they store either a sequence of bytes or a sequence of Unicode characters. In your case, the Devel::Peek output does not include the "UTF8" flag, which means that this string holds bytes, and yes, that's why you're getting a length of 30: each of the two quotation marks counts as three bytes instead of one character. (Update: It is important to note, however, that testing a string's UTF8 flag for anything other than debugging is a code smell; your code should normally be able to rely on getting strings in the correct format.)
You can decode bytes to characters or encode characters to bytes using the Encode module, or, in the case of UTF-8, use the "built-in" utf8 module (note that you don't have to put use utf8; in your code to load it; use utf8 means "this Perl source file is encoded in UTF-8", which may or may not be what you want). You can use utf8::decode($string); to decode the string you have, and then you'll see this output:
SV = PV(0x5584829062e0) at 0x558482ad2ee0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x558482b75b30 "Triple \342\200\234S\342\200\235 Industrial Corp"\0 [UTF8 "Triple \x{201c}S\x{201d} Industrial Corp"]
  CUR = 30
  LEN = 32
And length will now report 26. The UTF8 flag means that the Perl string is storing Unicode characters (the fact that they're stored internally as UTF-8 should be considered an implementation detail). Almost all Perl operators (depending on the Perl version) and many Perl modules should handle Unicode correctly.
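Here is a small sketch of both decoding routes, assuming the data really is UTF-8. The byte string is typed in with the same octal escapes that Devel::Peek shows, and the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The raw bytes, as in the Devel::Peek output without the UTF8 flag.
my $bytes = "Triple \342\200\234S\342\200\235 Industrial Corp";
print length($bytes), "\n";        # 30 (bytes)

# Route 1: utf8::decode works in place and returns false
# if the bytes are not valid UTF-8.
my $in_place = $bytes;
utf8::decode($in_place) or die "not valid UTF-8";
print length($in_place), "\n";     # 26 (characters)

# Route 2: Encode::decode returns a new, decoded string.
my $copy  = $bytes;
my $chars = decode('UTF-8', $copy);
print length($chars), "\n";        # 26 (characters)
```

Both routes give the same character string; which one you use is mostly a matter of taste and of how you want decoding errors reported.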
Note that it's usually best to decode data as it's coming into Perl (e.g. by specifying an open mode of '<:encoding(UTF-8)') and to encode it as it leaves. Having to decode manually in your code sometimes means that the source you're getting the data from may be buggy with regard to Unicode. I don't know enough about PHP::Serialization to say if that's the case here, and the PHP serialize docs don't make any mention of Unicode either. Interestingly, the PHP String docs say "PHP only supports a 256-character set, and hence does not offer native Unicode support." So my guess is that the encoding to bytes happens somewhere before the data hits the PHP string, and then serialize and PHP::Serialization simply pass those bytes through. This means that, in case it's not always UTF-8, you'd have to know which encoding was used to store the Unicode data in the PHP string in order to decode it correctly.
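To illustrate the "decode on input, encode on output" idea with plain files, here is a rough sketch. It uses a temporary file from the core File::Temp module as a stand-in for whatever your real data source is, so it says nothing about how PHP::Serialization behaves.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for a real data file; removed again when the program exits.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close $tmp_fh or die "$file: $!";

# Encode on the way out: characters are written to disk as UTF-8 bytes.
open my $out, '>:encoding(UTF-8)', $file or die "$file: $!";
print {$out} "Triple \x{201c}S\x{201d} Industrial Corp\n";
close $out or die "$file: $!";

# Decode on the way in: the layer turns the bytes back into characters.
open my $in, '<:encoding(UTF-8)', $file or die "$file: $!";
my $line = <$in>;
close $in or die "$file: $!";
chomp $line;

print length($line), "\n";    # 26 characters, no manual decoding needed
```

With the layers in place, the rest of the code only ever sees character strings, which is usually what you want.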
As a general note, if you're working with Unicode it's best to be on the latest version of Perl and to put a use 5.030; at the top of the file to enable all of its features.
Re^2: How to interpret characters in Devel::Peek CUR
by ait (Hermit) on Jun 11, 2020 at 04:55 UTC
by Tux (Canon) on Jun 11, 2020 at 12:52 UTC
by ait (Hermit) on Jun 16, 2020 at 15:13 UTC
by haukex (Archbishop) on Jun 11, 2020 at 10:03 UTC