Please see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

The Unicode character U+201C LEFT DOUBLE QUOTATION MARK (“) is encoded in UTF-8 as the bytes e2 80 9c (\342\200\234), and the Unicode character U+201D RIGHT DOUBLE QUOTATION MARK (”) is encoded in UTF-8 as the bytes e2 80 9d (\342\200\235).
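If you'd like to check those byte values yourself, here is a minimal sketch using the core Encode module. The helper name utf8_bytes is just something I made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode one character to UTF-8 and return its bytes
# formatted as space-separated hex and as backslashed octal escapes.
sub utf8_bytes {
    my ($char) = @_;
    my @bytes = unpack 'C*', encode('UTF-8', $char);
    return (
        join(' ', map { sprintf '%02x',   $_ } @bytes),
        join('',  map { sprintf '\\%03o', $_ } @bytes),
    );
}

for my $cp (0x201C, 0x201D) {
    my ($hex, $oct) = utf8_bytes(chr $cp);
    printf "U+%04X: %s (%s)\n", $cp, $hex, $oct;
}
```

The octal form is the same notation that Devel::Peek uses when it shows the PV.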

One way to think about Perl strings is that they store either a sequence of bytes or a sequence of Unicode characters. In your case, the Devel::Peek output does not include the "UTF8" flag, which means that this string holds bytes, and yes, that's why you're getting a length of 30: each of the two quotation marks is counted as three bytes instead of one character. (Update: It is important to note, however, that testing a string's UTF8 flag for anything other than debugging is a code smell; your code should normally be able to rely on getting strings in the correct format.)
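To illustrate the difference, here is a small sketch that builds the same text once as bytes and once as characters. I'm assuming the string is the one from your dump; the variable names are my own.

```perl
use strict;
use warnings;

# Byte string: the quotation marks are spelled out as their raw UTF-8 bytes.
my $bytes = "Triple \xE2\x80\x9CS\xE2\x80\x9D Industrial Corp";

# Character string: each quotation mark is a single code point.
my $chars = "Triple \x{201C}S\x{201D} Industrial Corp";

print "bytes: ", length($bytes), "\n";   # 30
print "chars: ", length($chars), "\n";   # 26

# Looking at the flag is fine for debugging, but don't base program logic on it.
my $flag_on = utf8::is_utf8($chars) ? 1 : 0;
print "UTF8 flag on character string: $flag_on\n";
```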

You can decode bytes to characters or encode characters to bytes using the Encode module, or, in the case of UTF-8, use the "built-in" utf8 module. (Note that you don't have to put use utf8; in your code to call its functions, since they are always available; use utf8; itself means "this Perl source file is encoded in UTF-8", which may or may not be what you want.) You can use utf8::decode($string); to decode the string you have, and then you'll see this output:

SV = PV(0x5584829062e0) at 0x558482ad2ee0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x558482b75b30 "Triple \342\200\234S\342\200\235 Industrial Corp"\0 [UTF8 "Triple \x{201c}S\x{201d} Industrial Corp"]
  CUR = 30
  LEN = 32

And length will now report 26. The UTF8 flag means that the Perl string is storing Unicode characters (the fact that they're stored internally as UTF-8 should be considered an implementation detail). Almost all Perl operators (depending on the Perl version) and many Perl modules should handle Unicode correctly.
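Here is a short sketch of both decoding routes, assuming the input really is valid UTF-8. The variable names are mine, and the byte string is typed out by hand to stand in for what you got from PHP::Serialization.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for the byte string from the original dump.
my $bytes = "Triple \xE2\x80\x9CS\xE2\x80\x9D Industrial Corp";

# Option 1: utf8::decode converts in place and returns false on malformed UTF-8.
my $in_place = $bytes;
utf8::decode($in_place) or die "input is not valid UTF-8";

# Option 2: Encode::decode leaves its argument alone and returns a new string.
my $copy = decode('UTF-8', $bytes);

print length($bytes),    "\n";   # 30 (still bytes)
print length($in_place), "\n";   # 26
print length($copy),     "\n";   # 26
```

Note that CUR in the dump stays at 30 after decoding, because it counts the bytes of the internal buffer, while length counts characters.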

Note that it's usually best to decode data as it's coming into Perl (e.g. specifying an open mode of '<:encoding(UTF-8)') and encode it as it leaves. Having to do this manually in your code sometimes means that the source you're getting the data from may be buggy with regard to Unicode. I don't know enough about PHP::Serialization to say if that's the case here, and the PHP serialize docs don't make any mention of Unicode either. Interestingly, the PHP String docs say "PHP only supports a 256-character set, and hence does not offer native Unicode support." So my guess is that the encoding to bytes happens somewhere before the data hits the PHP string, and then serialize and PHP::Serialization simply pass those bytes through. This means that, in case it's not always UTF-8, you'd have to know which encoding was used to store the Unicode data into the PHP string in order to decode it correctly.
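As a rough sketch of "decode on the way in, encode on the way out": the temporary file below is only a stand-in for whatever your real input is, and I'm assuming it is UTF-8.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Set up a sample file containing raw UTF-8 bytes (stand-in for real input).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "Triple \xE2\x80\x9CS\xE2\x80\x9D Industrial Corp\n";
close $tmp_fh or die "close $path: $!";

# Decode on the way in: the layer turns the bytes into characters as they are read.
open my $in, '<:encoding(UTF-8)', $path or die "open $path: $!";
chomp(my $line = <$in>);
close $in;

# Encode on the way out: characters become UTF-8 bytes again when printed.
binmode STDOUT, ':encoding(UTF-8)';
print length($line), "\n";   # 26
print "$line\n";
```

With the layers in place, the rest of the program only ever sees character strings and never has to call a decode function itself.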

As a general note, if you're working with Unicode it's best to be on the latest version of Perl and to put a use 5.030; at the top of the file to enable all of its features.
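One example of what the version declaration buys you: use 5.030; turns on strict and the matching feature bundle, which includes unicode_strings. The sketch below contrasts uc with and without that feature on a string that doesn't have the UTF8 flag set; it needs Perl 5.30 or newer to run, and assumes no locale pragma is in effect.

```perl
use 5.030;      # enables strict and the :5.30 feature bundle (incl. unicode_strings)
use warnings;

my $e_acute = "\xE9";   # U+00E9 LATIN SMALL LETTER E WITH ACUTE, stored as one byte

# With unicode_strings, uc applies Unicode rules even without the UTF8 flag.
my $upper = uc $e_acute;

# Without the feature, uc leaves this non-ASCII byte alone.
my $old_upper;
{
    no feature 'unicode_strings';
    $old_upper = uc $e_acute;
}

printf "with unicode_strings:    U+%04X\n", ord $upper;       # U+00C9
printf "without unicode_strings: U+%04X\n", ord $old_upper;   # U+00E9
```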


In reply to Re: How to interpret characters in Devel::Peek CUR by haukex
in thread How to interpret characters in Devel::Peek CUR by ait
