in reply to How to interpret characters in Devel::Peek CUR

Please see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

The Unicode character U+201C LEFT DOUBLE QUOTATION MARK (“) is encoded in UTF-8 as the bytes e2 80 9c (\342\200\234), and the Unicode character U+201D RIGHT DOUBLE QUOTATION MARK (”) is encoded in UTF-8 as the bytes e2 80 9d (\342\200\235).
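As a quick sanity check, you can see those byte sequences from Perl itself; this is just a minimal sketch using the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode the two curly-quote characters to UTF-8 bytes and show them in hex
my $bytes = encode( 'UTF-8', "\x{201C}\x{201D}" );
printf "%v02x\n", $bytes;      # prints e2.80.9c.e2.80.9d
print length($bytes), "\n";    # prints 6 (three bytes per character)
```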

One way to think about Perl strings is that they store either a sequence of bytes or a sequence of Unicode characters. In your case, the Devel::Peek output does not include the "UTF8" flag, which means that this string is bytes, and yes, that's why you're getting a length of 30. (Update: It is important to note, however, that testing a string's UTF8 flag for anything other than debugging is a code smell - your code should normally be able to rely on the fact that it's getting strings in the correct format.)

You can decode bytes to characters or encode characters to bytes using the Encode module, or, in the case of UTF-8, use the "built-in" utf8 module (note that you don't have to put use utf8; in your code to load it; use utf8 means "this Perl source file is encoded in UTF-8", which may or may not be what you want). You can use utf8::decode($string); to decode the string you have, and then you'll see this output:

SV = PV(0x5584829062e0) at 0x558482ad2ee0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x558482b75b30 "Triple \342\200\234S\342\200\235 Industrial Corp"\0 [UTF8 "Triple \x{201c}S\x{201d} Industrial Corp"]
  CUR = 30
  LEN = 32

And length will now report 26. The UTF8 flag means that the Perl string is storing Unicode characters (the fact that they're stored internally as UTF-8 should be considered an implementation detail). Almost all Perl operators (depending on the Perl version) and many Perl modules should handle Unicode correctly.
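The same round trip with Encode's decode (which, for UTF-8, does the same job as utf8::decode but returns a new string instead of modifying its argument in place) - a minimal sketch:

```perl
use strict;
use warnings;
use Encode qw(decode);

# 30 bytes: each curly quote is 3 bytes in UTF-8
my $bytes = "Triple \342\200\234S\342\200\235 Industrial Corp";
print length($bytes), "\n";    # prints 30

# Decode the bytes into a string of Unicode characters
my $chars = decode( 'UTF-8', $bytes );
print length($chars), "\n";    # prints 26
```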

Note that it's usually best to decode data as it comes into Perl (e.g. by specifying an open mode of '<:encoding(UTF-8)') and to encode it as it leaves; having to do this manually somewhere in the middle of your code sometimes means that the source where you're getting the data from is buggy in regards to Unicode. I don't know enough about PHP::Serialization to say if that's the case here, and the PHP serialize docs don't make any mention of Unicode either. Interestingly, the PHP String docs say "PHP only supports a 256-character set, and hence does not offer native Unicode support." So my guess is that the encoding to bytes happens somewhere before the data hits the PHP string, and then serialize and PHP::Serialization simply pass those bytes through; this means you'd have to know which encoding was used to store the Unicode data into the PHP string to correctly decode it, in case it's not always UTF-8.
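Here's a minimal sketch of that "decode on the way in, encode on the way out" approach using PerlIO layers (the filename is just a placeholder):

```perl
use strict;
use warnings;

# Write characters out through an encoding layer...
open my $out, '>:encoding(UTF-8)', 'quotes.txt' or die "quotes.txt: $!";
print $out "Triple \x{201C}S\x{201D} Industrial Corp\n";
close $out or die $!;

# ...and read them back in: $line contains characters, not bytes,
# so no manual utf8::decode is needed
open my $in, '<:encoding(UTF-8)', 'quotes.txt' or die "quotes.txt: $!";
my $line = <$in>;
close $in;
chomp $line;
print length($line), "\n";    # prints 26
```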

As a general note, if you're working with Unicode it's best to be on the latest version of Perl and to put a use 5.030; at the top of the file to enable all of its features.

Re^2: How to interpret characters in Devel::Peek CUR
by ait (Hermit) on Jun 11, 2020 at 04:55 UTC

    Thanks a lot for this detailed answer!

    I read the doc you recommended and although I knew some of the stuff in there, it is definitely a great read and clarifies part of the untold story. It also helped me understand your answer better, for example:

    the fact that they're stored internally as UTF-8 should be considered an implementation detail

    I think I found what seems to be the root cause of the issue:

    We are pulling data from an SQL Server database that is encoded in CP-1252, using the DBI with the MS ODBC Driver for Linux version 13. It seems they are inserting UTF-8 data into that SQL Server, so when we get the data back in Perl the UTF8 flag is not set (even though some records actually contain UTF-8-encoded bytes).

    When we insert that data into our UTF-8 PostgreSQL database, it seems to get double encoded. Also, some of these flawed records have a null terminator at the end, which doesn't seem to affect the UTF8 flag but does mess up our trimming (the SQL Server char strings are padded with whitespace).

    Data from SQL Server

    SV = PV(0x560c8bacdf90) at 0x560c8b9b7998
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x560c8bbfad00 "Triple \342\200\234S\342\200\235 Industrial Corp                    \0"\0
      CUR = 51
      LEN = 53
      COW_REFCNT = 1

    Data after being Stored in Postgres (and retrieved)

    SV = PV(0x560c8bacdec0) at 0x560c8bb7c7b0
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x560c8bbf38f0 "RAW: Triple \303\242\302\200\302\234S\303\242\302\200\302\235 Industrial Corp                    "\0 [UTF8 "RAW: Triple \x{e2}\x{80}\x{9c}S\x{e2}\x{80}\x{9d} Industrial Corp                    "]
      CUR = 61
      LEN = 63
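For what it's worth, the \303\242\302\200\302\234 sequences in that dump are exactly what you get when UTF-8 bytes are treated as individual characters and then encoded to UTF-8 a second time; a minimal sketch of the double encoding:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The UTF-8 bytes for U+201C, in a string Perl thinks is plain bytes
my $bytes = "\342\200\234";

# Encoding those three "characters" (really bytes) to UTF-8 again
# yields the doubly encoded sequence seen in the Postgres dump
my $double = encode( 'UTF-8', $bytes );
printf "%v02x\n", $double;    # prints c3.a2.c2.80.c2.9c
```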

    Using utf8::decode on the string before storing into Postgres actually solves the issue. So knowing that the SQLServer end (which is out of our control) has UTF-8 data in a CP-1252 database would it be so wrong to force utf8::decode on all strings? Or is there a better way to deal with this?

      As I previously commented here, much depends on the way you set up the connection, and there is still room to play with server-side encodings:

      I recently worked from perl on Linux with a MS SQL server database, and got the best results with FreeTDS:

      my $dbh = DBI->connect ("dbi:ODBC:mssql_freetds", $username, $password, \%dbi_attributes);
      $ cat ~/.odbc.ini
      [mssql_freetds]
      Description    = My MS SQL database
      Driver         = FreeTDS
      TDS version    = 7.2
      Trace          = No
      Server         = mysql.server.local
      Port           = 1433
      Database       = DatabaseName
      User           = UserName
      Password       = PassWord
      Client Charset = UTF-8

      The biggest difference between FreeTDS and the MS ODBC driver is the return type of UUID fields. Also, the MS ODBC driver does not allow nested queries, whereas the FreeTDS driver does. So I used the ODBC driver to make a CSV dump of the database and the FreeTDS driver to actually work with the database.

      For ODBC I did

      my $dbh = DBI->connect ("dbi:ODBC:mssql_odbc", $username, $password, \%dbi_attributes);
      $ cat ~/.odbc.ini
      [mssql_odbc]
      Description = My MS SQL database
      Driver      = ODBC Driver 17 for SQL Server
      Server      = mysql.server.local
      Database    = DatabaseName
      User        = UserName
      Password    = PassWord

      Also make sure you put the fully qualified hostname in the server name. localhost will not work.


      Enjoy, Have FUN! H.Merijn

        Thanks Tux for your detailed response on the drivers! After reading it, I regret not having looked at FreeTDS. We have to open a bunch of separate connections for sub-queries and have had all sorts of issues and quirks with truncation, and I'm not sure if the stupid ODBC driver is doing some strange padding, etc. Anyway, we are almost done with this work and it's too late to change drivers, but I wish I had asked here for an opinion on DBI and SQL Server before starting this project :-(

      So knowing that the SQLServer end (which is out of our control) has UTF-8 data in a CP-1252 database would it be so wrong to force utf8::decode on all strings? Or is there a better way to deal with this?

      It's too bad that the server is out of your control, since that seems to be the source of the problem. But anyway, yes, I think fixing the issue as early as possible - as you pull the data off the server - is the "best" (relatively speaking) way to go about it. Two things to keep in mind: make sure that all the data really is UTF-8, and check the return value of utf8::decode(), because if that fails, there's definitely something wrong with the encoding. But keep in mind that false negatives are possible - e.g. data that is actually CP-1252 may happen to decode as valid UTF-8 as well - though that's somewhat unlikely.
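A minimal sketch of that check (note that utf8::decode modifies its argument in place and returns false if the bytes aren't well-formed UTF-8):

```perl
use strict;
use warnings;

# utf8::decode works without "use utf8;" - the module is built in
my $str = "Triple \342\200\234S\342\200\235 Industrial Corp";
if ( utf8::decode($str) ) {
    # $str now holds 26 characters instead of 30 bytes
    print "decoded ok, length is now ", length($str), "\n";
}
else {
    warn "not valid UTF-8, leaving the string alone\n";
}
```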