Thanks a lot for this detailed answer!

I read the doc you recommended and although I knew some of the stuff in there, it is definitely a great read and clarifies part of the untold story. It also helped me understand your answer better, for example:

"the fact that they're stored internally as UTF-8 should be considered an implementation detail"

I think I found what seems to be the root cause of the issue:

We are pulling data from a SQL Server database that is encoded in CP-1252, using DBI with the MS ODBC Driver for Linux, version 13. It seems that UTF-8 data is being inserted into that SQL Server, so when we get the data back in Perl the UTF-8 flag is not set (even though some records actually contain UTF-8 encoded characters).

When we insert that data into our UTF-8 PostgreSQL database, it seems to get double encoded. Also, some of these flawed records have a null terminator at the end, which doesn't seem to affect the utf8 flag but does mess up our trimming (the SQL Server char strings are padded with whitespace).
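To make the trimming problem concrete, here is a minimal sketch. The helper name `trim_padded` and the sample value are made up for illustration; only the shape (space padding followed by a NUL) comes from the records described above. The `/a` modifier is there so that `\s` only matches ASCII whitespace and cannot eat a trailing high byte of an undecoded string.

```perl
use strict;
use warnings;

# Hypothetical helper: strip trailing whitespace padding and NULs in one pass.
# /a restricts \s to ASCII whitespace, so bytes >= 0x80 are never removed.
sub trim_padded {
    my ($s) = @_;
    $s =~ s/[\s\0]+\z//a;
    return $s;
}

# Made-up sample shaped like the flawed records: padding, then a NUL.
my $padded  = "Industrial Corp" . (" " x 20) . "\0";
my $trimmed = trim_padded($padded);

printf "%d -> %d\n", length($padded), length($trimmed);   # prints "36 -> 15"
```

A plain `s/\s+$//` stops at the NUL, which is presumably why the padding survives in the affected rows.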

Data from SQL Server

SV = PV(0x560c8bacdf90) at 0x560c8b9b7998
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x560c8bbfad00 "Triple \342\200\234S\342\200\235 Industrial Corp                    \0"\0
  CUR = 51
  LEN = 53
  COW_REFCNT = 1

Data after being stored in Postgres (and retrieved)

SV = PV(0x560c8bacdec0) at 0x560c8bb7c7b0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x560c8bbf38f0 "RAW: Triple \303\242\302\200\302\234S\303\242\302\200\302\235 Industrial Corp                    "\0 [UTF8 "RAW: Triple \x{e2}\x{80}\x{9c}S\x{e2}\x{80}\x{9d} Industrial Corp                    "]
  CUR = 61
  LEN = 63
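The byte pattern in the second dump can be reproduced without either database. This is only a sketch, and it assumes the insert path treats each unflagged byte as a Latin-1 character and encodes it again; the sample string is shortened from the dump above.

```perl
use strict;
use warnings;

# Octets shaped like the SQL Server value: UTF-8 encoded curly quotes, UTF8 flag off.
my $raw = "Triple \xE2\x80\x9CS\xE2\x80\x9D";

# Assumption: the storing layer sees 14 Latin-1 characters and encodes each one.
# utf8::encode on an unflagged string does exactly that.
my $double = $raw;
utf8::encode($double);

# Decoding first turns the octets into characters, so encoding them again
# gives back the original octets instead of a double-encoded string.
my $chars = $raw;
utf8::decode($chars) or die "not valid UTF-8";
my $single = $chars;
utf8::encode($single);

printf "raw=%d double=%d chars=%d single=%d\n",
    length($raw), length($double), length($chars), length($single);
# prints "raw=14 double=20 chars=10 single=14"
```

Each `\xE2\x80\x9C` becomes `\xC3\xA2\xC2\x80\xC2\x9C`, which is the `\303\242\302\200\302\234` sequence in the Postgres dump.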

Using utf8::decode on the string before storing it in Postgres actually solves the issue. So, knowing that the SQL Server end (which is out of our control) has UTF-8 data in a CP-1252 database, would it be so wrong to force utf8::decode on all strings? Or is there a better way to deal with this?
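One alternative to a blanket utf8::decode that I can think of is a guarded decode: try strict UTF-8 first and fall back to CP-1252 when the bytes are not valid UTF-8. This is only a sketch under the assumption that every column holds either UTF-8 or genuine CP-1252; `decode_column` is a hypothetical helper name, and CP-1252 text that happens to form valid UTF-8 would still be misread.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: strict UTF-8 decode first, CP-1252 as the fallback.
sub decode_column {
    my ($octets) = @_;
    return unless defined $octets;
    my $chars = eval {
        # FB_CROAK dies on malformed input; LEAVE_SRC keeps $octets intact.
        Encode::decode('UTF-8', $octets,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $chars if defined $chars;
    return Encode::decode('cp1252', $octets);
}

# Show the code points of a decoded string.
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $s;
}

my $utf8_row   = "\xE2\x80\x9CS\xE2\x80\x9D";   # curly quotes as UTF-8 octets
my $cp1252_row = "\x93S\x94";                    # same text as CP-1252 octets

print codepoints(decode_column($utf8_row)),   "\n";   # U+201C U+0053 U+201D
print codepoints(decode_column($cp1252_row)), "\n";   # U+201C U+0053 U+201D
```

Both rows come out as the same character string, so the value stored in Postgres would be encoded exactly once either way.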


In reply to Re^2: How to interpret characters in Devel::Peek CUR by ait
in thread How to interpret characters in Devel::Peek CUR by ait
