in reply to Re: How to interpret characters in Devel::Peek CUR
in thread How to interpret characters in Devel::Peek CUR
Thanks a lot for this detailed answer!
I read the doc you recommended and although I knew some of the stuff in there, it is definitely a great read and clarifies part of the untold story. It also helped me understand your answer better, for example:
the fact that they're stored internally as UTF-8 should be considered an implementation detail
I think I found what seems to be the root cause of the issue:
We are pulling data from an SQL Server database that is encoded in CP-1252, and we are using DBI with the MS ODBC Driver for Linux version 13. It seems they are inserting UTF-8 data into that SQL Server, so when we get the data back in Perl the UTF8 flag is not set (even though some records actually contain UTF-8 encoded characters).
When we insert that data into our UTF-8 PostgreSQL database, it seems to get double encoded. Also, some of these flawed records have a null terminator at the end too, which doesn't seem to affect the UTF8 flag but does mess up our trimming (the SQL Server char strings are padded with whitespace).
Data from SQL Server
SV = PV(0x560c8bacdf90) at 0x560c8b9b7998
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x560c8bbfad00 "Triple \342\200\234S\342\200\235 Industrial Corp \0"\0
  CUR = 51
  LEN = 53
  COW_REFCNT = 1
Data after being Stored in Postgres (and retrieved)
SV = PV(0x560c8bacdec0) at 0x560c8bb7c7b0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x560c8bbf38f0 "RAW: Triple \303\242\302\200\302\234S\303\242\302\200\302\235 Industrial Corp "\0 [UTF8 "RAW: Triple \x{e2}\x{80}\x{9c}S\x{e2}\x{80}\x{9d} Industrial Corp "]
  CUR = 61
  LEN = 63
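The two dumps above are consistent with the UTF-8 bytes being encoded a second time. A minimal sketch that reproduces the effect with the core Encode module, assuming the string literal below stands in for what the driver returns (padding and the trailing NUL left out for brevity):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Bytes as they arrive from SQL Server: UTF-8 octets, UTF8 flag off.
# (Stand-in literal; the real value comes from the DBI fetch.)
my $from_mssql = "Triple \342\200\234S\342\200\235 Industrial Corp";

# Encoding an undecoded string treats every byte as one Latin-1
# character, so each byte >= 0x80 becomes two bytes: double encoding.
my $double = encode('UTF-8', $from_mssql);

# Decoding first turns the octets into characters; encoding those
# characters gives back the original bytes.
my $chars  = decode('UTF-8', $from_mssql);
my $single = encode('UTF-8', $chars);

printf "raw: %d bytes, chars: %d, double: %d bytes, single: %d bytes\n",
    length($from_mssql), length($chars), length($double), length($single);
```

The six high bytes of the two curly quotes each grow to two bytes in `$double`, which matches the `\303\242\302\200\302\234` pattern in the Postgres dump.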
Using utf8::decode on the string before storing it into Postgres actually solves the issue. So, knowing that the SQL Server end (which is out of our control) has UTF-8 data in a CP-1252 database, would it be so wrong to force utf8::decode on all strings? Or is there a better way to deal with this?
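For reference, a minimal sketch of a more defensive variant of "decode everything", assuming the column may hold either UTF-8 or genuine CP-1252 bytes. The helper name `clean_mssql_string` is hypothetical, and the CP-1252 fallback is my assumption about what the non-UTF-8 records contain:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: clean one CHAR column value fetched from SQL Server.
sub clean_mssql_string {
    my ($bytes) = @_;
    return $bytes unless defined $bytes;    # pass NULL columns through

    # Strip trailing NULs and ASCII padding. The explicit class avoids
    # ever matching bytes like \xA0, which can end a UTF-8 sequence.
    $bytes =~ s/[\0\x20\t\r\n]+\z//;

    # Strict UTF-8 decode: croaks on malformed input instead of guessing.
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $chars if defined $chars;

    # Assumed fallback: not valid UTF-8, so treat it as CP-1252.
    return decode('cp1252', $bytes);
}

my $name = clean_mssql_string(
    "Triple \342\200\234S\342\200\235 Industrial Corp    \0");
printf "%d characters after cleaning\n", length $name;
```

The strict decode is what keeps this from silently mangling records that really are CP-1252. A CP-1252 string that happens to also be valid UTF-8 would still be misread, but that should be rare for real text.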
Replies are listed 'Best First'.
Re^3: How to interpret characters in Devel::Peek CUR
by Tux (Canon) on Jun 11, 2020 at 12:52 UTC
by ait (Hermit) on Jun 16, 2020 at 15:13 UTC
Re^3: How to interpret characters in Devel::Peek CUR
by haukex (Archbishop) on Jun 11, 2020 at 10:03 UTC