in reply to Re: How to interpret characters in Devel::Peek CUR
in thread How to interpret characters in Devel::Peek CUR

Thanks a lot for this detailed answer!

I read the doc you recommended and although I knew some of the stuff in there, it is definitely a great read and clarifies part of the untold story. It also helped me understand your answer better, for example:

the fact that they're stored internally as UTF-8 should be considered an implementation detail

I think I found what seems to be the root cause of the issue:

We are pulling data from a SQL Server database that is encoded in CP-1252, and we are using DBI with the MS ODBC Driver for Linux, version 13. It seems they are inserting UTF-8 data into that SQL Server, so when we get the data back in Perl the UTF-8 flag is not set (even though some records actually contain UTF-8 encoded characters).

When we insert that data into our UTF-8 PostgreSQL database, it seems to get double encoded. Also, some of these flawed records have a null terminator at the end, which doesn't seem to affect the UTF-8 flag but does mess up our trimming (the SQL Server char strings are padded with whitespace).
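For the trimming problem, a minimal sketch of a helper that strips both the whitespace padding and a stray trailing NUL in one pass. The name `trim_padded` is hypothetical and not part of DBI or any driver; it assumes the padding is plain ASCII whitespace, as in the dump shown in this post.

```perl
use strict;
use warnings;

# Hypothetical helper: remove trailing CHAR padding (ASCII whitespace)
# and any NUL bytes that trail the value.
sub trim_padded {
    my ($value) = @_;
    return $value unless defined $value;
    $value =~ s/[ \t\r\n\0]+\z//;   # explicit ASCII set, so it behaves the same on bytes and characters
    return $value;
}

my $raw = "Industrial Corp    \0";
printf "[%s]\n", trim_padded($raw);   # prints "[Industrial Corp]"
```

Because only ASCII bytes are removed, this can be applied either before or after decoding the string.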

Data from SQL Server

SV = PV(0x560c8bacdf90) at 0x560c8b9b7998
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x560c8bbfad00 "Triple \342\200\234S\342\200\235 Industrial Corp \0"\0
  CUR = 51
  LEN = 53
  COW_REFCNT = 1

Data after being Stored in Postgres (and retrieved)

SV = PV(0x560c8bacdec0) at 0x560c8bb7c7b0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x560c8bbf38f0 "RAW: Triple \303\242\302\200\302\234S\303\242\302\200\302\235 Industrial Corp "\0 [UTF8 "RAW: Triple \x{e2}\x{80}\x{9c}S\x{e2}\x{80}\x{9d} Industrial Corp "]
  CUR = 61
  LEN = 63
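The byte sequences in the two dumps can be reproduced in isolation. The sketch below assumes that, somewhere on the way into Postgres, the undecoded string is treated as Latin-1 characters and encoded to UTF-8; that is one plausible mechanism, not a confirmed description of what the driver does. The helper `octal_bytes` is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustration helper (hypothetical): show each byte of a string in octal.
sub octal_bytes {
    my ($str) = @_;
    return join ' ', map { sprintf '%03o', ord($_) } split //, $str;
}

# Raw UTF-8 bytes of U+201C (left double quotation mark), as in the first dump.
my $bytes = "\342\200\234";

# Undecoded: Perl sees three characters (U+00E2, U+0080, U+009C),
# so encoding them as UTF-8 yields six bytes.
my $double = encode('UTF-8', $bytes);
print "undecoded: ", octal_bytes($double), "\n";   # 303 242 302 200 302 234

# Decoded first: one character, which encodes back to the original three bytes.
my $chars = $bytes;
utf8::decode($chars) or die "not valid UTF-8";
my $single = encode('UTF-8', $chars);
print "decoded:   ", octal_bytes($single), "\n";   # 342 200 234
```

The six-byte result matches the `\303\242\302\200\302\234` sequence in the second dump, which is consistent with the data being encoded a second time.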

Using utf8::decode on the string before storing into Postgres actually solves the issue. So knowing that the SQLServer end (which is out of our control) has UTF-8 data in a CP-1252 database would it be so wrong to force utf8::decode on all strings? Or is there a better way to deal with this?

Re^3: How to interpret characters in Devel::Peek CUR
by Tux (Canon) on Jun 11, 2020 at 12:52 UTC

    As I previously commented here, much depends on the way you set up the connection, and there is still room to play with server-side encodings:

    I recently worked from perl on Linux with a MS SQL server database, and got the best results with FreeTDS:

    my $dbh = DBI->connect ("dbi:ODBC:mssql_freetds", $username, $password, \%dbi_attributes);

    $ cat ~/.odbc.ini
    [mssql_freetds]
    Description = My MS SQL database
    Driver = FreeTDS
    TDS version = 7.2
    Trace = No
    Server = mysql.server.local
    Port = 1433
    Database = DatabaseName
    User = UserName
    Password = PassWord
    Client Charset = UTF-8

    The biggest difference between FreeTDS and the MS ODBC driver is the return type of UUID fields. In addition, the MS ODBC driver does not allow nested queries, whereas the FreeTDS driver does. So I used the ODBC driver to make a CSV dump of the database and the FreeTDS driver to actually work with the database.

    For ODBC I did

    my $dbh = DBI->connect ("dbi:ODBC:mssql_odbc", $username, $password, \%dbi_attributes);

    $ cat ~/.odbc.ini
    [mssql_odbc]
    Description = My MS SQL database
    Driver = ODBC Driver 17 for SQL Server
    Server = mysql.server.local
    Database = DatabaseName
    User = UserName
    Password = PassWord

    Also make sure you put the fully qualified hostname in the server name. localhost will not work.


    Enjoy, Have FUN! H.Merijn

      Thanks, Tux, for your detailed response on the drivers! After reading it, I regret not having looked at FreeTDS. We have to open a bunch of separate connections for sub-queries, we have had all sorts of issues and quirks with truncation, and we are not sure whether the stupid ODBC driver is doing some strange padding, etc. Anyway, we are almost done with this work and it is too late to change drivers, but I wish I had asked here for opinions on DBI and SQL Server before starting this project :-(

Re^3: How to interpret characters in Devel::Peek CUR
by haukex (Archbishop) on Jun 11, 2020 at 10:03 UTC
    So knowing that the SQLServer end (which is out of our control) has UTF-8 data in a CP-1252 database would it be so wrong to force utf8::decode on all strings? Or is there a better way to deal with this?

    It's too bad that the server is out of your control, since that seems to be the source of the problem. But anyway, yes, I think fixing the issue as early as possible, as you pull the data off the server, is the "best" (relatively) way to go about it. Two things to keep in mind: make sure that all the data really is UTF-8, and check the return value of utf8::decode(), because if that fails, then there's definitely something wrong with the encoding. Also note that false negatives (e.g. data that is actually CP-1252 but also decodes as UTF-8) are possible, though somewhat unlikely.
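A minimal sketch of the checked decode described above. The helper name `decode_utf8_checked` is hypothetical; what to do with values that fail (log, skip, or treat as CP-1252) is left to the caller and is an assumption of this example, not something the thread settles.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the decoded character string,
# or undef if the bytes are not well-formed UTF-8.
sub decode_utf8_checked {
    my ($bytes) = @_;
    return undef unless defined $bytes;
    my $copy = $bytes;   # utf8::decode works in place, so decode a copy
    return utf8::decode($copy) ? $copy : undef;
}

for my $raw ("Triple \342\200\234S\342\200\235", "caf\351") {
    my $text = decode_utf8_checked($raw);
    if (defined $text) {
        printf "decoded OK, %d characters\n", length $text;
    }
    else {
        warn "value is not valid UTF-8, needs a closer look\n";
    }
}
```

The second sample value is a CP-1252/Latin-1 byte string that is not valid UTF-8, so it is reported rather than silently passed through. This check cannot catch the false-negative case mentioned above, where CP-1252 bytes happen to form valid UTF-8.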