Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks.

I've used perl in the past and know my way around a bit. But now I've encountered something I don't comprehend any more. Either I am doing something wrong or its a bug. I would apreciate some enlightenment on this example.

I am running Debian Jessie. German UTF-8 Locale. The problem seems to be at least similar on Debian Wheezy. I upgraded before writing this test.

This is perl 5, version 20, subversion 2 (v5.20.2) built for i586-linu +x-gnu-thread-multi-64int (with 42 registered patches, see perl -V for more detail)

My code is as follows:

#!/usr/bin/perl use DBI; use strict; use warnings; my $dbh; sub onevaluesql($) { my ($command) = @_; my $sth=$dbh->prepare($command); $sth->execute(); my ($value) = $sth->fetchrow_array(); $sth->finish(); return $value; } sub test($$) { my ($testname,$test) = @_; $dbh->do("insert into test(name,test) values (?,?);",undef,$testname +,$test); my $resulttest = onevaluesql("select test from test where name='$testname';"); if($test eq $resulttest) { print "$testname is successful on DBI readback.\n"; } else { print "$testname failed on DBI readback. Got $resulttest where $te +st was expected.\n"; } $dbh->commit(); my $resulttest2 = `echo "select test from test where name='$testname +';" | psql --html`; print "psql raw results: $resulttest2\n"; if($resulttest2 =~ s/<td align="left">([^<]*)</td>/) { $resulttest2=$1; } if($test eq $resulttest2) { print "$testname is successful on psql readback.\n"; } else { print "$testname failed on psql readback. Got $resulttest where $t +est was expected.\n"; } } $dbh = DBI->connect("dbi:Pg:dbname=articles", "", "", {AutoCommit => 0 +, RaiseError => 1}); print "DBI is version $DBI::VERSION, DBD::Pg is version $DBD::Pg::VERS +ION\n"; print "client_encoding=" . onevaluesql("show client_encoding;") . ", s +erver_encoding=" . onevaluesql("show server_encoding;") . "\n"; system "locale"; print "-------------------\n"; $dbh->do("drop table if exists test;"); $dbh->do("create table test ( name text, test text );"); $dbh->commit(); test('test1','ascii'); test('test2','äöüßÄÖÜ'); $dbh->disconnect();
My output here is:
DBI is version 1.631, DBD::Pg is version 3.4.2 client_encoding=UTF8, server_encoding=UTF8 LANG=de_DE.UTF-8 LANGUAGE= LC_CTYPE="de_DE.UTF-8" LC_NUMERIC="de_DE.UTF-8" LC_TIME="de_DE.UTF-8" LC_COLLATE="de_DE.UTF-8" LC_MONETARY="de_DE.UTF-8" LC_MESSAGES="de_DE.UTF-8" LC_PAPER="de_DE.UTF-8" LC_NAME="de_DE.UTF-8" LC_ADDRESS="de_DE.UTF-8" LC_TELEPHONE="de_DE.UTF-8" LC_MEASUREMENT="de_DE.UTF-8" LC_IDENTIFICATION="de_DE.UTF-8" LC_ALL= ------------------- test1 is successful on DBI readback. psql raw results: <table border="1"> <tr> <th align="center">test</th> </tr> <tr valign="top"> <td align="left">ascii</td> </tr> </table> <p>(1 Zeile)<br /> </p> test1 is successful on psql readback. test2 is successful on DBI readback. psql raw results: <table border="1"> <tr> <th align="center">test</th> </tr> <tr valign="top"> <td align="left">äöüÃÃÃÃ</td> </tr> </table> <p>(1 Zeile)<br /> </p> test2 failed on psql readback. Got äöüßÄÖÜ where äöüßÄÖÜ was expected.

Can you monks tell me where i went wrong? What is happing here? Somehow the String is double utf-8 encoded in the database and perl partly seems to mask that after regexp operations? My goal is to insert utf-8 encoded strings into postgres and the strings should be the same there. In theorie that shouldn't be too hard, i keep failing at that...

Replies are listed 'Best First'.
Re: Oddities in UTF-8 and DBD::Pg?
by afoken (Chancellor) on Jun 28, 2015 at 07:32 UTC

    Have a look at the test programs 40UnicodeRoundTrip.t and 41Unicode.t of DBD::ODBC, and especially at the dumpstr() function in the helper module UChelp.pm. Most likely the strings you pass to DBI are not Unicode strings, but byte sequences representing UTF-8, which are interpreted as legacy character strings.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Oddities in UTF-8 and DBD::Pg?
by Anonymous Monk on Jun 28, 2015 at 06:18 UTC
    Hmmm, what happens when you add use utf8; to your program?
      Okay, I changed 2 things. use utf8; at the start. And i fixed the error in line 36:
      print "$testname failed on psql readback. Got $resulttest2 where $test + was expected.\n";
      Now I get this, which I must admit, looks better:
      ... test2 is successful on DBI readback. psql raw results: <table border="1"> <tr> <th align="center">test</th> </tr> <tr valign="top"> <td align="left">äöüßÄÖÜ</td> </tr> </table> <p>(1 Zeile)<br /> </p> test2 failed on psql readback. Got äöüßÄÖÜ where &#65533;&#65533;&#655 +33;&#65533;&#65533;&#65533;&#65533; was expected.
      I don't understand why I get the question marks in the variable test, but my problem seems to be fixed. Many thanks. I was missing the obvious it seems. ;o)
        I don't understand why I get the question marks in the variable test, but my problem seems to be fixed.
        The output of backticks should also be decoded before comparing with unicode strings. I think use open qw( :encoding(utf-8) :std ); does that, and also switches stdout to utf-8 mode (stdout being in binary mode is probably what causes 'question marks').
Re: Oddities in UTF-8 and DBD::Pg?
by Anonymous Monk on Jun 28, 2015 at 09:18 UTC