robv has asked for the wisdom of the Perl Monks concerning the following question:

As the sample code below shows, I can assign two (non-ASCII/Latin-1) character string values that are supposed to be the same, and the two versions are stored differently in my database. Other experiments suggest that reading the data is not the problem, it's INSERT-ing or UPDATE-ing the data where the problem arises. Can anyone shed some light on this one (or point me to a better source?)

Version info: I'm using ActiveState Perl, version 5.008008. The operating system is Windows XP Pro, SP2. The database is MS SQL Server 2000, SP3. The DBI version is 1.50. The DBD-ODBC version is 1.13. The code below runs stand-alone, e.g., not in a Web server.

(By the way, changing the column type from varchar to ncharvar doesn't affect the results.)

Here's the code. Be warned that the code contains some UTF-8 characters that are not Latin-1: If you want to test the program, just change the DBI->connect line.

use strict; use DBI; use charnames ':full'; use vars qw($trace); print "Perl version is $]\n"; # Output is "Perl version is 5.008008" my $text1 = 'Priorit' . "\N{LATIN SMALL LETTER A WITH DIAERESIS}" . 't'; my $text2 = 'Priorität'; if ($text1 eq $text2) {print 'Equal '}; if ($text1 ne $text2) {print 'Different '}; print length($text1) . ' ' . length($text2) . "\n"; # Output is "Equal 9 9" # Connect to the database and create a database handler my $dbh = DBI->connect('dbi:ODBC:dataSource', 'userID', 'password', { RaiseError => 0, AutoCommit => 0 } ) || die $DBI::errstr; do_sql(<<'--------'); CREATE TABLE dummy1 ( id1 int NOT NULL, str1 varchar(50) NOT NULL ) ON [PRIMARY] -------- do_sql("INSERT INTO dummy1 (id1, str1) VALUES(1, '$text1')"); do_sql("INSERT INTO dummy1 (id1, str1) VALUES(2, '$text2')"); # According to the SQL Query Analyzer program, the table contains: # 1, 'Priorität' (a 10 character value) # 2, 'Priorität' (a 9 character value) my $array_ref = $dbh->selectall_arrayref('SELECT * FROM dummy1'); for my $r (@$array_ref) { print ${$r}[0] . ': ' . length(${$r}[1]) . "\n"; } # Output is # 1: 10 # 2: 9 $dbh->commit; $dbh->disconnect || warn $dbh->errstr; sub do_sql { $dbh->do($_[0]); if ($dbh->errstr()) { print STDERR 'error number = ' . $dbh->err() . "\n"; print STDERR "The following SQL statement failed:\n$_[0]\n"; $dbh->disconnect || warn $dbh->errstr; die; } }

Replies are listed 'Best First'.
Re: database stores UTF8 strings inconsistently
by borisz (Canon) on Jun 22, 2006 at 21:12 UTC
    your utf8 string loose the utf8 flag. After you get it back from the database. You have to tell that the data from the db is in utf8.
    my $same_with_utf8_on = Encode::decode("utf8", $from_db);
    Or if you are sure, use
    Encode::_utf8_on($from_db);
    Boris
Re: database stores UTF8 strings inconsistently
by rhesa (Vicar) on Jun 22, 2006 at 21:31 UTC
    borisz is right, DBD::ODBC gives you the raw bytes, and doesn't turn on perl's internal utf8 flag. His answer is perfectly correct too, but in practice I've chosen to drop DBD::ODBC in favor of DBD::ADO. It does a much better job of handling unicode text. I can't tell you why, but that's my experience.

    But I should add that I needed this snippet at the top of my code to get full utf8 support:

    # for ADO usage use Win32::OLE; Win32::OLE->Option( CP => Win32::OLE::CP_UTF8 );
Re: database stores UTF8 strings inconsistently
by graff (Chancellor) on Jun 22, 2006 at 23:28 UTC
    As the sample code below shows, I can assign two (non-ASCII/Latin-1) character string values that are supposed to be the same, and the two versions are stored differently in my database.

    Based on what you have posted, I would expect that the two strings you are using are not the same: $text1 is being assigned a utf8 string value, which contains a wide character (this is stored internally as a two-byte utf8 character); but $text2 is being assigned an iso-8859 string containing a single-byte accented character.

    At least, when I look closely at the posted code, the value assigned to $text2 contains an accented character that is definitely a single byte and cannot be utf8. If you want to put literal utf8 characters in your perl script, you have to use a utf8-capable editor. Otherwise, you have to stick to using the unicode name references (like you did for $text1), or hex code points (e.g. "\xE4" for ä or "\x0103" for ă etc). update: Or you could use a non-utf8 editor, then run the script though an encoding conversion to change the iso-8859 (or cp-1252?) accented characters to utf8 wide characters.

    So you need to check and make sure that the stuff you are loading into the table is in fact encoded in a consistent manner -- if you put different encodings in, then you will obviously get different encodings back, and strings that are supposed to have the same letters will be different.

    The database ought to be agnostic as to character encoding -- you give it a string of bytes, it stores them, and you get them back when you ask for them.

    As for making sure that you have consistent encoding for all the stuff you feed to the database, I don't think you've told us enough about the problem to give an idea of how hard or easy this might be. Where is the character data coming from? (How many different sources? String literals in your script? Data files from "outsiders"? ...)

      Thanks. Your reply pointed me to an error in my thinking: I assumed that if a file contained accented characters and Japanese yen symbols that it had to be in UTF-8, not realizing that ISO-8859-1 could also support those.

      What I'm really trying to do is copy tables from a production database into XML (using DBIx) and load it into a test DB using XML::Parser. XML::Parser, however, dies with an "invalid token" error message. Chasing this down is where the confusion arose.

      Thanks again.

Re: database stores UTF8 strings inconsistently
by Moron (Curate) on Jun 23, 2006 at 11:39 UTC
    Despite the very nice theories, the situation described in the OP just couldn't have happened that way. If 'eq' says the strings are equal, they will be equal -- it is not possible that the database can 'prove' otherwise. I suggest that something has gone wrong with the gathering of evidence.

    -M

    Free your mind