Re: database stores UTF8 strings inconsistently

As the sample code below shows, I can assign two (non-ASCII/Latin-1) character string values that are supposed to be the same, and the two versions are stored differently in my database.

Based on what you have posted, I would expect that the two strings you are using are not the same: $text1 is being assigned a utf8 string value, which contains a wide character (stored internally as a two-byte utf8 sequence); but $text2 is being assigned an ISO-8859-1 string containing a single-byte accented character.
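To make the difference concrete, here is a minimal sketch (not the poster's code) that serializes one letter under both encodings with the core Encode module and dumps the bytes as hex. The variable names and the choice of ä are assumptions for illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: the same letter, serialized under two encodings.
my $char = "\x{E4}";    # U+00E4, LATIN SMALL LETTER A WITH DIAERESIS

my $utf8_bytes   = encode('UTF-8',      $char);    # two bytes
my $latin1_bytes = encode('ISO-8859-1', $char);    # one byte

# unpack 'H*' renders a byte string as lowercase hex digits.
printf "UTF-8:      %s\n", unpack('H*', $utf8_bytes);      # c3a4
printf "ISO-8859-1: %s\n", unpack('H*', $latin1_bytes);    # e4

print "The two byte strings differ\n" if $utf8_bytes ne $latin1_bytes;
```

Both byte strings display as the same letter in a suitably configured viewer, but a byte-wise comparison (which is what the database does) sees them as different values.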

At least, when I look closely at the posted code, the value assigned to $text2 contains an accented character that is definitely a single byte and cannot be utf8. If you want to put literal utf8 characters in your perl script, you have to use a utf8-capable editor. Otherwise, you have to stick to using the unicode name references (like you did for $text1), or hex code points (e.g. "\xE4" for ä or "\x{0103}" for ă; code points above 0xFF need the braces). Update: Or you could use a non-utf8 editor, then run the script through an encoding conversion to change the iso-8859 (or cp-1252?) accented characters to utf8 wide characters.
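As a rough sketch of those options, the snippet below writes the same character by name and by hex code point, then converts a single ISO-8859-1 byte to utf8 with the core Encode module. The variable names are made up for the example, and it assumes the source bytes really are ISO-8859-1 (for cp1252 input you would pass 'cp1252' to decode instead).

```perl
use strict;
use warnings;
use charnames ':full';           # needed for \N{...} on perls older than 5.16
use Encode qw(decode encode);

# Two editor-independent ways to write the same wide character.
my $by_name = "\N{LATIN SMALL LETTER A WITH BREVE}";
my $by_hex  = "\x{0103}";        # braces required above 0xFF
print "same character\n" if $by_name eq $by_hex;
printf "code point: U+%04X\n", ord $by_hex;

# Converting what a non-utf8 editor would have saved:
# one ISO-8859-1 byte (E4) becomes the two-byte utf8 sequence.
my $latin1_bytes = "\xE4";
my $utf8_bytes   = encode('UTF-8', decode('ISO-8859-1', $latin1_bytes));
printf "%s -> %s\n", unpack('H*', $latin1_bytes), unpack('H*', $utf8_bytes);
```

For converting a whole script file rather than one string, a command-line converter such as iconv would do the same job.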

So you need to check and make sure that the stuff you are loading into the table is in fact encoded in a consistent manner -- if you put different encodings in, then you will obviously get different encodings back, and strings that are supposed to have the same letters will be different.
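One way to do that check, sketched here under the assumption that the target encoding is utf8, is to attempt a strict decode of each value before loading it. The helper name is_valid_utf8 is hypothetical; decode and FB_CROAK come from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if the byte string decodes cleanly as UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;    # a copy, since decode() may modify its input when a CHECK value is given
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

# The same letter as UTF-8 bytes and as a lone ISO-8859-1 byte.
for my $sample ("\xC3\xA4", "\xE4") {
    printf "%s: %s\n", unpack('H*', $sample),
        is_valid_utf8($sample)
            ? 'valid UTF-8'
            : 'not UTF-8 (perhaps ISO-8859-1 or cp1252)';
}
```

This is only a heuristic: plain ASCII passes under either encoding, and some ISO-8859-1 byte sequences happen to form valid UTF-8, so a passing check does not prove where the data came from.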

The database ought to be agnostic as to character encoding -- you give it a string of bytes, it stores them, and you get them back when you ask for them.

As for making sure that you have consistent encoding for all the stuff you feed to the database, I don't think you've told us enough about the problem to give an idea of how hard or easy this might be. Where is the character data coming from? (How many different sources? String literals in your script? Data files from "outsiders"? ...)

Re^2: database stores UTF8 strings inconsistently
by robv (Novice) on Jun 23, 2006 at 15:19 UTC

Thanks. Your reply pointed me to an error in my thinking: I assumed that a file containing accented characters and Japanese yen symbols had to be in UTF-8, not realizing that ISO-8859-1 also supports those.

What I'm really trying to do is copy tables from a production database into XML (using DBIx) and load it into a test DB using XML::Parser. XML::Parser, however, dies with an "invalid token" error message. Chasing this down is where the confusion arose.

Thanks again.
