in reply to database stores UTF8 strings inconsistently
Based on what you have posted, I would expect that the two strings you are using are not the same: $text1 is being assigned a utf8 string value, which contains a wide character (this is stored internally as a two-byte utf8 character); but $text2 is being assigned an iso-8859 string containing a single-byte accented character.
At least, when I look closely at the posted code, the value assigned to $text2 contains an accented character that is definitely a single byte and cannot be utf8. If you want to put literal utf8 characters in your Perl script, you have to use a utf8-capable editor. Otherwise, you have to stick to using the Unicode name references (like you did for $text1), or hex code points (e.g. "\xE4" for ä or "\x{0103}" for ă; note that code points above 0xFF need the braces, since "\x0103" would be read as "\x01" followed by the literal characters "03"). update: Or you could use a non-utf8 editor, then run the script through an encoding conversion to change the iso-8859 (or cp-1252?) accented characters to utf8 wide characters.
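To make the difference concrete, here is a minimal sketch of the escape forms mentioned above, plus a byte count for the same character in each encoding. The variable names are made up for illustration and are not taken from the original poster's script.

```perl
use strict;
use warnings;
use charnames ':full';   # required for \N{...} on older perls; harmless on newer ones
use Encode qw(encode);

# Three ways to write "ä" (U+00E4) without a utf8-capable editor.
my $by_name  = "\N{LATIN SMALL LETTER A WITH DIAERESIS}";
my $by_hex   = "\xE4";
my $by_brace = "\x{E4}";

# Code points above 0xFF need the braced form.
my $a_breve = "\x{0103}";   # ă

# The same character takes a different number of bytes in each encoding.
my $utf8_bytes   = encode('UTF-8',      $by_name);
my $latin1_bytes = encode('iso-8859-1', $by_name);

printf "utf8: %d byte(s), latin1: %d byte(s)\n",
    length($utf8_bytes), length($latin1_bytes);
# prints "utf8: 2 byte(s), latin1: 1 byte(s)"
```

As Perl character strings, $by_name, $by_hex and $by_brace all compare equal with eq; they only diverge once they are turned into bytes under different encodings, which is what ends up in the table.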
So you need to check and make sure that the stuff you are loading into the table is in fact encoded in a consistent manner -- if you put different encodings in, then you will obviously get different encodings back, and strings that are supposed to have the same letters will be different.
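One way to enforce that consistency, sketched under assumptions: normalize every incoming byte string to utf8 before it goes into the table. The helper name to_utf8_bytes is hypothetical, and the fallback to cp1252 is a guess about what your non-utf8 data actually is; substitute whatever encoding your sources really use.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes raw bytes and returns UTF-8 encoded bytes.
# Assumption: input is either valid UTF-8 already, or cp1252.
sub to_utf8_bytes {
    my ($bytes) = @_;

    # Strict decode: croaks on malformed UTF-8. LEAVE_SRC keeps $bytes intact.
    my $chars = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };

    # Not valid UTF-8, so fall back to the assumed legacy encoding.
    $chars = decode('cp1252', $bytes) unless defined $chars;

    return encode('UTF-8', $chars);
}

my $from_utf8   = to_utf8_bytes("\xC3\xA4");   # ä already in UTF-8
my $from_legacy = to_utf8_bytes("\xE4");       # ä as a single byte

print "consistent\n" if $from_utf8 eq $from_legacy;
```

This is a heuristic, not a guarantee: some cp1252 byte sequences happen to be valid utf8 as well, so knowing the encoding of each source up front is more reliable than guessing per string.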
The database ought to be agnostic as to character encoding -- you give it a string of bytes, it stores them, and you get them back when you ask for them.
As for making sure that you have consistent encoding for all the stuff you feed to the database, I don't think you've told us enough about the problem to give an idea of how hard or easy this might be. Where is the character data coming from? (How many different sources? String literals in your script? Data files from "outsiders"? ...)
Re^2: database stores UTF8 strings inconsistently
by robv (Novice) on Jun 23, 2006 at 15:19 UTC