in reply to Fine when written to text file, but unreadable when written to database table

It looks like the "ó" character is probably being stored in the database as a two-byte utf8 "wide" character. When "ó" is encoded in utf8, it becomes the two-byte sequence "\xC3\xB3"; if you look those bytes up in a Latin-1 chart, you'll see that, treated as two separate single-byte characters, they come out as "Ã" and "³" (superscript three).
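You can see the effect for yourself with a couple of lines of Perl -- just a sketch, decoding the same two bytes under each interpretation:

    use strict;
    use warnings;
    use Encode qw(decode);

    my $bytes = "\xC3\xB3";                          # the utf8 encoding of "ó"

    my $as_utf8   = decode( "utf8",       $bytes );  # one character
    my $as_latin1 = decode( "iso-8859-1", $bytes );  # two characters

    binmode( STDOUT, ":encoding(UTF-8)" );
    printf "utf8:    %s  (%d char)\n",  $as_utf8,   length $as_utf8;
    printf "latin1:  %s  (%d chars)\n", $as_latin1, length $as_latin1;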

So somewhere in your setup, you are storing strings in utf8, and then somewhere else, you are treating them as if they were not utf8, but rather some single-byte encoding such as iso-8859-1 or cp1252.

You haven't given us enough information to tell where the problem is. Maybe the utf8 string is contained in the web page that you fetch and is being stored in the database as the two-byte sequence. When you read that back from the database, the two-byte utf8 character might be getting displayed "as-is" on an 8859-1 or cp1252 display, or it could be that the two bytes are each being "upgraded" to utf8 characters and you're seeing "Ã" and "³" on a utf8 display.

Whatever the problem, you just need to be explicit about what encoding is being used at each step of your process, and maybe do some encoding "conversions" at appropriate points.
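For example, here is a rough sketch of what "being explicit" at each boundary might look like (the variable names, the output file name, and the cp850 target are all placeholders -- substitute whatever your setup actually uses):

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # pretend these are the raw bytes fetched from the web page (utf8 "ó"):
    my $raw_bytes = "\xC3\xB3";

    # decode the bytes into Perl characters as soon as they arrive:
    my $text = decode( "utf8", $raw_bytes );

    # when writing to a text file, say what encoding the filehandle uses:
    open( my $out, ">:encoding(UTF-8)", "output.txt" ) or die "open: $!";
    print $out $text;
    close $out;

    # when storing to the database, encode to whatever the table expects
    # (cp850 here is only a guess -- use your table's actual encoding):
    my $for_db = encode( "cp850", $text );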

If the database contains utf8 strings, and you use a Perl script to read stuff back from the database, Perl probably won't be able to know automatically that the string contains utf8 "wide" characters, and you'll need to use Encode to make that explicit:

    use Encode;

    # assume the $string contains a value fetched from the database:
    $string = decode( "utf8", $string );   # sets the "utf8-flag" on $string
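As a quick way to check that the decode actually did something, compare how Perl counts the string before and after (a sketch with a made-up value):

    use Encode qw(decode);

    my $string = "adi\xC3\xB3s";            # raw utf8 bytes, as fetched
    print length($string), "\n";            # 6 -- Perl is counting bytes

    $string = decode( "utf8", $string );    # mark it as character data
    print length($string), "\n";            # 5 -- now counting characters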
If that doesn't help, and you can't figure out what really needs to be done, you'll need to give us more information: What OS are you using, and are you using a utf8-based locale? What are you using to view the text data? Can you confirm whether the string is being stored in the database as utf8?

Re^2: Fine when written to text file, but unreadable when written to database table
by Kanishka (Beadle) on Oct 16, 2006 at 07:59 UTC
    I use Windows 2000 as my platform.

    I use MSSQL.

    The collation used there is 'SQL_Latin1_General_CP850_CI_AI'.

    The downloaded file is an XML file. It uses 'encoding="UTF-8"'.
      In that case, you should either convert the strings from utf8 to cp850 before you store them to the database, or else you should convert them after fetching them back from the database, before you print them to a file or display them. (See the "from_to" function in Encode.)
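      Something along these lines, as a sketch (the variable name is just a placeholder for whatever you pulled out of the XML):

          use Encode qw(from_to);

          my $value = "adi\xC3\xB3s";           # utf8 bytes for "adiós"
          from_to( $value, "utf8", "cp850" );   # $value is now cp850 bytes
          # ... now insert $value into the database ...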

      Whatever you do, make sure the database content always has the same encoding for all text data. Mixing different encodings into a single database would be as bad as mixing them in a single paragraph -- it becomes impossible (or at least terribly difficult) to make the data coherent.

      (It is possible to have a table with different fields using different encodings; you could even have pairs of fields, like "name" and "name_encoding" so that the encoding of "name" is specified for each row, but that's more trouble than it's worth. Keep it simple.)