
So what did you do to cross-check that you're actually sending UTF-8 to MySQL and receiving UTF-8 from it? What data do you get in Perl when you insert a row using PHP? What data do you get in PHP when you insert a row using Perl? Try to eliminate the mysql client as a potential source of confusion: it might not output UTF-8 properly (I say this without knowing the mysql client well). The question marks look to me like ignored or escaped Unicode characters.
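One way to take the terminal and the mysql client out of the picture is to look at the code values of a string instead of printing it raw. The following is only a minimal sketch: `describe_string` is a hypothetical helper name, and the two literals stand in for values you would actually fetch via DBI.

```perl
use strict;
use warnings;

# Hypothetical helper: describe a scalar by its length and the hex value of
# each element, so no terminal or client encoding can distort what you see.
sub describe_string {
    my ($s) = @_;
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $s;
    return sprintf 'length=%d hex=[%s]', length $s, $hex;
}

# A decoded character string: one element per character.
print describe_string("caf\x{e9}"), "\n";

# The same text as undecoded UTF-8 bytes: the e-acute takes two elements.
print describe_string("caf\xc3\xa9"), "\n";
```

If a value fetched from the database shows `C3 A9` where you expected a single `E9`, Perl received undecoded UTF-8 bytes rather than characters.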

Re^4: Inserting UTF-8 on Mysql using DBI
by Fox (Pilgrim) on Oct 16, 2010 at 12:48 UTC
    I found the problem: as it turns out, neither the mysql client nor PHP was using UTF-8 after all, only Perl... ugh. One thing I still don't understand, though, is why the characters were being displayed correctly in the browser despite the fact that the page always had the UTF-8 Content-Type header. I guess I understand charset encoding even less now.

      Browsers really like to make a "best effort" at guessing the encoding of the content, even if they have to deviate from the Content-Type: text/html; charset=utf-8 header. That is why eliminating all intermediaries and cross-checking all steps is the only approach I know that works.

      Smells like "the other" programs inserted UTF-8 byte streams that luckily came back unmodified from MySQL. So you could insert and fetch something that looked like UTF-8, even when MySQL converted the byte stream from what it thought to be ISO-8859-1 to broken UTF-8 while inserting, and back from broken UTF-8 to ISO-8859-1. A big hint for such things going wrong is that the strings have the wrong length in the database (one or two extra characters for each non-ASCII character). Have a look at the Unicode tests in DBD::ODBC, especially t/40UnicodeRoundTrip.t and t/41Unicode.t.
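      The round trip described above can be reproduced without a database. This is only a sketch under assumptions: Encode's ISO-8859-1 stands in for whatever MySQL actually used as the connection character set, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string with one non-ASCII character (U+00E9, e with acute).
my $text = "caf\x{e9}";

# What a UTF-8 aware program sends: 5 bytes for 4 characters.
my $utf8_bytes = encode('UTF-8', $text);

# Simulated misconfigured connection: the server assumes the incoming bytes
# are ISO-8859-1 and converts them to UTF-8 for storage (double encoding).
my $stored_bytes = encode('UTF-8', decode('ISO-8859-1', $utf8_bytes));

# What a correctly configured UTF-8 client would read from that column.
my $as_seen_by_utf8_client = decode('UTF-8', $stored_bytes);

# What the misconfigured client gets back: the server converts the stored
# data to ISO-8859-1 again, which happens to restore the original bytes.
my $returned_bytes = encode('ISO-8859-1', decode('UTF-8', $stored_bytes));

printf "characters in original: %d\n", length $text;
printf "characters as stored:   %d\n", length $as_seen_by_utf8_client;
printf "round trip identical:   %s\n",
    $returned_bytes eq $utf8_bytes ? 'yes' : 'no';
```

      For this input the original has 4 characters and the stored value has 5, yet the misconfigured round trip returns byte-identical data. That is why the problem stays invisible until a correctly configured client reads the same rows.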

      The browser shows the correct characters because you told it explicitly to do so: There is a UTF-8 byte stream in the HTML resource delivered by the server, and the HTML resource (or its headers) says that it is encoded as UTF-8. It simply does not matter that the software generating the page accidentally or intentionally wrote that byte stream as what it thought to be ISO-8859-1 characters.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)