in reply to Re^3: UTF-8 problem, some chars appear as \x..
in thread UTF-8 problem, some chars appear as \x..

Thank you for your reply.

Sorry for not being clear with the problem.

99.9% of all non-ASCII characters come out correctly.

On some (not all) pages, one or two characters (not always the same ones) appear as \x.. escapes, where .. is the hex value of a byte. E.g. "Гла\xD0\xB2ная", "Се\xD1\x80вис" rather than "Главная", "Сервис" (in Russian).

This happens to data from either MySQL or TT templates, even when no CGI parameters have been passed and on pages that do not display data from the database. On some pages the same words (from the same sources) come out correctly.

All MySQL tables have utf8 charset and collation. I do 'set names utf8 collate utf8_general_ci' upon connection. AFAIK, MySQL returns correct UTF-8 data but doesn't set UTF-8 flag. What are other options to fix this?

For the client I set the UTF-8 encoding in the HTML files and send it along with HTTP header. Browser uses UTF-8 to display the data.

I have tried playing with the STDOUT layers, one at a time:

binmode STDOUT => ':raw';
binmode STDOUT => ':utf8';
binmode STDOUT => ':encoding(utf8)';

Somehow, each of these fixes the characters on some pages but breaks them on other pages.

Re^5: UTF-8 problem, some chars appear as \x..
by graff (Chancellor) on Feb 20, 2007 at 06:37 UTC
    On some (not all) pages, one or two characters (not always the same ones) appear as \x.. escapes, where .. is the hex value of a byte. E.g. "Гла\xD0\xB2ная", "Се\xD1\x80вис" rather than "Главная", "Сервис" (in Russian).

    This happens to data from either MySQL or TT templates, even when no CGI parameters have been passed and on pages that do not display data from the database. On some pages the same words (from the same sources) come out correctly.

    Hrm. I think that helps. My first guess would be that there's a problem with buffering. The pairs of "\x.." characters you cited turn out to be the utf8 sequences for the intended Russian characters ("\xD0\xB2" is the utf8 sequence for Cyrillic "в", and "\xD1\x80" is the sequence for Cyrillic "р"). (updated to paste the correct Cyrillic character for "\xD0\xB2")

    So the problem would seem to be that, when your text is conveyed to the browser for display, there's some sort of interruption in the stream that causes the browser to treat a seemingly random utf8 character (byte pair) as if it were two unconnected bytes. But I don't see any clue yet about where that interruption is happening.

    Since the problem appears even on data that does not come from MySQL, you can probably assume that the database and DBI are not involved.

    You might start by setting the "$|" variable to non-zero, to turn off output buffering, and see if that makes a difference. If it doesn't, you'll need to designate a particular cgi request that always returns the same stream of web-page text; hopefully, when you run that request and see corrupted characters, they will always be at the same point in the page every time you repeat the request. (If they aren't, it gets harder to figure out what's going on.)

    Next, try command-line usage of the script, so you can redirect its STDOUT to a file and study the data more carefully (update: it might be sufficient just to save the web page data to a file using your browser -- in fact, this might be more informative). Maybe some "invisible" byte is being thrown in somehow, and splitting up the two bytes of a utf8 character, which is the sort of thing you can only work out from a hex dump or similarly careful view of the data.
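A minimal sketch of that kind of inspection, assuming the page has been captured as raw bytes. The helper name find_literal_escapes and the sample string are made up for illustration; in real use the bytes would be read from the saved file in binary mode. It lists each literal "\xHH" escape with its byte offset, so two runs of the same request can be compared.

```perl
use strict;
use warnings;

# Hypothetical helper: report each literal "\xHH" escape found in captured
# page bytes, with its byte offset, so repeated runs can be compared.
sub find_literal_escapes {
    my ($bytes) = @_;
    my @hits;
    while ($bytes =~ /(\\x[0-9A-Fa-f]{2})/g) {
        push @hits, { offset => $-[1], text => $1 };
    }
    return @hits;
}

# Sample input (an assumption, modelled on the broken word quoted above):
# UTF-8 bytes for "Гла", then the literal text \xD0\xB2, then UTF-8 bytes
# for "ная".
my $page = "\xD0\x93\xD0\xBB\xD0\xB0" . '\xD0\xB2' . "\xD0\xBD\xD0\xB0\xD1\x8F";

for my $hit (find_literal_escapes($page)) {
    printf "offset %d: %s\n", $hit->{offset}, $hit->{text};
}
```

If the offsets are identical from run to run, the corruption is deterministic and can be traced back through the code that produced that part of the page.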

    Sorry I couldn't be more helpful -- this sort of symptom is a new one to me, and I wouldn't have expected it to be possible...

      Setting "$|" to "1" at the very beginning of the script didn't solve the problem: it fixed some characters but broke others. So it does make a difference.

      If I run the same CGI request multiple times, I get the same broken characters in the same places, i.e. the same stream.

      I have reviewed the stream with a hex editor. It starts with "FF FE".

      Here is something interesting.
                 Г     л     а     в     н     а     я
      Normal: 00 13 04 3B 04 30 04 32 04 3D 04 30 04 4F
      Broken: 00 13 04 3B 04 30 04 32 04 5C 00 78 00 44 00 30 00 5C 00 78 00 42 00 44 00 30 04 4F
                 Г     л     а     в     \     x     D     0     \     x     B     D     а     я

      I really can not explain this.
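One way to read the dump, sketched under stated assumptions: "FF FE" is the UTF-16LE byte order mark, so the saved file was presumably re-encoded to UTF-16 by whatever saved it. The snippet below realigns the "Broken" row to whole 16-bit units (dropping the leading 00 and assuming a trailing 04 for the last character) and decodes it. The result contains the ASCII characters backslash, x, D, 0 and so on, which suggests the escape is already literal text in the page rather than a split byte pair.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes of the "Broken" row, realigned to whole 16-bit units: the leading 00
# is dropped and a trailing 04 is assumed for the final character.
my $hex = '13 04 3B 04 30 04 32 04 5C 00 78 00 44 00 30 00 '
        . '5C 00 78 00 42 00 44 00 30 04 4F 04';

# Turn the hex digits into raw bytes, then interpret them as UTF-16LE.
my $bytes = pack 'H*', join '', split ' ', $hex;
my $text  = decode('UTF-16LE', $bytes);

# Encode to UTF-8 before printing so STDOUT receives bytes, not wide characters.
print encode('UTF-8', $text), "\n";
```

Since "\xD0\xBD" is the UTF-8 sequence for "н", something upstream appears to have rendered the two bytes of that character as a printable escape before the page was sent.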

Re^5: UTF-8 problem, some chars appear as \x..
by ikegami (Patriarch) on Feb 20, 2007 at 07:18 UTC

    AFAIK, MySQL returns correct UTF-8 data but doesn't set UTF-8 flag. What are other options to fix this?

    Don't pass a string of characters where a string of bytes is expected.

    my $octets_for_mysql = encode('UTF-8', $chars); # Option 1
    my $octets_for_mysql = encode('utf8', $chars);  # Option 2

    As for retrieving the data,

    my $chars = decode('UTF-8', $octets_from_mysql); # Option 1
    my $chars = decode('utf8', $octets_from_mysql);  # Option 2

    The difference between options 1 and 2 is explained in "UTF-8 vs. utf8" in Encode's docs.
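A small round trip with Option 1 may make the two directions concrete. This is only a sketch: no database is involved, and $octets_from_mysql simply stands in for a fetched value.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character string: "Главная" written as code points, so the source stays ASCII.
my $chars = "\x{413}\x{43B}\x{430}\x{432}\x{43D}\x{430}\x{44F}";

# Outbound: characters -> UTF-8 octets (what the database should receive).
my $octets_for_mysql = encode('UTF-8', $chars);

# Inbound: UTF-8 octets -> characters. The variable below stands in for a
# value fetched from the database.
my $octets_from_mysql = $octets_for_mysql;
my $round_trip = decode('UTF-8', $octets_from_mysql);

printf "characters: %d, octets: %d, round trip ok: %s\n",
    length($chars), length($octets_for_mysql),
    ($round_trip eq $chars ? 'yes' : 'no');
```

Each of the seven Cyrillic characters takes two octets in UTF-8, so the encoded string is 14 bytes long while the decoded string is 7 characters long.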

      Is it possible to decode the data on a lower level? I mean somewhere between DBI and Class-DBI so that Class-DBI accessors provide already decoded data.

      In the code I posted above, I changed Encode::_utf8_on($_) to $_ = Encode::decode_utf8($_) unless Encode::is_utf8($_), but now I get "Cannot decode string with wide characters at ..../Encode.pm line 162".
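A rough sketch of decoding a whole row in one place, before it reaches the accessors. The helper name decode_row is hypothetical, and where to hook it in between DBI and Class::DBI is left open here. It works on a copy rather than on $_ aliases, and skips values that are already character strings. Encode raises "Cannot decode string with wide characters" when asked to decode a string that contains code points above 255, which generally means the value had already been decoded.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Hypothetical helper: return a copy of a fetched row with every defined,
# not-yet-decoded value turned from UTF-8 octets into a character string.
sub decode_row {
    my ($row) = @_;
    my %decoded;
    for my $column (keys %$row) {
        my $value = $row->{$column};
        if (defined $value && !is_utf8($value)) {
            # A copy is decoded; the caller's data is left untouched.
            $value = decode('UTF-8', $value);
        }
        $decoded{$column} = $value;
    }
    return \%decoded;
}

# Sample row (made up): the title holds the UTF-8 octets for "Сервис".
my $row = decode_row({
    id    => 1,
    title => "\xD0\xA1\xD0\xB5\xD1\x80\xD0\xB2\xD0\xB8\xD1\x81",
});
printf "title has %d characters\n", length $row->{title};
```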