in reply to Re^4: UTF-8 problem, some chars appear as \x..
in thread UTF-8 problem, some chars appear as \x..

There are 1-2 characters (different ones each time) that appear as \x.. (where .. are the byte values) on some (not all) pages. E.g. "Гла\xD0\xB2ная", "Се\xD1\x80вис" rather than "Главная", "Сервис" (in Russian).

This happens both to data stored in MySQL and to text in TT templates, even when no CGI parameters have been passed and on pages which do not display data from the database. On some pages the same words (from the same sources) come out correctly.

Hrm. I think that helps. My first guess would be that there's a problem with buffering. The pairs of "\x.." characters you cited turn out to be the utf8 sequences for the intended Russian characters ("\xD0\xB2" is the utf8 sequence for Cyrillic "в", and "\xD1\x80" is the sequence for Cyrillic "р"). (updated to paste the correct Cyrillic character for "\xD0\xB2")
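
To double-check the byte pairs quoted above, here is a minimal sketch using the core Encode module. It only decodes the two pairs from the original post and reports the code points; the helper name codepoint_of is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a raw byte string as UTF-8 and return the code point of the
# resulting (single) character, e.g. "U+0432".
# codepoint_of is a hypothetical helper name, not part of any module.
sub codepoint_of {
    my ($bytes) = @_;
    my $char = decode('UTF-8', $bytes, Encode::FB_CROAK);
    return sprintf 'U+%04X', ord $char;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $bytes ("\xD0\xB2", "\xD1\x80") {
    # Show the bytes in hex next to the decoded character.
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    printf "%s => %s (%s)\n", $hex, codepoint_of($bytes), decode('UTF-8', $bytes);
}
```

If both pairs decode cleanly, the bytes themselves are valid UTF-8, and the question becomes why they show up as literal "\x.." text.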

So the problem would seem to be that, when your text is conveyed to the browser for display, there's some sort of interruption in the stream that causes the browser to treat a seemingly random utf8 character (byte pair) as if it were two unconnected bytes. But I don't see any clue yet about where that interruption is happening.

Since the problem appears even on data that does not come from MySQL, you can probably assume that the database and DBI are not involved.

You might start by setting the "$|" variable to non-zero, to turn off output buffering, and see if that makes a difference. If it doesn't, you'll need to pick a particular CGI request that always returns the same stream of web-page text; ideally, when you run that request and see corrupted characters, they will be at the same point in the page every time you repeat the request. (If they aren't, it gets harder to figure out what's going on.)
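
As a rough sketch of the "$|" suggestion, something like the following near the top of the script would do it. The header and the fixed test string are assumptions for illustration (not taken from your code), and the binmode line assumes the script prints decoded character strings.

```perl
use strict;
use warnings;

# Turn off output buffering. $| applies to the currently selected
# filehandle, which is STDOUT unless select() has been called.
$| = 1;

# Assumption: the script prints character strings, so an encoding layer
# on STDOUT turns them into UTF-8 bytes exactly once.
binmode STDOUT, ':encoding(UTF-8)';

# A fixed page body, so repeated requests should return the same stream.
# page_body is a hypothetical helper for this sketch.
sub page_body {
    return "<p>\x{413}\x{43B}\x{430}\x{432}\x{43D}\x{430}\x{44F}</p>\n";
}

print "Content-Type: text/html; charset=utf-8\r\n\r\n";
print page_body();
```

A fixed body like this takes the database and the templates out of the picture, so any corruption that remains has to come from the output path.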

Next, try command-line usage of the script, so you can redirect its STDOUT to a file and study the data more carefully (update: it might be sufficient just to save the web page data to a file using your browser -- in fact, this might be more informative). Maybe some "invisible" byte is being thrown in somehow, and splitting up the two bytes of a utf8 character, which is the sort of thing you can only work out from a hex dump or similarly careful view of the data.
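
If you don't have a hex dump tool handy, a few lines of Perl will do. This is only a sketch: hex_dump is a made-up helper, and the file name in the usage comment is hypothetical.

```perl
use strict;
use warnings;

# Format a byte string as hex, 16 bytes per line, each line prefixed
# with the offset of its first byte.
sub hex_dump {
    my ($bytes) = @_;
    my $out = '';
    for (my $pos = 0; $pos < length $bytes; $pos += 16) {
        my $chunk = substr $bytes, $pos, 16;
        my $hex   = join ' ', map { sprintf '%02X', ord } split //, $chunk;
        $out .= sprintf "%08X  %s\n", $pos, $hex;
    }
    return $out;
}

# Hypothetical usage: perl hexdump.pl saved_page.html
if (@ARGV) {
    # Read raw bytes, with no decoding layer in the way.
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    local $/;    # slurp the whole file
    print hex_dump(scalar <$fh>);
    close $fh;
}
```

Reading with the ':raw' layer matters here, because any decoding on input would hide exactly the bytes you are trying to inspect.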

Sorry I couldn't be more helpful -- this sort of symptom is a new one to me, and I wouldn't have expected it to be possible...

Re^6: UTF-8 problem, some chars appear as \x..
by zanzibar (Novice) on Feb 20, 2007 at 10:28 UTC

    Setting "$|" to "1" at the very beginning of the script didn't solve the problem. However, it fixed some characters and broke others, so it does make a difference.

    If I run the same CGI request multiple times, I get the same broken characters in the same places, i.e. the same stream.

    I have reviewed the stream with a hex editor. It starts with "FF FE".

    Here is something interesting.
            Г     л     а     в     н     а     я
    Normal: 00 13 04 3B 04 30 04 32 04 3D 04 30 04 4F
    Broken: 00 13 04 3B 04 30 04 32 04 5C 00 78 00 44 00 30 00 5C 00 78 00 42 00 44 00 30 04 4F
            Г     л     а     в     \     x     D     0     \     x     B     D     а     я

    I really can not explain this.