I will try to rip some code to demonstrate the problem.
One important note:
There is no problem on Windows ActivePerl. It only happens on Linux.
Here is how I handle input data:
###################################
package main;
###################################
use CGI ();
use Encode ();
use Encode::Guess;

sub decode_string {
    my $string = shift;
    Encode::_utf8_off($string);
    my $encoding = Encode::Guess->guess($string);
    if (!ref($encoding)) {
        # guess() returned an error message, so fall back to strict UTF-8 decoding
        $string = Encode::decode_utf8($string, 1);
    } else {
        $string = $encoding->decode($string) if $encoding;
    }
    return $string;
}

# Replace every CGI parameter with its decoded value
CGI::param($_, map { decode_string($_) } (CGI::param($_))) foreach CGI::param;

#####################################
package My::DBI;
#####################################
sub fixutfflag {
    my $self = shift;
    do {
        Encode::_utf8_on($_);
    } foreach @{$self}{$self->columns};
}

#####################################
package My::DBI::Class;
#####################################
__PACKAGE__->add_trigger( select => sub {
    my $self = shift;
    $self->fixutfflag;
});
I think you would need to be perfectly confident about the quality and content of your data in order to use the "_utf8_on" function the way you do. And in fact, I would almost never be that confident about any data. Stuff coming from a database does not give me much confidence at all.
In any case, I tend to heed the warning in the Encode manual about the _utf8_off/on "internal" functions -- they are not intended to be part of the Encode API, and you shouldn't be using them at all.
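If the database really is handing back raw UTF-8 octets, a safer alternative to flipping the flag is to decode them properly. Here is a minimal sketch (decode_columns is a hypothetical replacement for your fixutfflag, and it assumes the column values really are UTF-8 octets):

use Encode ();

sub decode_columns {
    my $self = shift;
    for my $value (@{$self}{$self->columns}) {
        # Decode the octets instead of forcing the internal flag on;
        # FB_CROAK makes malformed input fail loudly instead of silently.
        $value = Encode::decode('UTF-8', $value, Encode::FB_CROAK)
            if defined $value && !Encode::is_utf8($value);
    }
}

That way malformed data shows up as an error at the decode step rather than as broken letters in the browser.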
It would help a lot if you could provide a data sample, and/or describe the problem in the data in more detail:
- Do any wide (non-ASCII) characters come out correctly at all, or is it rather the case that the "1-2 broken letters per page" just happen to be all of the wide characters in the data?
- When you say you find "\x.." instead of a character, does that really mean exactly two hex digits after the "\x", and do those hex numbers make sense as (Latin1 or other non-unicode) single-byte codepoints for characters that you would expect to see (like é)?
You say you "set utf-8 flag for CDBI data and decode all CGI parameters", but you didn't show the code where you actually try to do this. Based on the code that you have shown so far, I'd say there's some chance that you've got a misunderstanding somewhere. It may be that the database you are fetching from does not really have data in utf8 form, or your output file handle is not set for utf8 discipline, and for one or more reasons, a needed conversion is not really happening.
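One way to find out where the conversion goes missing (a diagnostic sketch, not something from your code) is to dump the UTF-8 flag and the ordinals of a suspect string at various points in the pipeline:

use Encode ();

# Print the UTF-8 flag and the per-character ordinals (hex) of a string.
sub inspect_string {
    my ($label, $string) = @_;
    printf "%s: utf8 flag=%d ords=%s\n",
        $label, Encode::is_utf8($string) ? 1 : 0, sprintf('%vX', $string);
}

If the ordinals come out as byte pairs like D0.93, the string still holds undecoded UTF-8 octets; if they come out as single code points like 413, it has already been decoded to characters.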
BTW, when the cgi script sends stuff to the client browser, is the character encoding specified in the http header or in the html, and/or is the browser using the correct encoding when interpreting the data?
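For reference, a minimal sketch of declaring the charset in both places with CGI.pm (the header call itself is standard; the surrounding print statements are just illustration):

use CGI ();

# Declare UTF-8 in the HTTP header ...
print CGI::header(-type => 'text/html', -charset => 'UTF-8');

# ... and in the markup as well.
print qq{<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n};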
Thank you for your reply.
Sorry for not being clear about the problem.
99.9% of all non-ASCII characters come out correctly.
There are 1-2 characters (different ones each time) that appear as \x.. (where .. are hex bytes) on some (not all) pages. E.g. "Гла\xD0\xB2ная", "Се\xD1\x80вис" rather than "Главная", "Сервис" (in Russian).
This happens both to data stored in MySQL and to data in the TT templates, even when no CGI parameters have been passed, and on pages which do not display data from the database at all. On some pages the same words (from the same sources) come out correctly.
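Incidentally, \xD0\xB2 is exactly the UTF-8 byte pair for the missing letter "в" (U+0432), and \xD1\x80 for "р" (U+0440), so those particular strings seem to reach the output step as raw octets. A quick check confirms the byte values:

use Encode qw(encode);

# "в" is U+0432; its UTF-8 encoding is the byte pair D0 B2.
printf "%vX\n", encode('UTF-8', "\x{0432}");   # prints D0.B2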
All MySQL tables have utf8 charset and collation. I do 'set names utf8 collate utf8_general_ci' upon connection. AFAIK, MySQL returns correct UTF-8 data but doesn't set the UTF-8 flag. What are other options to fix this?
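Would DBD::mysql's mysql_enable_utf8 connect attribute be the right fix? A sketch of what I mean (assuming the attribute can be passed through CDBI's connection setup; the DSN and credentials are placeholders):

use DBI;

my ($user, $password) = ('dbuser', 'secret');   # placeholders

# With mysql_enable_utf8, DBD::mysql decodes result data and marks it as
# UTF-8 characters, so the _utf8_on trigger would no longer be needed.
my $dbh = DBI->connect(
    'dbi:mysql:database=mydb;host=localhost',   # hypothetical DSN
    $user, $password,
    { mysql_enable_utf8 => 1, RaiseError => 1 },
);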
For the client I set the UTF-8 encoding in the HTML files and send it in the HTTP header as well. The browser uses UTF-8 to display the data.
I have tried playing with STDOUT:
binmode STDOUT => ':raw';
binmode STDOUT => ':utf8';
binmode STDOUT => ':encoding(utf8)';
Somehow, each of these fixes the characters on some pages but breaks them on others.
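For reference, this is the kind of consistent setup I am trying to reach (a sketch of the intended convention, not the actual application code; 'name' is just an example parameter): decode every external input once at the boundary, keep character strings internally, and set a single encoding layer on STDOUT at startup.

use CGI ();
use Encode qw(decode);

# Set the output layer once, as early as possible.
binmode STDOUT, ':encoding(UTF-8)';

# Decode each incoming octet string exactly once at the boundary.
my $raw   = CGI::param('name');   # 'name' is just an example parameter
my $value = defined $raw ? decode('UTF-8', $raw, Encode::FB_CROAK) : undef;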