zanzibar has asked for the wisdom of the Perl Monks concerning the following question:

I have a CGI script. It uses CDBI + TT. Most of the text comes out fine, but some words contain \x.. instead of certain letters; generally there are 1-2 broken letters per page. The script has "use encoding 'utf8'", and "use utf8" appears in all my modules. All scripts and files are saved as UTF-8 (without BOM). I set the UTF-8 flag on CDBI data and decode all CGI parameters. Where could the problem be? Thanks.
  • Comment on UTF-8 problem, some chars appear as \x..

Replies are listed 'Best First'.
Re: UTF-8 problem, some chars appear as \x..
by Joost (Canon) on Feb 18, 2007 at 22:23 UTC

      I will try to rip some code to demonstrate the problem.

      One important note: there is no problem under ActivePerl on Windows. It only happens on Linux.

      Here is how I handle input data:

          ###################################
          package main;
          ###################################

          sub decode_string {
              my $string = shift;
              Encode::_utf8_off($string);
              my $encoding = Encode::Guess->guess($string);
              if (!ref($encoding)) {
                  $string = Encode::decode_utf8($string, 1);
              } else {
                  $string = $encoding->decode($string) if $encoding;
              }
              return $string;
          }

          CGI::param($_, map { decode_string($_) } (CGI::param($_)))
              foreach CGI::param;

          #####################################
          package My::DBI;
          #####################################

          sub fixutfflag {
              my $self = shift;
              do { Encode::_utf8_on($_); }
                  foreach @{$self}{$self->columns};
          }

          #####################################
          package My::DBI::Class;
          #####################################

          __PACKAGE__->add_trigger(select => sub {
              my $self = shift;
              $self->fixutfflag;
          });
        I think you would need to be completely confident about the quality and content of your data before using the "_utf8_on" function the way you do. In fact, I would almost never be that confident about any data, and data coming from a database gives me very little confidence at all.

        In any case, I tend to heed the warning in the Encode manual about the _utf8_off/on "internal" functions -- they are not intended to be part of the Encode API, and you shouldn't be using them at all.
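        To illustrate the supported route: when you know the bytes are meant to be UTF-8, decode them with Encode::decode instead of flipping the internal flag. A minimal sketch (the helper name decode_param is illustrative, not from the original code):

            use strict;
            use warnings;
            use Encode ();

            # Decode a byte string that is expected to be UTF-8.
            # FB_CROAK makes malformed input die loudly instead of
            # silently producing a wrongly-flagged string.
            sub decode_param {
                my ($bytes) = @_;
                return Encode::decode('UTF-8', $bytes, Encode::FB_CROAK);
            }

            my $bytes = "caf\xC3\xA9";   # the UTF-8 bytes for "café"
            my $text  = decode_param($bytes);
            print length($text), "\n";   # 4 characters, not 5 bytes

        Unlike _utf8_on, this actually validates the data, so a non-UTF-8 byte sequence fails at the decode step rather than corrupting output later.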

        It would help a lot if you could provide a data sample, and/or describe the problem in the data in more detail:

        • Do any wide (non-ASCII) characters come out correctly at all, or is it rather the case that the "1-2 broken letters per page" just happen to be all of the wide characters in the data?

        • When you say you find "\x.." instead of a character, does that really mean exactly two hex digits after the "\x", and do those hex numbers make sense as (Latin1 or other non-unicode) single-byte codepoints for characters that you would expect to see (like é)?

        You say you "set utf-8 flag for CDBI data and decode all CGI parameters", but you didn't show the code where you actually try to do this. Based on the code that you have shown so far, I'd say there's some chance that you've got a misunderstanding somewhere. It may be that the database you are fetching from does not really have data in utf8 form, or your output file handle is not set for utf8 discipline, and for one or more reasons, a needed conversion is not really happening.
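        For the output side, a sketch of what "setting the utf8 discipline" on the handle looks like (the mysql_enable_utf8 line is an assumption that the database is MySQL via DBD::mysql; other drivers have their own switches, and it only helps if the stored data really is UTF-8):

            use strict;
            use warnings;

            # Ensure everything printed to STDOUT is encoded as UTF-8
            # on the way out, instead of relying on the default layer.
            binmode STDOUT, ':encoding(UTF-8)';

            # For DBD::mysql, ask the driver to decode column data itself:
            # my $dbh = DBI->connect($dsn, $user, $pass,
            #                        { mysql_enable_utf8 => 1 });

            print "r\x{E9}sum\x{E9}\n";   # "résumé", emitted as valid UTF-8 bytes

        If the driver hands back properly decoded strings and the output handle has an :encoding layer, there is no need to touch the UTF8 flag anywhere in between.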

        BTW, when the cgi script sends stuff to the client browser, is the character encoding specified in the http header or in the html, and/or is the browser using the correct encoding when interpreting the data?
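        Declaring the charset in the HTTP header is a one-liner; a sketch (the raw print is the wire format, and the CGI.pm call in the comment produces an equivalent header):

            use strict;
            use warnings;

            binmode STDOUT, ':encoding(UTF-8)';

            # HTTP header: declare the charset before any body output.
            # With CGI.pm: print $q->header(-type => 'text/html',
            #                               -charset => 'utf-8');
            my $header = "Content-Type: text/html; charset=utf-8\r\n\r\n";
            print $header;

            # Belt and braces: repeat the declaration in the markup.
            print qq{<meta http-equiv="Content-Type" }
                . qq{content="text/html; charset=utf-8">\n};

        With the charset in the header, the browser no longer has to guess the encoding, which rules out one whole class of "looks broken only sometimes" symptoms.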