zanzibar has asked for the wisdom of the Perl Monks concerning the following question:

I have a CGI script. It uses CDBI + TT. Most of the text comes out fine, but some words contain \x.. instead of certain letters; generally there are 1-2 broken letters per page. The script has "use encoding 'utf8'", and "use utf8" appears in all my modules. All scripts and files are saved as UTF-8 (without BOM). I set the UTF-8 flag on CDBI data and decode all CGI parameters. Where could the problem be? Thanks.
  • Comment on UTF-8 problem, some chars appear as \x..

Replies are listed 'Best First'.
Re: UTF-8 problem, some chars appear as \x..
by Joost (Canon) on Feb 18, 2007 at 22:23 UTC

      I will try to rip some code to demonstrate the problem.

      One important note: there is no problem under ActivePerl on Windows. It only happens on Linux.

      Here is how I handle input data:

          ###################################
          package main;
          ###################################

          sub decode_string {
              my $string = shift;
              Encode::_utf8_off($string);
              my $encoding = Encode::Guess->guess($string);
              if (!ref($encoding)) {
                  $string = Encode::decode_utf8($string, 1);
              } else {
                  $string = $encoding->decode($string) if $encoding;
              }
              return $string;
          }

          CGI::param($_, map { decode_string($_) } (CGI::param($_)))
              foreach CGI::param;

          #####################################
          package My::DBI;
          #####################################

          sub fixutfflag {
              my $self = shift;
              do { Encode::_utf8_on($_); }
                  foreach @{$self}{$self->columns};
          }

          #####################################
          package My::DBI::Class;
          #####################################

          __PACKAGE__->add_trigger(select => sub {
              my $self = shift;
              $self->fixutfflag;
          });
        I think you would need to be completely confident about the quality and content of your data before using the "_utf8_on" function the way you do. In fact, I would almost never be that confident about any data, and data coming from a database gives me very little confidence at all.

        In any case, I tend to heed the warning in the Encode manual about the _utf8_off/on "internal" functions -- they are not intended to be part of the Encode API, and you shouldn't be using them at all.
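        To illustrate the supported route: when you know the bytes are meant to be UTF-8, decode them with Encode::decode instead of flipping the internal flag. A minimal sketch (the helper name decode_param is illustrative, not from the original code):

            use strict;
            use warnings;
            use Encode ();

            # Decode a byte string that is expected to be UTF-8.
            # FB_CROAK makes malformed input die loudly instead of
            # silently producing a wrongly-flagged string.
            sub decode_param {
                my ($bytes) = @_;
                return Encode::decode('UTF-8', $bytes, Encode::FB_CROAK);
            }

            my $bytes = "caf\xC3\xA9";   # the UTF-8 bytes for "café"
            my $text  = decode_param($bytes);
            print length($text), "\n";   # 4 characters, not 5 bytes

        Unlike _utf8_on, this actually validates the data, so a non-UTF-8 byte sequence fails at the decode step rather than corrupting output later.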

        It would help a lot if you could provide a data sample, and/or describe the problem in the data in more detail:

        • Do any wide (non-ASCII) characters come out correctly at all, or is it rather the case that the "1-2 broken letters per page" just happen to be all of the wide characters in the data?

        • When you say you find "\x.." instead of a character, does that really mean exactly two hex digits after the "\x", and do those hex numbers make sense as (Latin1 or other non-unicode) single-byte codepoints for characters that you would expect to see (like é)?

        You say you "set utf-8 flag for CDBI data and decode all CGI parameters", but you didn't show the code where you actually try to do this. Based on the code that you have shown so far, I'd say there's some chance that you've got a misunderstanding somewhere. It may be that the database you are fetching from does not really have data in utf8 form, or your output file handle is not set for utf8 discipline, and for one or more reasons, a needed conversion is not really happening.
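        For the output side, a sketch of what "setting the utf8 discipline" on the handle looks like (the mysql_enable_utf8 line is an assumption that the database is MySQL via DBD::mysql; other drivers have their own switches, and it only helps if the stored data really is UTF-8):

            use strict;
            use warnings;

            # Ensure everything printed to STDOUT is encoded as UTF-8
            # on the way out, instead of relying on the default layer.
            binmode STDOUT, ':encoding(UTF-8)';

            # For DBD::mysql, ask the driver to decode column data itself:
            # my $dbh = DBI->connect($dsn, $user, $pass,
            #                        { mysql_enable_utf8 => 1 });

            print "r\x{E9}sum\x{E9}\n";   # "résumé", emitted as valid UTF-8 bytes

        If the driver hands back properly decoded strings and the output handle has an :encoding layer, there is no need to touch the UTF8 flag anywhere in between.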

        BTW, when the cgi script sends stuff to the client browser, is the character encoding specified in the http header or in the html, and/or is the browser using the correct encoding when interpreting the data?
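        Declaring the charset in the HTTP header is a one-liner; a sketch (the raw print is the wire format, and the CGI.pm call in the comment produces an equivalent header):

            use strict;
            use warnings;

            binmode STDOUT, ':encoding(UTF-8)';

            # HTTP header: declare the charset before any body output.
            # With CGI.pm: print $q->header(-type => 'text/html',
            #                               -charset => 'utf-8');
            my $header = "Content-Type: text/html; charset=utf-8\r\n\r\n";
            print $header;

            # Belt and braces: repeat the declaration in the markup.
            print qq{<meta http-equiv="Content-Type" }
                . qq{content="text/html; charset=utf-8">\n};

        With the charset in the header, the browser no longer has to guess the encoding, which rules out one whole class of "looks broken only sometimes" symptoms.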