boboson has asked for the wisdom of the Perl Monks concerning the following question:

before I go mad, I'll try to post a question here:
The Swedish characters from my mysql database doesn't display correctly on my webpage.

I use a MySQL 5.0.32 database with the following perl modules: CGI::Application, DBI, HTML::Template etc.

The servers character set is latin1. I have the following line in my .htacces file:

AddCharset UTF-8 html

This line doesn't do much more than set my html files to utf-8, my templatefiles are with the extension phtml, and these files are outputed through index.cgi

I have set the meta tag in my template files to:

<?xml version="1.0" encoding="utf-8" ?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w +3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="se" xml:lang="se"> <head> <title></title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" + />

In my CGI::Application BaseClass i have the following code:

use utf8; # to state that the script itself is in utf8 binmode STDIN, ":encoding(utf8)"; binmode STDOUT, ":encoding(utf8)"; use as_utf8; # which is the hack posted at http://www.perlmonks.org/?n +ode_id=651574

My connection to the database looks like this:

# Connect to the database my $dbh = DBI->connect($dsn, $user, $password, { RaiseError => 1, Auto +Commit => 0, mysql_enable_utf8 => 1 }) or die "Can't connect: ", $DBI::errstr; my $sql = qq{SET NAMES 'utf8' COLLATE 'utf8_swedish_ci';}; $dbh->do($sql);

SHOW VARIABLES LIKE 'c%' gives me the following output: Variable_name Value character_set_client utf8 character_set_connection utf8 character_set_database utf8 character_set_filesystem binary character_set_results utf8 character_set_server latin1 character_set_system utf8 character_sets_dir /usr/share/mysql/charsets/ collation_connection utf8_unicode_ci collation_database utf8_swedish_ci collation_server latin1_swedish_ci completion_type 0 concurrent_insert 1 connect_timeout 5

When I try to output the following text:

Törjebjöåärne Ålandssäöna

which was inserted from my perl code, I'll get the following output on my webpage:

Törjebjöåärne Ã&#133;landssäöna

If I decode the mysql output:

use Encode; decode_utf8($tmpl->output);

The mysql data will show correctly, but my template files will show with the reverse questionmark sign.

I guess I could decode the data from my database and it would all work, but not all data should be decoded, just the varchar and text fields.

What is the correct way of doing this?
What am I missing?

Replies are listed 'Best First'.
Re: UTF-8 webpage output from MySQL
by moritz (Cardinal) on Jan 22, 2008 at 12:09 UTC
    The normal workflow is:
    1. Decode all incoming data (with Encode::decode or with an IO layer)
    2. Work with your data
    3. Encode all outgoing data (with Encode::encode or an IO layer)

    The problem is that HTML::Template doesn't support step one - it always reads templates as binary data. Once you mix that with decoded data, you're lost.

    One solution is to use HTML::Template::Compiled, which is a drop-in replacement for HTML::Template, and which has the open_mode option to new - just create your templates with

    use HTML::Template::Compiled; my $t = HTML::Template::Compiled->new( filename => 'mytemplate.phtml', open_mode => '<:encoding(UTF-8)', );

    Another "solution" is to encode every string that is passed to HTML::Template, but this will make your code explode (in terms of size, anyway).

    Update: a few debugging tips when dealing with charset issues:

    • In firefox press Ctrl+I to see the encoding that is actually used to display the page
    • Check your error.log (or STDERR) for "wide character in print"-warnings - it means that you forgot an IO-Layer or a encode_utf8 somewhere.
    • Use Devel::Peek to inspect your data - watch out for that UTF-8 flag

    2nd update: you might have confused "encode" and "decode" - you have to decode input data from the outside (Foreign data -> Perl text strings) and you have to encode data in the other direction (Perl text strings -> Rest of World).

      I thought these lines in my CGI::Application baseclass took care of the input, output encoding
      binmode STDIN, ":encoding(utf8)"; binmode STDOUT, ":encoding(utf8)";
      and that my main problem was the data from the database.
        In CGI scripts STDIN is only used to read POST data, so that's not all that interesting.

        But the problems with the templates remain - as long as you use HTML::Template, you'll have to be very careful not to mix binary and text strings. So if you don't want to waste your sanity on charset issues, you should really switch to a template system that is aware to character encodings.

        And since your templates have HTML::Template syntax I recommend one of the drop-in replacements, that is HTML::Template::Compiled or Template::Alloy.

        The line decode_utf8($tmpl->output); in the OP demonstrates that you decode the template's output. So if HTML::Template provides you with binary data, and DBI returns upgraded data (aka text strings), your problem actually occured much earlier.

Re: UTF-8 webpage output from MySQL
by Joost (Canon) on Jan 22, 2008 at 13:22 UTC
    I don't know anything about HTML::Template with utf8, but as for mysql:

    Make sure you've got the latest DBD::MySQL (or at least 4.004). Use the mysql_enable_utf8 option. Make sure all your utf8 columns really have an utf8 type.

    Use place holders/bind vars and make sure all text inserted on utf8 columns really is utf8. You can use utf8::upgrade for that.

    Selects should work out of the box for text and binary.

Re: UTF-8 webpage output from MySQL
by KurtSchwind (Chaplain) on Jan 22, 2008 at 14:02 UTC

    I'm not writing this in answer to your question, as it looks like others have beat me to the punch. No, this comment is about how you posed your question.

    You did a GREAT job of giving all the relevant information and being very precise and concise with your post. Kudos to you. This question could be a poster child for How To Ask A Question.

    --
    I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
      thanks alot!
Re: UTF-8 webpage output from MySQL
by Anonymous Monk on Jan 22, 2008 at 23:22 UTC

    I better take this question all the way to the end. After reading the comments I've decided to try the Template::Alloy module. I am using CGI::Applications load_tmpl function to load my templates with HTML::Template. Here is an example of a basic function with the call to load_tmpl:

    sub showMemberStart { # application object my $self = shift; # get cgi query object my $q = $self->query(); my %post = $q->Vars; # get session object my $session = $self->session; # database handle my $dbh = $self->param('dbh'); # load template files my $tmpl = $self->param('member_start' => $self->load_tmpl('member_sta +rt.phtml', die_on_bad_params => 0)); # output return $tmpl->output; }

    How would I go about implementing Template::Alloy into this code? I rather not change the following rows:

    # load template files my $tmpl = $self->param('member_start' => $self->load_tmpl('member_sta +rt.phtml', die_on_bad_params => 0));

    Should I override the load_tmpl function? Anybody have an idea how this would be done?

      I think overriding load_tmpl is a good idea.

      Look in the source code what that method does, and in your version substitute the call to HTML::Template->new(...) by a call to Template::Alloy->new(..., ENCODING => 'UTF-8').

        Ok, I did an override of the load_tmpl function like this:

        sub load_tmpl { my $self = shift; my ($tmpl_file, @extra_params) = @_; # add tmpl_path to path array if one is set, otherwise add a path +arg if (my $tmpl_path = $self->tmpl_path) { my @tmpl_paths = (ref $tmpl_path eq 'ARRAY') ? @$tmpl_path : $ +tmpl_path; my $found = 0; for( my $x = 0; $x < @extra_params; $x += 2 ) { if ($extra_params[$x] eq 'path' and ref $extra_params[$x+1] eq 'ARRAY') { unshift @{$extra_params[$x+1]}, @tmpl_paths; $found = 1; last; } } push(@extra_params, path => [ @tmpl_paths ]) unless $found; } my %tmpl_params = (); my %ht_params = @extra_params; %ht_params = () unless keys %ht_params; # Define our extension if doesn't already exist; $self->{__CURRENT_TMPL_EXTENSION} = '.html' unless defined $self-> +{__CURRENT_TMPL_EXTENSION}; # Define a default templat name based on the current run mode unless (defined $tmpl_file) { $tmpl_file = $self->get_current_runmode . $self->{__CURRENT_TM +PL_EXTENSION}; } $self->call_hook('load_tmpl', \%ht_params, \%tmpl_params, $tmpl_fi +le); #require HTML::Template; # let's check $tmpl_file and see what kind of parameter it is - we # now support 3 options: scalar (filename), ref to scalar (the # actual html/template content) and reference to FILEHANDLE #my $t = undef; #if ( ref $tmpl_file eq 'SCALAR' ) { # $t = HTML::Template->new_scalar_ref( $tmpl_file, %ht_params ) +; #} elsif ( ref $tmpl_file eq 'GLOB' ) { # $t = HTML::Template->new_filehandle( $tmpl_file, %ht_params ) +; #} else { # $t = HTML::Template->new_file($tmpl_file, %ht_params); #} require Template::Alloy; my $t = undef; if ( ref $tmpl_file eq 'SCALAR' ) { $t = Template::Alloy->new(type => 'filename', source => + $tmpl_file, %ht_params, ENCODING => 'UTF-8'); } elsif ( ref $tmpl_file eq 'GLOB' ) { $t = Template::Alloy->new(type => 'scalarref', source => $t +mpl_file, %ht_params, ENCODING => 'UTF-8'); } else { $t = Template::Alloy->new(type => 'filename', source => $tmpl_ +file, %ht_params, ENCODING => 'UTF-8'); } if (keys %tmpl_params) { $t->param(%tmpl_params); } return $t; }

        It works, but I still get strange characters. My template files displays the strange question mark symbol (firefox), square emtpy box (IE) instead of å, ä and ö.

        &#65533;ndra ditt l&#65533;senord

        The data from my database displays as before

        Törjebjöåärne

        I have tried to save my template files as UTF-8 without BOM, without any success. Am I doing some kind of double encoding? Why does it give me different characters, sometimes an ö will give me &#65533; and sometimes ö