boboson has asked for the wisdom of the Perl Monks concerning the following question:

I am totally lost on ideas.
I am using CGI::Application, HTML::Template, XML::Simple, DBI, Class::PhraseBook etc. to:
The text displayed in my webpage will come from various sources such as: Example files:
XML::Simple XML config file example
<?xml version="1.0"?>
<config>
  <language>SE</language>
  <login_successRM>start</login_successRM>
  <logoutRM>start</logoutRM>
  <reg_missing>åäötest</reg_missing>
</config>
Class::PhraseBook XML language file example
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE phrasebook [
  <!ELEMENT phrasebook (dictionary)*>
  <!ELEMENT dictionary (phrase)*>
  <!ATTLIST dictionary name CDATA #REQUIRED>
  <!ELEMENT phrase (#PCDATA)>
  <!ATTLIST phrase name CDATA #REQUIRED>
]>
<phrasebook>
  <dictionary name="SE">
    <phrase name="1">startsida</phrase>
    <phrase name="2">live</phrase>
    <phrase name="3">musik</phrase>
    <phrase name="4">åäötest</phrase>
  </dictionary>
  <dictionary name="EN">
    <phrase name="1">home</phrase>
    <phrase name="2">live</phrase>
    <phrase name="3">music</phrase>
    <phrase name="4">test</phrase>
  </dictionary>
</phrasebook>
My problem is with XML encoding. It encodes everything as utf-8. When I try to display the XML content from my config file and phrasebook file, the Swedish letters å, Å, ä, Ä and ö, Ö do not display correctly.
I got a suggestion that I could output everything in utf-8 by sending a header stating the charset:
$self->header_add( -type => 'text/html; charset=utf-8' );
That did the trick for my XML output, but text from the HTML::Template files and the database no longer displays correctly. I could use the following codes for the Swedish letters å, ä and ö
&aring; &auml; &ouml;
but I can't rely on users of this webpage using those codes for å, ä and ö everywhere.
What can I do? I have tried using Encode without any success.

Replies are listed 'Best First'.
Re: XML encoding problem
by mirod (Canon) on Mar 14, 2006 at 12:48 UTC

    I think your problems come from mixing encodings here. You should convert everything to utf-8 as early as possible in the process, and stick to it. Modern web browsers should display the resulting HTML without any problem. Your terminal might not, though, if it doesn't support utf-8. You should probably upgrade it if you can; otherwise you will have to live with seeing a few @ in your files. A utf-8-enabled editor would display everything properly.

    You should convert static files to utf-8 (you can use iconv), and if possible database contents too. If you can't convert the db content, use Encode to convert the strings before using them.
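
    A minimal sketch of the Encode step, assuming the database hands back ISO-8859-1 bytes (check what your driver actually returns). The helper name latin1_bytes_to_utf8_bytes and the sample string are made up for illustration; decode and encode are the standard Encode functions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes a byte string assumed to be ISO-8859-1
# and returns the same text as UTF-8 encoded bytes.
sub latin1_bytes_to_utf8_bytes {
    my ($bytes) = @_;
    my $chars = decode('ISO-8859-1', $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);             # characters -> UTF-8 bytes
}

# "åäötest" as Latin-1 bytes, written with escapes so the encoding
# of this source file does not matter.
my $from_db    = "\xE5\xE4\xF6test";
my $for_output = latin1_bytes_to_utf8_bytes($from_db);
```

    The result can then go into the template alongside the XML strings, with the page served as charset=utf-8.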

    This seems like a lot of work upfront, but I really believe it is worth it. Once it's done you will have a clean system to work with.

Re: XML encoding problem
by pajout (Curate) on Mar 14, 2006 at 12:42 UTC
    My opinion is that the best way is to code all xml-files in utf-8, and output html too :>)
    Some years ago I solved a very similar problem by using DTDs as a dictionary. My convention was: if some file.xml has to be localized, its locale-specific content must be expressed as entities, defined in $locale/file.dtd. Consequently, I had en/file.dtd and cz/file.dtd, and softlinks file.xml => ../file.xml in both locale-specific directories. Finally, when I parsed cz/file.xml, I got Czech XML, and when I parsed en/file.xml, I got English XML. Just an idea...
Re: XML encoding problem
by izut (Chaplain) on Mar 14, 2006 at 12:48 UTC

    Aren't those codes valid even if a different charset is used? I think it is safer to use the codes, or to convert those Swedish letters to their respective HTML codes at runtime. I think HTML::Entities can help.

    Igor 'izut' Sutton
    your code, your rules.

Re: XML encoding problem
by wazoox (Prior) on Mar 14, 2006 at 17:40 UTC
    The problem comes from LibXML, which works with utf8 only. Generally, if you want to work with XML, you'd better stick to utf8 encoding. You don't need to rewrite all of your templates though; simply add something like this:
    use Unicode::String;
    Unicode::String->stringify_as('utf8');
    $text = Unicode::String::latin1($text);
    to convert your iso-8859-1/15 strings to utf8 on the fly.
      I am not sure what I will do, but this gave me something to work with.
      I created a small sub that converts the XML string to latin1 (which is the same as iso-8859-1??? or...)
      use Unicode::String qw(utf8 latin1);

      sub utf8_to_latin1 {
          # get arguments
          my ($s1) = @_;
          # create a utf8 string with initial value of s1
          # strange! it should already be in utf8
          my $utf8 = Unicode::String->new( $s1 );
          # convert to latin1
          my $s2 = Unicode::String::latin1($utf8);
          return $s2;
      }
      but I don't understand why I have to create a new utf8 string, when the XML string should already be in utf8?

        You're simply creating an object; you don't actually need it just to pass the string along. Use Unicode::String->stringify_as('latin1'); instead. See the Unicode::String documentation:

        $us = Unicode::String->new( $initial_value )
        (...)
        In general it is recommended to import and use one of the encoding specific constructor functions instead of invoking this method.
Re: XML encoding problem
by graff (Chancellor) on Mar 16, 2006 at 02:48 UTC
    I have tried using Encode without any success.

    Then you should write a minimal snippet of code, including an appropriate sample of your data, to show us how you tried it, what you expected to get, and what you actually got.

    Encode works, and anyone dealing with non-ascii character data is likely to find it important in their work. If you had no success with it, you are probably working under some simple misconception about your data or what the module actually does or how to use it. Show us what you tried.
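
    A minimal test of that kind might look like the sketch below. It assumes the input is the raw UTF-8 bytes for "åäö", which may not match what your XML module actually returns. The variable names and the hex_dump helper are made up for illustration; only decode and encode come from Encode.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: the UTF-8 bytes for "åäö", written as escapes so the
# encoding of this source file does not matter.
my $utf8_bytes = "\xC3\xA5\xC3\xA4\xC3\xB6";

my $chars  = decode('UTF-8', $utf8_bytes);    # bytes -> 3 Perl characters
my $latin1 = encode('ISO-8859-1', $chars);    # characters -> 3 Latin-1 bytes

# Hypothetical helper: a hex dump shows which encoding a string is really in.
sub hex_dump {
    my ($string) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $string);
}

printf "utf8 bytes:   %s\n", hex_dump($utf8_bytes);   # C3 A5 C3 A4 C3 B6
printf "latin1 bytes: %s\n", hex_dump($latin1);       # E5 E4 F6
```

    Dumping the bytes at each step, rather than looking at how the terminal or browser renders them, usually shows where the wrong assumption about the data is.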