hoffj has asked for the wisdom of the Perl Monks concerning the following question:

I need to pull some info about an article out of a database and display it on a Web page, but I ran into a problem with non-English text.

The text was taken from a utf-8 XML document and put into an Oracle database. I want to get it out of the database and then display it in an HTML page as iso-8859-1.

Addition (4/22): After reviewing the xml documents, I found that the xml document contains entities and not the special characters (as I had previously thought).

For example: 'Jos&#233; El&#237;as' ('José Elías') was parsed from the XML document and stored in the db as 'José Elía'. When I query and display the text, it displays as 'José Elía'.

I want the text to display properly on the page, so I need to either convert it back to 'José Elía' or display it as 'Jos&#233; El&#237;a'. What is the best way to accomplish this?

This is only a problem with the title and author, the rest of the page renders properly as iso-8859-1.

Here is what the script is currently doing:

    ...
    print qq(
    <span class='textTitles'>$article->{'title'}</span><br>
    Publication Date: $article->{'pub_date'}<br>
    Author: $article->{'author'}<br>
    Price: @{[ PrintablePrice($format_list[0]) ]}<br>
    );
    ...

UPDATE (4/22):

Thanks for all of the responses.

FYI - I am using perl 5.8.

Using encode, decode and Bart's solution did not produce the proper results--the data was unchanged.

I have a major complication... see the addition above. I don't even know what the title of this question should be now. How can I get back to the correct entity or special character?

Replies are listed 'Best First'.
Re: converting utf-8 to ISO-8859-1
by bart (Canon) on Apr 22, 2005 at 02:17 UTC
    If the idea is to display it in an HTML page, the easiest thing to do, it seems to me, is to convert the multibyte characters to numeric entities. Here's one way to achieve that, assuming your string is properly marked as UTF-8:
    s/([^\0-\x7F])/'&#' . ord($1) . ';'/eg;

    Now if the data didn't come out of the database marked as UTF-8, you've got a problem. My gut tells me that either this is your case, or you're using a perl older than 5.8.0. With 5.8.x, a string properly marked as UTF-8 would by default be converted to Latin-1 when printed to the output handle (STDOUT) on most systems (the exception being some OSes that are set to treat all text files as UTF-8; RedHat did that for a while).

    Anyway: if you are sure a string is proper UTF-8 but it just isn't marked as such, the next snippet tells perl that it is UTF-8, and works in both 5.6.x and 5.8.x:

    $proper_utf8 = pack 'U0a*', $raw_from_database;

    Even if that fixes your apparent problem, not all is well yet. Not every character in UTF-8 can be turned into Latin-1, so you can still use the modified code from the top on a reduced range:

    s/([^\0-\xFF])/'&#' . ord($1) . ';'/eg;
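To make the difference between the two ranges concrete, here is a small sketch. The sample string and variable names are made up for illustration, and the string is built from escapes so that it is already a proper character string (nothing here comes from the database).

```perl
use strict;
use warnings;

# Character string containing e-acute (U+00E9) and the euro sign (U+20AC).
my $text = "Jos\x{E9} costs 5\x{20AC}";

# Everything outside ASCII becomes a numeric entity.
(my $ascii_only = $text) =~ s/([^\0-\x7F])/'&#' . ord($1) . ';'/eg;

# Only characters that do not fit in Latin-1 become entities;
# e-acute stays as the single character 0xE9.
(my $latin1_ok = $text) =~ s/([^\0-\xFF])/'&#' . ord($1) . ';'/eg;

print "$ascii_only\n";   # Jos&#233; costs 5&#8364;
```

The first form is the safer one if you are not sure which charset the browser will assume, since the output is plain ASCII.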

    And while you may already be jumping up and down because of this solution, you still have a problem: you are assuming that the raw text from the database is valid HTML. You actually still need to escape it. In this case, it's enough to convert "&" to "&amp;" and "<" to "&lt;", though you may wish to handle ">", and maybe even "\"" too. If you are already using a CGI-related module, it probably provides a function to do that. But it's simple enough to do by hand.

    As a summary: you have to do that to every single string that comes out of the database and gets put into the HTML pages. Now that's a pain, huh? Here's a trick that I tend to use: use Interpolation to make a hash call a wrapper function, that handles it all. I don't like its import interface, but you can use it with tie too, with the module unmodified.

    And here's the complete code:

    use Interpolation;
    tie %HTML, "Interpolation", \&escape;   # works with global or with lexical: my %HTML
    {
        my %esc;
        BEGIN {
            %esc = (
                '&' => '&amp;',
                '<' => '&lt;',
                '>' => '&gt;', '"' => '&quot;',   # if you want them
            );
        }
        sub escape {
            my $s = pack 'U0a*', shift;
            $s =~ s/([&<>"])/$esc{$1}/g;
            $s =~ s/([^\0-\x7F])/$esc{$1} ||= '&#' . ord($1) . ';'/ge;
            return $s;
        }
    }
    so your code becomes:
    print qq(
    <span class='textTitles'>$HTML{$article->{'title'}}</span><br>
    Publication Date: $HTML{$article->{'pub_date'}}<br>
    Author: $HTML{$article->{'author'}}<br>
    Price: $HTML{PrintablePrice($format_list[0])}<br>
    );
    Note how the escape wrapper removes the need for tricks to embed code in a string — which is what Interpolation was designed for, anyway: no more @{[ ... ]}
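For readers who want to see how the tied-hash trick works without installing anything, here is a minimal sketch. CallOnFetch is a hypothetical stand-in written for this example, not the Interpolation module itself, and escape_html only handles the four HTML metacharacters (no UTF-8 handling).

```perl
use strict;
use warnings;

# Hypothetical stand-in for Interpolation: a tied hash whose FETCH
# passes the key to a callback and returns whatever it produces.
package CallOnFetch;
sub TIEHASH { my ($class, $code) = @_; return bless { code => $code }, $class }
sub FETCH   { my ($self, $key) = @_;  return $self->{code}->($key) }

package main;

my %esc = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

sub escape_html {
    my ($s) = @_;
    $s =~ s/([&<>"])/$esc{$1}/g;   # single pass, so "&" is never escaped twice
    return $s;
}

tie my %HTML, 'CallOnFetch', \&escape_html;

my $title = 'Q&A <draft>';
print "Title: $HTML{$title}\n";    # Title: Q&amp;A &lt;draft&gt;
```

Because the "hash lookup" is really a function call, any expression can go inside the braces, which is why the @{[ ... ]} construct is no longer needed.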
      You're right about converting non-ascii characters to numeric references. As for making sure that perl interprets text data from the database as utf8, the Encode module would suffice:
      use Encode;
      # ... get database text into $text and then:
      $text = decode( 'utf8', $text );
      Assuming that the text data drawn from the database really is valid utf8, the "decode" call will set the utf8-flag, and then the regex substitution you showed will work fine.
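A short sketch of why the decode step matters, assuming the database hands back raw UTF-8 bytes (the sample bytes and variable names below are made up for the example). Without decoding, the substitution sees each byte as a separate character:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "Jos" plus e-acute, standing in for the database value.
my $raw = "Jos\xC3\xA9";

# Substitution on the undecoded bytes: two entities for one letter.
(my $before = $raw) =~ s/([^\0-\x7F])/'&#' . ord($1) . ';'/eg;

# Decode first, then substitute: one entity per character.
my $text = decode('utf8', $raw);
(my $after = $text) =~ s/([^\0-\x7F])/'&#' . ord($1) . ';'/eg;

print "$before\n";   # Jos&#195;&#169;
print "$after\n";    # Jos&#233;
```

If the output looks like the first line, the string was never marked as UTF-8.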
        Yes, but Encode comes with 5.8.x only. Nothing comparable exists for earlier perls. My solution will work for 5.6.x too.

        Earlier perls are UTF-8-unaware, so for those, a different solution has to be handcrafted, still.

Re: converting utf-8 to ISO-8859-1
by gaal (Parson) on Apr 21, 2005 at 20:17 UTC
    If your data is utf8, you have to convert it to iso-8859-1 yourself.

    use Encode;
    # encode() works on a character string, so decode the raw UTF-8 bytes first
    $eightbit = encode("iso-8859-1", decode("utf8", $utf8_data_from_database));
      I agree. But be aware that not everything which can be encoded in UTF-8 can also be encoded in 8859-1.
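One possible way to deal with characters that have no Latin-1 equivalent is the Encode module's FB_HTMLCREF fallback, which substitutes a numeric character reference instead of failing. This is only a sketch with made-up sample bytes, and it assumes an Encode version that provides that constant:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes: e-acute (exists in Latin-1), a space, and the euro sign
# (U+20AC, which Latin-1 cannot represent).
my $raw = "\xC3\xA9 \xE2\x82\xAC";

my $chars  = decode('utf8', $raw);
my $latin1 = encode('iso-8859-1', $chars, Encode::FB_HTMLCREF);

# Show non-printable and high bytes as \xNN so the result is readable anywhere.
(my $shown = $latin1) =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/eg;
print "$shown\n";   # \xE9 &#8364;
```

Note that this does not escape "&" or "<" in the data, so HTML escaping still has to be done separately.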
