Re: converting utf-8 to ISO-8859-1

If the idea is to display it in an HTML page, the easiest thing to do, it seems to me, is to convert the multibyte characters to numeric character references. Here's one way to achieve that, assuming your string is properly marked as UTF-8:

s/([^\0-\x7F])/'&#' . ord($1) . ';'/eg;
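As a quick illustration of that substitution, here is a minimal sketch. It assumes perl 5.8 or later, so that Encode is available to produce a properly marked string; the sample bytes and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the raw UTF-8 bytes for "cafe" with e-acute,
# a space, and a euro sign.
my $bytes = "caf\xC3\xA9 \xE2\x82\xAC";

# decode() returns a character string that perl knows is Unicode text.
my $text = decode('UTF-8', $bytes);

# Replace every non-ASCII character with a numeric character reference.
(my $html = $text) =~ s/([^\0-\x7F])/'&#' . ord($1) . ';'/eg;

print "$html\n";   # caf&#233; &#8364;
```

The result is pure ASCII, so it displays correctly whatever charset the page declares.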

Now if the data didn't come out of the database marked as UTF-8, you've got a problem. My gut tells me you either have this case, or else you're using a perl older than 5.8.0, because 5.8.x would try to convert UTF-8 to Latin-1 when printing to the output handle (STDOUT), by default, on most systems (the exception being OSes that are set to treat all text files as UTF-8; RedHat did that for a while).
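To see what "not marked as UTF-8" does to the substitution, here is a small sketch. It assumes perl 5.8.1 or later for utf8::is_utf8; the sample bytes are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample: UTF-8 bytes for "cafe" with e-acute, straight
# from the database, with nothing telling perl that they are UTF-8.
my $raw = "caf\xC3\xA9";

my $marked = utf8::is_utf8($raw) ? 'marked' : 'not marked';
print "The string is $marked as UTF-8\n";   # not marked

# On an unmarked string the regex sees two separate bytes, so each
# byte gets its own (wrong) entity instead of a single &#233;.
(my $wrong = $raw) =~ s/([^\0-\x7F])/'&#' . ord($1) . ';'/eg;
print "$wrong\n";   # caf&#195;&#169;
```

If your pages show pairs of odd characters where one accented letter should be, this is the likely cause.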

Anyway: if you are sure a string is proper UTF-8 and it just isn't marked as such, the next snippet tells perl it is UTF-8, and works in both 5.6.x and 5.8.x:

$proper_utf8 = pack 'U0a*', $raw_from_database;

Even if that fixes your apparent problem, not all is well yet. Not every character in UTF-8 can be turned into Latin-1, so you still need the code from the top, modified to cover a reduced range (only the characters above \xFF):

s/([^\0-\xFF])/'&#' . ord($1) . ';'/eg;
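Putting the reduced range together with an explicit conversion to Latin-1 might look like the sketch below. It assumes perl 5.8 or later with Encode; the sample data and variable names are invented for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: e-acute fits in Latin-1, the euro sign does not.
my $text = decode('UTF-8', "caf\xC3\xA9 \xE2\x82\xAC");

# Only characters above \xFF become entities; the rest stay as they are.
(my $latin_safe = $text) =~ s/([^\0-\xFF])/'&#' . ord($1) . ';'/eg;

# Everything left fits in one byte, so this conversion cannot lose data.
my $out = encode('ISO-8859-1', $latin_safe);

print length($out), " bytes of Latin-1 output\n";   # 12 bytes of Latin-1 output
```

Here the e-acute ends up as the single byte \xE9, while the euro sign is written as &#8364;.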

And while you may already be jumping up and down because of this solution, you've still got a problem: you are assuming that the raw text from the database is valid HTML. You actually still need to escape it. In this case, it's enough to convert "&" to "&amp;" and "<" to "&lt;", though you may wish to handle ">", and maybe even "\"" too. If you are already using a CGI related module, it probably provides a function to do that. But it's simple enough to do it by hand.
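Doing it by hand could look like this minimal sketch. The sub name escape_html_by_hand and the %entity hash are hypothetical names chosen for the example. A single pass with a lookup table avoids the classic mistake of escaping "<" first and then escaping the "&" of the freshly made "&lt;" again.

```perl
use strict;
use warnings;

# The four characters worth escaping, and their replacements.
my %entity = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);

# Hypothetical helper: escape the HTML-special characters in one pass.
sub escape_html_by_hand {
    my ($s) = @_;
    $s =~ s/([&<>"])/$entity{$1}/g;
    return $s;
}

print escape_html_by_hand('AT&T says "1 < 2"'), "\n";
# AT&amp;T says &quot;1 &lt; 2&quot;
```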

In summary: you have to do that to every single string that comes out of the database and gets put into the HTML pages. Now that's a pain, huh? Here's a trick that I tend to use: use the Interpolation module to make a hash lookup call a wrapper function that handles it all. I don't like its import interface, but you can use it with tie too, with the module unmodified.

And here's the complete code:

use Interpolation;
tie %HTML, "Interpolation", \&escape;  # works with global or with lexical: my %HTML
{
    my %esc;
    BEGIN {
        %esc = ( '&' => '&amp;', '<' => '&lt;',
          '>' => '&gt;', '"' => '&quot;',   # if you want them
        );
    }
    sub escape {
        my $s = pack 'U0a*', shift;
        $s =~ s/([&<>"])/$esc{$1}/g;
        $s =~ s/([^\0-\x7F])/$esc{$1} ||= '&#' . ord($1) . ';'/ge;
        return $s;
    }
}

so your code becomes:

print qq(
<span class='textTitles'>$HTML{$article->{'title'}}</span><br>
Publication Date: $HTML{$article->{'pub_date'}}<br>
Author: $HTML{$article->{'author'}}<br>
Price: $HTML{PrintablePrice($format_list[0])}<br>
);

Note how the escape wrapper removes the need for tricks to embed code in a string — which is what Interpolation was designed for, anyway: no more @{[ ... ]}
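For anyone curious how a hash can call a function at all, here is a rough sketch of the mechanism using a plain tied hash. This is not how the Interpolation module is implemented; the package name Local::CallHash is hypothetical, and the escape routine is a simplified stand-in.

```perl
use strict;
use warnings;

{
    # Hypothetical minimal class: every hash lookup calls a code ref.
    package Local::CallHash;
    sub TIEHASH { my ($class, $code) = @_; return bless { code => $code }, $class }
    sub FETCH   { my ($self, $key)  = @_; return $self->{code}->($key) }
}

my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

# Looking up a "key" in %HTML really calls the sub with that key.
tie my %HTML, 'Local::CallHash', sub {
    my ($s) = @_;
    $s =~ s/([&<>"])/$entity{$1}/g;
    return $s;
};

my $title = 'Q&A <Draft>';
my $line  = "Title: $HTML{$title}<br>";
print "$line\n";    # Title: Q&amp;A &lt;Draft&gt;<br>

# The subscript is an ordinary expression, so function calls work too.
my $lower = "Title: $HTML{ lc $title }<br>";
print "$lower\n";   # Title: q&amp;a &lt;draft&gt;<br>
```

Because the subscript is evaluated before FETCH is called, any expression can sit inside the braces, which is what makes the trick work inside a quoted string.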

Re^2: converting utf-8 to ISO-8859-1 by graff (Chancellor) on Apr 22, 2005 at 04:31 UTC
You're right about converting non-ASCII characters to numeric references. As for making sure that perl interprets text data from the database as utf8, the Encode module would suffice: `use Encode; # ... get database text into $text and then: $text = decode( 'utf8', $text );` Assuming that the text data drawn from the database really is valid utf8, the "decode" call will set the utf8-flag, and then the regex substitution you showed will work fine.
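If the "really is valid utf8" assumption is in doubt, decode can be asked to check it. The sketch below assumes perl 5.8 or later; the helper name decode_or_undef and the sample byte strings are hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the decoded text, or undef if the bytes
# are not valid UTF-8. LEAVE_SRC keeps decode() from altering its input.
sub decode_or_undef {
    my ($bytes) = @_;
    my $text = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text;
}

my $good = decode_or_undef("caf\xC3\xA9");      # valid UTF-8
my $bad  = decode_or_undef("caf\xE9 au lait");  # a stray Latin-1 byte

print defined $good ? "good: " . length($good) . " characters\n" : "good: invalid\n";
print defined $bad  ? "bad: decoded\n" : "bad: not valid UTF-8\n";
```

A string that fails the check is probably already in Latin-1 or some other encoding, and should not be forced into UTF-8.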
Re^3: converting utf-8 to ISO-8859-1 by bart (Canon) on Apr 22, 2005 at 07:19 UTC
Yes, but Encode comes with 5.8.x only. Nothing comparable exists for earlier perls. My solution will work for 5.6.x too. Perls earlier than that are UTF-8-unaware, so for those a different solution still has to be handcrafted.