
If the idea is to display it in an HTML page, the easiest thing to do, it seems to me, is to convert the multibyte characters to numeric entities. Here's one way to achieve that, assuming your string is properly marked as UTF-8:

s/([^\0-\x7F])/'&#' . ord($1) . ';'/eg;
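
To see what that substitution does, here is a minimal sketch. It assumes the string is already a decoded character string; the sample text and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Assumption: $text is already a character string (decoded UTF-8).
my $text = "caf\x{E9} \x{263A}";    # e-acute (U+00E9) and a smiley (U+263A)

# Replace every character outside the ASCII range with a decimal entity.
(my $html = $text) =~ s/([^\0-\x7F])/'&#' . ord($1) . ';'/eg;

print "$html\n";    # caf&#233; &#9786;
```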

Now if the data didn't come out of the database marked as UTF-8, you've got a problem. My gut tells me that either this is your situation, or you're using a perl older than 5.8.0. From 5.8.0 on, perl by default tries to convert UTF-8 to Latin-1 when printing to the output handle (STDOUT) on most systems (the exception being OSes that are set to treat all text files as UTF-8; Red Hat did that for a while).

Anyway: if you are sure a string is proper UTF-8 and it just isn't marked as such, the next snippet works well to tell perl it is UTF-8, both in 5.6.x and 5.8.x:

$proper_utf8 = pack 'U0a*', $raw_from_database;
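
On a perl that ships the Encode module (5.8 and later), an alternative way to get the same result is to decode the bytes explicitly. This is a sketch of that swapped-in technique, not the pack approach above; the sample bytes stand in for whatever the database returned.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: these bytes stand in for the raw database value,
# i.e. "cafe" with an e-acute, encoded as UTF-8 (bytes C3 A9).
my $raw_from_database = "caf\xC3\xA9";

# decode() returns a character string that perl treats as UTF-8.
# Without a CHECK argument, malformed sequences become U+FFFD.
my $proper_utf8 = decode('UTF-8', $raw_from_database);

printf "%d bytes in, %d characters out\n",
    length($raw_from_database), length($proper_utf8);
```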

Even if that fixes your apparent problem, not all is well yet. Not every character in UTF-8 can be turned into Latin-1, so you still need the substitution from the top, modified to cover only the characters that don't fit (those above 0xFF):

s/([^\0-\xFF])/'&#' . ord($1) . ';'/eg;
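
A short sketch of the reduced range in action, assuming a decoded character string and using Encode (5.8 and later) to produce the Latin-1 bytes explicitly. The sample text is invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption: a decoded string with one character Latin-1 can hold
# (U+00E9) and one it cannot (U+20AC, the euro sign).
my $text = "caf\x{E9} 5\x{20AC}";

# Only characters above 0xFF need an entity; the rest map to Latin-1.
(my $safe = $text) =~ s/([^\0-\xFF])/'&#' . ord($1) . ';'/eg;

# Every remaining character fits in one byte, so nothing is lost here.
my $latin1_bytes = encode('ISO-8859-1', $safe);
```

The e-acute survives as the single byte 0xE9, while the euro sign becomes `&#8364;`.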

And while you may already be jumping up and down because of this solution, you still have a problem: you are assuming that the raw text from the database is valid HTML. You actually still need to escape it. In this case, it's enough to convert "&" to "&amp;" and "<" to "&lt;", though you may wish to handle ">" and maybe even "\"" too. If you are already using a CGI-related module, it probably provides a function to do that. But it's simple enough to do by hand.
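
Doing it by hand could look like the sketch below. The name escape_html is made up for this example and does not come from any module.

```perl
use strict;
use warnings;

# Map each HTML-special character to its entity.
my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

# escape_html is a hypothetical helper name chosen for this sketch.
sub escape_html {
    my ($s) = @_;
    $s =~ s/([&<>"])/$entity{$1}/g;   # single pass, so '&' is never escaped twice
    return $s;
}

print escape_html(q{AT&T <b>"rocks"</b>}), "\n";
# AT&amp;T &lt;b&gt;&quot;rocks&quot;&lt;/b&gt;
```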

As a summary: you have to do that to every single string that comes out of the database and gets put into the HTML pages. Now that's a pain, huh? Here's a trick that I tend to use: use Interpolation to make a hash lookup call a wrapper function that handles it all. I don't like its import interface, but you can use the unmodified module with tie instead.

And here's the complete code:

use Interpolation;
tie %HTML, "Interpolation", \&escape;  # works with global or with lexical: my %HTML
{
    my %esc;
    BEGIN {
        %esc = ( '&' => '&amp;', '<' => '&lt;',
          '>' => '&gt;', '"' => '&quot;',   # if you want them
        );
    }
    sub escape {
        my $s = pack 'U0a*', shift;
        $s =~ s/([&<>"])/$esc{$1}/g;
        $s =~ s/([^\0-\x7F])/$esc{$1} ||= '&#' . ord($1) . ';'/ge;
        return $s;
    }
}

So your code becomes:

print qq(
<span class='textTitles'>$HTML{$article->{'title'}}</span><br>
Publication Date: $HTML{$article->{'pub_date'}}<br>
Author: $HTML{$article->{'author'}}<br>
Price: $HTML{PrintablePrice($format_list[0])}<br>
);

Note how the escape wrapper removes the need for tricks to embed code in a string, which is what Interpolation was designed for anyway: no more @{[ ... ]}
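
If you'd like to see the mechanism without installing Interpolation, the idea of a hash whose lookups call a function can be sketched in core Perl with tie. CallOnFetch is a hypothetical package name invented for this sketch, not part of Interpolation, and the escape function here assumes its input is already a decoded character string.

```perl
use strict;
use warnings;

# CallOnFetch is a hypothetical package for this sketch only.
package CallOnFetch;

# Remember the code reference handed to tie.
sub TIEHASH { my ($class, $code) = @_; return bless { code => $code }, $class }

# Every hash lookup calls that code with the key as its argument.
sub FETCH   { my ($self, $key) = @_; return $self->{code}->($key) }

package main;

my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

sub escape {
    my ($s) = @_;
    $s =~ s/([&<>"])/$entity{$1}/g;                  # HTML-special characters
    $s =~ s/([^\0-\x7F])/'&#' . ord($1) . ';'/ge;    # non-ASCII characters
    return $s;
}

tie my %HTML, 'CallOnFetch', \&escape;

my $title = 'Tom & Jerry <uncut>';
my $line  = "Title: $HTML{$title}";
print "$line\n";   # Title: Tom &amp; Jerry &lt;uncut&gt;
```

Because the "key" is just the string to escape, the lookup interpolates inside double-quoted strings like any other hash element.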

In reply to Re: converting utf-8 to ISO-8859-1 by bart
in thread converting utf-8 (entities?) to ISO-8859-1 by hoffj
