s/([^\0-\x7F])/'&#' . ord($1) . ';'/eg;
Now if the data didn't come out of the database marked as UTF-8, you've got a problem. My gut tells me you either have this case, or else you're using a perl older than 5.8.0. I say that because 5.8.0 and later will, by default, try to convert UTF-8 to Latin-1 when a string is printed to the output handle (STDOUT) on most systems (the exception being some OSes that are set to treat all text files as UTF-8; RedHat did that for a while).
Anyway: if you have a string that you are sure is proper UTF-8 but that isn't marked as such, the next snippet works well to tell perl it is UTF-8, in both 5.6.x and 5.8.x:
$proper_utf8 = pack 'U0a*', $raw_from_database;
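As an aside, on 5.8 and later the core Encode module offers another way to get a character string out of raw UTF-8 bytes. The following is only a sketch of that alternative, not the pack idiom above; the sample value stands in for whatever your database driver actually returns.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed sample: the raw UTF-8 bytes for "cafe" with an e-acute.
my $raw_from_database = "caf\xC3\xA9";

# FB_CROAK dies on malformed input; LEAVE_SRC keeps the source string intact.
my $proper_utf8 = decode('UTF-8', $raw_from_database,
                         Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "%d bytes in, %d characters out\n",
    length($raw_from_database), length($proper_utf8);
# prints: 5 bytes in, 4 characters out
```

Unlike the pack trick, this validates the bytes, so a string that is not really UTF-8 is reported instead of silently mislabeled.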
Even if that fixes your apparent problem, not all is well yet. Not every character in UTF-8 can be turned into Latin-1, so you still need the code from the top, modified to leave the whole Latin-1 range alone and convert only the characters above it:
s/([^\0-\xFF])/'&#' . ord($1) . ';'/eg;
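To make the difference between the two ranges concrete, here is a small sketch. The helper name numeric_entities and the sample string are made up for illustration; the helper replaces every character above a given code point with a numeric entity.

```perl
use strict;
use warnings;

# Hypothetical helper: turn every character above $max into a numeric entity.
sub numeric_entities {
    my ($text, $max) = @_;
    $text =~ s/(.)/ord($1) > $max ? '&#' . ord($1) . ';' : $1/egs;
    return $text;
}

# Sample: e-acute exists in Latin-1, the euro sign (U+20AC) does not.
my $sample = "caf\x{E9} \x{20AC}5";

# ASCII-only output: both non-ASCII characters become entities.
print numeric_entities($sample, 0x7F), "\n";   # prints: caf&#233; &#8364;5

# Latin-1 output: the e-acute stays a raw character, only the euro sign is replaced.
print numeric_entities($sample, 0xFF), "\n";
```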
And while you may already be jumping up and down because of this solution, you still have a problem: you are assuming that the raw text from the database is valid HTML. You actually still need to escape it. In this case, it's enough to convert "&" to "&amp;" and "<" to "&lt;", though you may wish to handle ">" (as "&gt;"), and maybe even "\"" (as "&quot;") too. If you are already using a CGI related module, it probably provides a function to do that. But it's simple enough to do it by hand.
In summary: you have to do that to every single string that comes out of the database and gets put into the HTML pages. Now that's a pain, huh? Here's a trick that I tend to use: use Interpolation to make a hash call a wrapper function that handles it all. I don't like its import interface, but you can use it with tie too, with the module unmodified.
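Doing the escaping by hand could look roughly like this. It is a minimal sketch; the names %entity and escape_html are invented for the example.

```perl
use strict;
use warnings;

# The four characters worth escaping, mapped to their HTML entities.
my %entity = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);

# Hypothetical helper: escape a plain-text string for use in HTML.
sub escape_html {
    my ($text) = @_;
    # One pass over the string, so an "&" produced by a replacement is never escaped again.
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

print escape_html('AT&T <b>"bold"</b>'), "\n";
# prints: AT&amp;T &lt;b&gt;&quot;bold&quot;&lt;/b&gt;
```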
And here's the complete code:
use Interpolation;
tie %HTML, "Interpolation", \&escape;   # works with global or with lexical: my %HTML
{
    my %esc;
    BEGIN {
        %esc = (
            '&' => '&amp;',
            '<' => '&lt;',
            '>' => '&gt;',
            '"' => '&quot;',   # if you want them
        );
    }
    sub escape {
        my $s = pack 'U0a*', shift;
        $s =~ s/([&<>"])/$esc{$1}/g;
        $s =~ s/([^\0-\x7F])/$esc{$1} ||= '&#' . ord($1) . ';'/ge;
        return $s;
    }
}
Note how the escape wrapper removes the need for tricks to embed code in a string, which is what Interpolation was designed for anyway: no more @{[ ... ]}. So your code becomes:
print qq(
<span class='textTitles'>$HTML{$article->{'title'}}</span><br>
Publication Date: $HTML{$article->{'pub_date'}}<br>
Author: $HTML{$article->{'author'}}<br>
Price: $HTML{PrintablePrice($format_list[0])}<br>
);
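If you're curious how a hash lookup can end up calling a function at all, the general tied-hash mechanism can be sketched in a few lines. This is only an illustration using a made-up class name, CallOnFetch; it is not Interpolation's actual code.

```perl
use strict;
use warnings;

package CallOnFetch;   # hypothetical class, for illustration only

# tie() calls TIEHASH with the extra arguments; we just remember the code ref.
sub TIEHASH { my ($class, $code) = @_; return bless { code => $code }, $class }

# Every read of $hash{key} calls FETCH, which hands the key to the code ref.
sub FETCH   { my ($self, $key) = @_; return $self->{code}->($key) }

package main;

tie my %UPPER, 'CallOnFetch', sub { uc $_[0] };

# A hash element interpolates in a string, so the function call does too.
print "Hello, $UPPER{'world'}!\n";   # prints: Hello, WORLD!
```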
In reply to Re: converting utf-8 to ISO-8859-1
by bart
in thread converting utf-8 (entities?) to ISO-8859-1
by hoffj