Re: character encoding question

blah blah blah <E2><80><9C> blah blah blah <E2><80><9D> blah blah blah

This looks like the UTF-8 encoding of the Unicode code points U+201C and U+201D, the left and right double quotation marks (see the General Punctuation chart at the Unicode web site for more details).

You can convert the 3 raw bytes into a single Unicode character as follows:

use strict;
use warnings;
use Encode;

my $str = "\xE2\x80\x9c";    # the three raw UTF-8 bytes
my $d   = decode_utf8($str); # now a single character, U+201C

print "ok\n" if length($d) == 1 && $d eq "\x{201c}";

I don't know of any general way to convert obscure punctuation codes into simple near-equivalents.

Dave.


Replies are listed 'Best First'.
Re: Re: character encoding question by MaskedMarauder (Acolyte) on Jun 02, 2004 at 05:15 UTC
Thanks, this helped a lot. The content in question was being passed through HTML::FromText and getting snarled up. I figured that FromText would do the right thing with UTF-8, so when it failed I thought it must be something exotic. Silly me. It turns out the machine this is on has Perl 5.6.1, which doesn't handle UTF-8 particularly well. Your discussion pointed the way for me to handle it, though: I'll just use a lookup table for the problem characters for the time being, until Perl gets upgraded.

John
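For what it's worth, a lookup table along those lines might look something like the sketch below. It works on the raw bytes only, so it should not depend on how well the installed Perl understands UTF-8. Only the two double quotes come from this thread; the other table entries and the helper name to_ascii_punct() are made up for illustration, and the table would need to be extended for whatever characters actually turn up in the data.

```perl
use strict;
use warnings;

# Hypothetical lookup table: raw UTF-8 byte sequences => plain ASCII stand-ins.
# Only the two double quotes come from this thread; the other entries are
# illustrative additions.
my %ascii_for = (
    "\xE2\x80\x9C" => '"',    # U+201C left double quotation mark
    "\xE2\x80\x9D" => '"',    # U+201D right double quotation mark
    "\xE2\x80\x98" => "'",    # U+2018 left single quotation mark
    "\xE2\x80\x99" => "'",    # U+2019 right single quotation mark
    "\xE2\x80\x93" => '-',    # U+2013 en dash
    "\xE2\x80\x94" => '--',   # U+2014 em dash
);

# Build one alternation that matches any byte sequence in the table.
my $pattern = join '|', map { quotemeta } keys %ascii_for;

# to_ascii_punct() is a made-up helper name, not part of any module.
sub to_ascii_punct {
    my ($bytes) = @_;
    $bytes =~ s/($pattern)/$ascii_for{$1}/g;
    return $bytes;
}

print to_ascii_punct("blah \xE2\x80\x9Cquoted\xE2\x80\x9D blah"), "\n";
```

Byte sequences that are not in the table pass through untouched, so this is a stopgap for a known handful of characters and not a general conversion.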
