in reply to character encoding question

blah blah blah <E2><80><9C> blah blah blah <E2><80><9D> blah blah blah
This looks like the UTF-8 encoding of the Unicode code points U+201C and U+201D, which are the opening and closing double quotation marks (see General Punctuation at the Unicode web site for more details).

You can convert the 3 raw bytes into a single Unicode character as follows:

use Encode;

my $str = "\xE2\x80\x9c";      # the 3 raw UTF-8 bytes
my $d   = decode_utf8($str);   # decode them into a character string
print "ok\n" if length($d) == 1 && $d eq "\x{201c}";

I don't know of any general way to convert obscure punctuation codes into simple near-equivalents.

Dave.

Re: Re: character encoding question
by MaskedMarauder (Acolyte) on Jun 02, 2004 at 05:15 UTC
    Thanks, this helped a lot. The content in question was being passed through HTML::FromText and getting snarled up. I figured that FromText would do the right thing with UTF-8, so when it failed I thought it must be something exotic. Silly me. It turns out the machine this is on has Perl 5.6.1, which doesn't handle UTF-8 particularly well.

    Your discussion pointed the way for me to handle it, though: I'll just use a lookup table to handle the problem characters for the time being, until the Perl gets upgraded.

    John
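
The lookup-table idea mentioned above might look roughly like the sketch below. It is only an illustration, not John's actual code: the table contents, the %ascii_for hash and the asciify_punctuation sub are all made-up names, and the choice of ASCII stand-ins is an assumption. It works on raw bytes and avoids Encode, so it should not depend on the better UTF-8 support in later Perls.

```perl
use strict;

# Hypothetical table: raw UTF-8 byte sequences => plain ASCII stand-ins.
# Extend it with whatever problem characters actually turn up.
my %ascii_for = (
    "\xE2\x80\x98" => "'",     # U+2018 left single quotation mark
    "\xE2\x80\x99" => "'",     # U+2019 right single quotation mark
    "\xE2\x80\x9C" => '"',     # U+201C left double quotation mark
    "\xE2\x80\x9D" => '"',     # U+201D right double quotation mark
    "\xE2\x80\x93" => '-',     # U+2013 en dash
    "\xE2\x80\x94" => '--',    # U+2014 em dash
    "\xE2\x80\xA6" => '...',   # U+2026 horizontal ellipsis
);

# One alternation matching any byte sequence in the table.
my $pattern = join '|', map { quotemeta } keys %ascii_for;

# Replace every known byte sequence with its ASCII stand-in.
sub asciify_punctuation {
    my ($text) = @_;
    $text =~ s/($pattern)/$ascii_for{$1}/g;
    return $text;
}

print asciify_punctuation("blah \xE2\x80\x9Cquoted\xE2\x80\x9D blah"), "\n";
```

Running the text through something like this before handing it to HTML::FromText would keep the multi-byte sequences away from it entirely. Any sequence not in the table passes through unchanged.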