MaskedMarauder has asked for the wisdom of the Perl Monks concerning the following question:

hi,
I get some input from a webform textarea element that I don't know what to do with. I suspect the users are cutting & pasting from an MS application of some sort, but the normal recoding tricks (that I'm familiar with) don't work. In particular, I have this example: text in normal UTF-8 with a couple of 3-byte blocks of stuff. I'm told they're "just" apostrophes, but I don't see it. The input would look like this:

blah blah blah <E2><80><9C> blah blah blah <E2><80><9D> blah blah blah

Does anyone recognize this encoding and, if so, is there a standard way of dealing with it? iconv with the usual MS encodings doesn't do the trick.
Thanks.

Re: character encoding question
by dave_the_m (Monsignor) on Jun 01, 2004 at 22:35 UTC
    blah blah blah <E2><80><9C> blah blah blah <E2><80><9D> blah blah blah
    This looks like a utf-8 encoding of the Unicode code points 0x201c and 0x201d, which correspond to opening and closing double quotes (see General Punctuation at the Unicode web site for more details).

    You can convert the 3 raw bytes into a single Unicode character as follows:

    use Encode;
    $str = "\xE2\x80\x9c";      # the three raw bytes
    $d = decode_utf8($str);     # one Unicode character
    print "ok\n" if length($d) == 1 && $d eq "\x{201c}";

    I don't know of any general way of converting obscure punctuation codes into simple near-equivalents.

    Dave.

      Thanks, this helped a lot. The content in question was being passed through HTML::FromText and getting snarled up. I figured that FromText would do the right thing with utf-8, so when it failed I thought it must be something exotic. Silly me. It turns out the machine this is on has Perl 5.6.1, which doesn't do utf-8 particularly well.

      Your discussion pointed the way for me to handle it, though; I'll just use a lookup table to handle the problem characters for the time being, until the Perl gets upgraded.

      John
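
      A minimal sketch of what such a lookup table could look like, working purely on raw bytes so it does not depend on Encode (which the 5.6.1 box would not have). The table name `%ascii_for` and the sub `asciify_quotes` are made up for illustration, and the two single-quote entries are an assumption about what else might turn up in pasted text; only the double-quote sequences appear in this thread.

```perl
use strict;
use warnings;

# Hypothetical table: raw UTF-8 byte sequences => plain ASCII stand-ins.
my %ascii_for = (
    "\xE2\x80\x9C" => '"',   # U+201C LEFT DOUBLE QUOTATION MARK
    "\xE2\x80\x9D" => '"',   # U+201D RIGHT DOUBLE QUOTATION MARK
    "\xE2\x80\x98" => "'",   # U+2018 LEFT SINGLE QUOTATION MARK (assumed)
    "\xE2\x80\x99" => "'",   # U+2019 RIGHT SINGLE QUOTATION MARK (assumed)
);

# One alternation built from the table keys, each escaped for regex use.
my $pattern = join '|', map { quotemeta($_) } keys %ascii_for;

# Replace every listed byte sequence with its ASCII stand-in.
sub asciify_quotes {
    my ($bytes) = @_;
    $bytes =~ s/($pattern)/$ascii_for{$1}/g;
    return $bytes;
}

my $in = "blah \xE2\x80\x9Cquoted\xE2\x80\x9D blah";
print asciify_quotes($in), "\n";
```

      Anything not in the table passes through untouched, so new problem characters can be added one entry at a time as they show up.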
Re: character encoding question
by kvale (Monsignor) on Jun 01, 2004 at 22:24 UTC
    Googling on <E2><80><9C> finds a beginning double quote on some German sites, and I assume that the other is an ending double quote. UTF-8 encodes some characters as 3-byte sequences, so you may want to start there.

    -Mark

      Thanks, it never occurred to me to google it. I should have said the '<...>' stuff was cut & pasted from catting in an xterm; I never thought to search for occurrences.
Re: character encoding question
by El Linko (Beadle) on Jun 01, 2004 at 23:01 UTC
    This looks like it's just the utf8 bytes coded in hex.
    $_=~s/<([a-f0-9][a-f0-9])>/chr(hex($1))/ieg;
    This will reverse it (as long as people aren't in the habit of submitting <E2> and meaning it).
    I think <E2><80><9D> is the so-called smart quote (I vaguely recall seeing it somewhere).
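
    To check what those bytes really are once the markers are reversed, something along these lines should work. It is only a sketch and assumes a Perl with the Encode module (5.8 or later), so it would not run on an older install; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $text = 'blah <E2><80><9C> blah <E2><80><9D> blah';

# Turn each <XX> marker back into the raw byte it stands for.
(my $bytes = $text) =~ s/<([0-9a-f]{2})>/chr(hex($1))/ieg;

# Decode the bytes as UTF-8. FB_CROAK dies on malformed input;
# LEAVE_SRC keeps decode() from modifying $bytes in place.
my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

# Print the code point of every non-ASCII character found.
printf "U+%04X\n", ord($_) for grep { ord($_) > 127 } split //, $chars;
```

    For the sample line this should report the two code points dave_the_m identified, 0x201c and 0x201d.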

Re: character encoding question
by Anonymous Monk on Jun 02, 2004 at 05:29 UTC
    What's the encoding of your webpage? You should specify one.
      It's specified as utf-8, but that doesn't do the Perl backend any good.