character encoding question

MaskedMarauder has asked for the wisdom of the Perl Monks concerning the following question:

hi,
I get some input from a webform textarea element that I don't know what to do with. I suspect the users are cutting & pasting from an MS application of some sort, but the normal recoding tricks (that I'm familiar with) don't work. In particular, I have this example: text in normal UTF-8 with a couple of 3-byte blocks of stuff. I'm told they're "just" apostrophes, but I don't see it. The input looks like this:

blah blah blah <E2><80><9C> blah blah blah <E2><80><9D> blah blah blah

Does anyone recognize this encoding and, if so, is there a standard way of dealing with it? iconv with the usual MS encodings doesn't do the trick.
Thanks.


Replies are listed 'Best First'.
Re: character encoding question by dave_the_m (Monsignor) on Jun 01, 2004 at 22:35 UTC
blah blah blah <E2><80><9C> blah blah blah <E2><80><9D> blah blah blah
This looks like a UTF-8 encoding of the Unicode code points 0x201c and 0x201d, which correspond to opening and closing double quotes (see General Punctuation at the Unicode web site for more details). You can convert the 3 raw bytes into a single Unicode character as follows: `$str = "\xE2\x80\x9c"; use Encode; $d = decode_utf8($str); print "ok\n" if length($d)== 1 && $d eq "\x{201c}";` I don't know any general way of converting obscure punctuation codes into simple near-equivalents. Dave.
Re: Re: character encoding question by MaskedMarauder (Acolyte) on Jun 02, 2004 at 05:15 UTC
Thanks, this helped a lot. The content in question was being passed through HTML::FromText and getting snarled up. I figured that FromText would do the right thing with UTF-8, so when it failed I thought it must be something exotic. Silly me. It turns out the machine this is on has Perl 5.6.1, which doesn't do UTF-8 particularly well. Your discussion pointed the way for me to handle it, though; I'll just use a lookup table to handle the problem characters for the time being until the Perl gets upgraded. John
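For anyone landing here later, the lookup table mentioned above might look roughly like the sketch below. It works on raw bytes only, so it should not need Encode or any 5.8 Unicode support. The table contents, the choice of ASCII replacements, and the helper name `asciify_quotes` are all assumptions for illustration, not what was actually deployed.

```perl
use strict;
use warnings;

# Hypothetical lookup table: raw UTF-8 byte sequences => plain ASCII stand-ins.
# The entries and replacements are assumptions; extend as needed.
my %ascii_for = (
    "\xE2\x80\x9C" => '"',   # U+201C LEFT DOUBLE QUOTATION MARK
    "\xE2\x80\x9D" => '"',   # U+201D RIGHT DOUBLE QUOTATION MARK
    "\xE2\x80\x98" => "'",   # U+2018 LEFT SINGLE QUOTATION MARK
    "\xE2\x80\x99" => "'",   # U+2019 RIGHT SINGLE QUOTATION MARK
);

# Build one alternation from the keys, escaping anything the regex
# engine might treat specially.
my $pattern = join '|', map { quotemeta($_) } keys %ascii_for;

# asciify_quotes is a made-up helper name, not part of any module.
sub asciify_quotes {
    my ($bytes) = @_;
    $bytes =~ s/($pattern)/$ascii_for{$1}/g;
    return $bytes;
}

my $input = "blah \xE2\x80\x9Cquoted\xE2\x80\x9D blah";
print asciify_quotes($input), "\n";   # blah "quoted" blah
```

Run the substitution before the text reaches HTML::FromText, so the module only ever sees plain ASCII punctuation.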
Re: character encoding question by kvale (Monsignor) on Jun 01, 2004 at 22:24 UTC
Googling on `<E2><80><9C>` finds a beginning double quote on some German sites, and I assume that the other is an ending double quote. UTF-8 has some 3-byte codes, so you may want to start there. -Mark
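To check the UTF-8 guess by hand, the three bytes can be unpacked with a bit of arithmetic. A 3-byte UTF-8 sequence has the bit layout 1110xxxx 10yyyyyy 10zzzzzz, and the code point is the x, y and z bits strung together. The sketch below assumes well-formed input and does no validation; `decode_3byte` is a made-up name.

```perl
use strict;
use warnings;

# Decode one 3-byte UTF-8 sequence by hand (no Encode needed).
# Layout: 1110xxxx 10yyyyyy 10zzzzzz  =>  code point xxxxyyyyyyzzzzzz
# decode_3byte is a hypothetical helper; it does not validate its input.
sub decode_3byte {
    my ($b1, $b2, $b3) = map { ord($_) } split //, $_[0];
    return (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F);
}

printf "U+%04X\n", decode_3byte("\xE2\x80\x9C");   # U+201C
printf "U+%04X\n", decode_3byte("\xE2\x80\x9D");   # U+201D
```

Both results fall in the General Punctuation block, which fits the opening and closing double quote reading.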
Re: Re: character encoding question by MaskedMarauder (Acolyte) on Jun 02, 2004 at 05:06 UTC
Thanks, it never occurred to me to google it. I should have said the '<...>' stuff was cut & pasted from catting in an xterm, and I never thought to search for occurrences.
Re: character encoding question by El Linko (Beadle) on Jun 01, 2004 at 23:01 UTC
This looks like it's just the UTF-8 bytes written out in hex. `$_=~s/<([a-f0-9][a-f0-9])>/chr(hex($1))/ieg;` This will reverse it (as long as people aren't in the habit of submitting <E2> and meaning it). I think <E2><80><9D> is the so-called smart quote (I vaguely recall seeing it somewhere).
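As a rough end-to-end check, the substitution above can be combined with a decode step to see which characters the dump really contains. This sketch assumes a Perl with the Encode module (5.8 or later), so it will not help on a 5.6.1 box; the sample string is the one from the question, shortened.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The line as it appeared in the xterm, with bytes shown as <XX>.
my $dump = 'blah <E2><80><9C> blah <E2><80><9D> blah';

# Turn each <XX> back into the raw byte it stands for.
(my $bytes = $dump) =~ s/<([a-f0-9]{2})>/chr(hex($1))/ieg;

# Decode the raw bytes as UTF-8 into Perl characters.
my $text = decode('UTF-8', $bytes);

# List the code points of any non-ASCII characters found.
my @found = map  { sprintf 'U+%04X', ord($_) }
            grep { ord($_) > 127 }
            split //, $text;
print "@found\n";   # U+201C U+201D
```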
Re: character encoding question by Anonymous Monk on Jun 02, 2004 at 05:29 UTC
What's the encoding of your webpage? You should specify one.
Re^2: character encoding question by MaskedMarauder (Acolyte) on Jun 03, 2004 at 02:49 UTC
It's specified as utf-8, but that doesn't do the Perl backend any good.