weird character problems

MCS has asked for the wisdom of the Perl Monks concerning the following question:

I am using antiword to parse a word document:

open(ANTIWORD, "-|", "/usr/local/bin/antiword", "-mUTF-8.txt", $filename) or die "Couldn't fork: $!\n";

Unfortunately, certain characters come up as "<E2><80><99>" for ' when I pipe the output to more. If I just output it, I get " '". I obviously don't want "<E2><80><99>" or " '" (I don't want the extra space) to appear in my database, so I'd like to replace it with '. I tried the following:

$line =~ s/<E2><80><99>/'/g;

But it doesn't work. Also if I redirect the output to a file, instead of "<E2><80><99>" I get "?~@~Y" but

$line =~ s/\?~@~Y/'/g;

Doesn't work either... any ideas as to what is causing it and how I can fix it?

Re: weird character problems
by graff (Chancellor) on Jan 29, 2004 at 05:38 UTC

I'm not familiar with the command-line usage of "antiword". Do you know why you are specifying the option "-mUTF-8.txt"? It looks like a file name, but the intention is not clear.

In any case, it's true that the 3-byte sequence expressed in hex notation as E2 80 99 is in fact the UTF-8 encoding of the Unicode code point U+2019, the "Right Single Quotation Mark", in a section of the Unicode table called "General Punctuation". (I'm still trying to figure out why they are calling this "the preferred character to use for apostrophe".)
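
As a quick check of that byte sequence, here is a minimal sketch using the core Encode module. The variable names are only illustrative. It encodes U+2019 to UTF-8 and prints the resulting bytes in hex, then decodes the three bytes back into one character.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+2019 RIGHT SINGLE QUOTATION MARK as a one-character Perl string
my $char = "\x{2019}";

# Encode to UTF-8 bytes and show each byte in hex
my $bytes = encode('UTF-8', $char);
my $hex   = join ' ', map { sprintf '%02X', ord } split //, $bytes;
print "$hex\n";    # E2 80 99

# Decoding the three bytes gives back the single character
my $back = decode('UTF-8', "\xE2\x80\x99");
printf "U+%04X\n", ord $back;    # U+2019
```

This is also why a pattern containing the literal text "<E2><80><99>" cannot match: that notation is how the pager displays three raw bytes, not text that is present in the data.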

Since you appear to be using Perl 5.8, you could do the replacement as follows:

use Encode;   # (added this as an update -- you need it)
...

$_ = decode( 'utf8', $_ ); # make sure perl knows this is utf8 data

s/\x{2019}/'/g;   # put in the old-fashioned apostrophe

I'm sure you mean binmode ANTIWORD, ":utf8"; :)

by PodMaster (Abbot) on Jan 29, 2004 at 09:45 UTC

binmode ANTIWORD, ":utf8";
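
A rough sketch of how the decoding layer and the substitution fit together. Since antiword may not be installed, an in-memory filehandle with made-up sample bytes stands in for the ANTIWORD pipe here, and the stricter ":encoding(UTF-8)" layer is used in place of ":utf8". Both choices are assumptions for illustration, not part of the original suggestion.

```perl
use strict;
use warnings;

# Stand-in for antiword's output: raw UTF-8 bytes held in a scalar
# (hypothetical sample text, not real antiword output).
my $raw = "It\xE2\x80\x99s here\n";

# Open an in-memory handle in place of the ANTIWORD pipe.
open(my $fh, '<', \$raw) or die "Couldn't open in-memory handle: $!\n";

# Ask Perl to decode the bytes as they are read.
binmode($fh, ':encoding(UTF-8)');

my @lines;
while (my $line = <$fh>) {
    $line =~ s/\x{2019}/'/g;    # now a single character, not three bytes
    push @lines, $line;
}
close $fh;

print @lines;    # It's here
```

With the layer in place, each line arrives already decoded, so no explicit decode() call is needed before the substitution.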


Re: Re: weird character problems

by John M. Dlugosz (Monsignor) on Jan 29, 2004 at 21:05 UTC

I'm still trying to figure out why they are calling this "the preferred character to use for apostrophe"

That’s what I'm using in the first contraction in this sentence. Zoom in the browser window and see the difference in detail.


Re: Re: weird character problems

by MCS (Monk) on Jan 29, 2004 at 18:58 UTC

I specified the "-mUTF-8.txt" as a command line argument to use the UTF-8 mapping of keys. I'm using it because it makes other characters in the word document display properly before I parse it.


Re: weird character problems
by Roger (Parson) on Jan 29, 2004 at 04:57 UTC

$line =~ s/[\xE2\x80\x99]/'/g;  # convert each of E2,80,99 into ' (three ' per original character)

# or

$line =~ s/\xE2\x80\x99/'/g;   # convert hex sequence into '
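
To make the difference between the two substitutions concrete, here is a small sketch run on a made-up sample string of undecoded UTF-8 bytes. The sample text and variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample: undecoded UTF-8 bytes, as read from the pipe
# without any decoding layer.
my $bytes = "don\xE2\x80\x99t";

# Character class: each of the three bytes is replaced separately.
(my $class = $bytes) =~ s/[\xE2\x80\x99]/'/g;

# Literal sequence: the three bytes together are replaced once.
(my $seq = $bytes) =~ s/\xE2\x80\x99/'/g;

print "$class\n";    # don'''t
print "$seq\n";      # don't
```

Both patterns work on raw bytes. If the input has already been decoded (for example by a decoding layer on the filehandle), the data holds U+2019 as a single character and neither byte pattern will match it.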


Re: Re: weird character problems

by MCS (Monk) on Jan 29, 2004 at 18:48 UTC

Thanks... the second one works great. I figured they were special characters, I just didn't know what to do with them.
