Re: weird character problems

Is that supposed to be a Perl "open()" statement that you posted there? Why do you put quotation marks around the file handle? (I've only ever seen a bareword in that position, not a quoted string literal. Personally, I'd get rid of the quotes around that first argument to open.)

I'm not familiar with the command-line usage of "antiword". Do you know why you are specifying the option "-mUTF-8.txt"? It looks like a file name, but the intention is not clear.

In any case, it's true that the 3-byte sequence written in hex as E2 80 99 is the UTF-8 encoding of the Unicode code point U+2019, the "Right Single Quotation Mark", in the block of the Unicode table called "General Punctuation". (I'm still trying to figure out why they are calling this "the preferred character to use for apostrophe".)
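If you want to check that byte-to-code-point claim yourself, here is a minimal sketch using the core Encode module. The variable names are just mine for illustration; nothing here comes from the original script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three raw bytes, as they would arrive from a file or pipe.
my $bytes = "\xE2\x80\x99";

# Decode the UTF-8 byte sequence into a Perl character string.
my $chars = decode('UTF-8', $bytes);

printf "length in bytes: %d\n", length $bytes;   # 3
printf "length in chars: %d\n", length $chars;   # 1
printf "code point: U+%04X\n", ord $chars;       # U+2019
```

Before decoding, Perl sees three separate bytes; after decoding, it sees one character, and that is what lets a pattern like \x{2019} match.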

Since you appear to be using Perl 5.8, you could do the replacement as follows:

use Encode;   # (added this as an update -- you need it)
...

$_ = decode( 'utf8', $_ ); # make sure perl knows this is utf8 data

s/\x{2019}/'/g;   # put in the old-fashioned apostrophe

And similarly for other "preferred forms" of Unicode punctuation, I'd expect. If, when your replacements are all done, there are no non-ASCII characters left in the data, then printing it to a terminal or whatever should show you what you expect to see. But if any non-ASCII (UTF-8 multibyte) data remains, you need a UTF-8-aware display tool to see these characters as they were meant to be seen.
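Here is a rough sketch of what "similarly for other forms" plus the leftover check could look like. The %ascii_for table and the asciify() helper are hypothetical names of my own, and the table only covers the curly single and double quotes; I'm guessing at which characters your Word documents actually contain, so extend it as needed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical mapping from typographic punctuation to ASCII stand-ins.
my %ascii_for = (
    "\x{2018}" => "'",    # left single quotation mark
    "\x{2019}" => "'",    # right single quotation mark
    "\x{201C}" => '"',    # left double quotation mark
    "\x{201D}" => '"',    # right double quotation mark
);

# Decode raw UTF-8 bytes, then swap each mapped character for its ASCII form.
sub asciify {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);
    $text =~ s/([\x{2018}\x{2019}\x{201C}\x{201D}])/$ascii_for{$1}/g;
    return $text;
}

my $line = asciify("It\xE2\x80\x99s \xE2\x80\x9Cquoted\xE2\x80\x9D");
print "$line\n";    # It's "quoted"

# Anything outside the 7-bit range still needs a UTF-8-aware display.
warn "non-ASCII characters remain\n" if $line =~ /[^\x00-\x7F]/;
```

If the warning fires on your real data, that tells you there are more characters to add to the table (or that you need a UTF-8-capable viewer).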



I'm sure you mean binmode ANTIWORD, ":utf8"; :)
by PodMaster (Abbot) on Jan 29, 2004 at 09:45 UTC

binmode ANTIWORD, ":utf8";
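To show what that buys you, here is a small self-contained sketch. Since I can't assume antiword is installed wherever this gets run, a temp file holding raw UTF-8 bytes stands in for the pipe, and I've used lexical filehandles ($out, $in) instead of the ANTIWORD bareword; those are my substitutions, not the original code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for the antiword pipe: a temp file holding raw UTF-8 bytes.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                        # write the bytes untouched
print {$out} "It\xE2\x80\x99s\n";
close $out or die "close failed: $!";

open my $in, '<', $path or die "cannot open $path: $!";
binmode $in, ':utf8';                # decode UTF-8 as the data is read
my $text = <$in>;
close $in;

$text =~ s/\x{2019}/'/g;             # no explicit decode() call needed
print $text;                         # It's
```

With the layer set on the handle, every line read is already a character string, so the decode() step in the parent post becomes unnecessary. Note that ':utf8' does not validate its input; ':encoding(UTF-8)' is the stricter alternative if the data might be malformed.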



Re: Re: weird character problems
by John M. Dlugosz (Monsignor) on Jan 29, 2004 at 21:05 UTC

I'm still trying to figure out why they are calling this "the preferred character to use for apostrophe"

That’s what I'm using in the first contraction in this sentence. Zoom in the browser window and see the difference in detail.


Re: Re: weird character problems
by MCS (Monk) on Jan 29, 2004 at 18:58 UTC

I specified "-mUTF-8.txt" as a command-line argument to use the UTF-8 mapping of keys. I'm using it because it makes other characters in the Word document display properly before I parse it.
