in reply to weird character problems
I'm not familiar with the command-line usage of "antiword". Do you know why you are specifying the option "-mUTF-8.txt"? It looks like a file name, but the intention is not clear.
In any case, it's true that the 3-byte sequence expressed in hex notation as E2 80 99 is in fact the UTF-8 encoding of the Unicode code point U+2019, the "Right Single Quotation Mark", in a section of the Unicode table called "General Punctuation". (I'm still trying to figure out why they are calling this "the preferred character to use for apostrophe".)
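If you want to check the byte-to-code-point claim yourself, something like the following minimal sketch should do it. It assumes only the core Encode module; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three raw bytes as they would arrive in utf8-encoded input.
my $bytes   = "\xE2\x80\x99";
my $n_bytes = length $bytes;

# Decode the byte string into a Perl character string.
my $chars = decode( 'utf8', $bytes );

printf "bytes: %d, characters: %d, code point: U+%04X\n",
    $n_bytes, length($chars), ord($chars);
# prints "bytes: 3, characters: 1, code point: U+2019"
```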
Since you appear to be using Perl 5.8, you could do the replacement as follows:

    use Encode;                 # (added this as an update -- you need it)
    ...
    $_ = decode( 'utf8', $_ );  # make sure perl knows this is utf8 data
    s/\x{2019}/'/g;             # put in the old-fashioned apostrophe

And similarly for other "preferred forms" of unicode punctuation, I'd expect. If, when your replacements are all done, there are no non-ASCII characters left in the data, then printing it to a terminal or whatever should show you what you expect to see. But if any non-ASCII (utf8 multibyte) data remains, you need a utf8-aware display tool to see these characters as they were meant to be seen.
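To handle several of these punctuation characters in one pass and then check whether anything non-ASCII is left over, a rough sketch might look like this. The mapping table and the `asciify_punctuation` helper are my own hypothetical names, not anything from antiword or Encode, and the choice of ASCII stand-ins is an assumption you would adjust to your data.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical mapping; extend it to whatever punctuation your data contains.
my %ascii_for = (
    "\x{2018}" => "'",      # left single quotation mark
    "\x{2019}" => "'",      # right single quotation mark
    "\x{201C}" => '"',      # left double quotation mark
    "\x{201D}" => '"',      # right double quotation mark
    "\x{2013}" => '-',      # en dash
    "\x{2026}" => '...',    # horizontal ellipsis
);

# Takes raw utf8 bytes; returns the cleaned text and the number of
# non-ASCII characters that are still present after the replacements.
sub asciify_punctuation {
    my ($bytes) = @_;
    my $text  = decode( 'utf8', $bytes );    # bytes -> characters
    my $class = join '', keys %ascii_for;    # build a character class
    $text =~ s/([$class])/$ascii_for{$1}/g;
    my $leftover = () = $text =~ /[^\x00-\x7F]/g;
    return ( $text, $leftover );
}

my ( $clean, $leftover ) =
    asciify_punctuation("It\xE2\x80\x99s a \xE2\x80\x9Ctest\xE2\x80\x9D");
print "$clean\n";
print "non-ASCII characters remaining: $leftover\n";
```

If the leftover count comes back as zero, the result is plain ASCII and safe to print anywhere; if not, that is the case where a utf8-aware display tool is needed.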
I'm sure you mean binmode ANTIWORD, ":utf8"; :)
by PodMaster (Abbot) on Jan 29, 2004 at 09:45 UTC

Re: Re: weird character problems
by John M. Dlugosz (Monsignor) on Jan 29, 2004 at 21:05 UTC

Re: Re: weird character problems
by MCS (Monk) on Jan 29, 2004 at 18:58 UTC