Is that supposed to be a Perl "open()" statement that you posted there? Why do you put quotation marks around the file handle? I've only ever seen a bareword in that position, not a quoted (literal) string. Personally, I'd get rid of the quotes around that first arg to open.
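
For what it's worth, here is a minimal sketch of what I mean by an unquoted file handle. The temp file and the variable names are made up purely so the snippet runs on its own; they are not taken from your code. It shows the bareword form next to the lexical form (`my $in`), which has been available since Perl 5.6:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file, created only so this sketch is runnable.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "first line\n";
close $tmp_fh or die "close failed: $!";

# Bareword file handle: no quotation marks around the first arg.
open(IN, '<', $path) or die "Cannot open $path: $!";
my $from_bareword = <IN>;
close IN;

# Lexical file handle: a scalar variable instead of a bareword.
open(my $in, '<', $path) or die "Cannot open $path: $!";
my $from_lexical = <$in>;
close $in;

print $from_bareword eq $from_lexical ? "same\n" : "different\n";
```

Either form reads the same data; the lexical handle just goes out of scope like any other variable.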

I'm not familiar with the command-line usage of "antiword". Do you know why you are specifying the option "-mUTF-8.txt"? It looks like a file name, but the intention is not clear.

In any case, it's true that the 3-byte sequence expressed in hex notation as E2 80 99 is in fact the UTF-8 form of the Unicode code point U+2019, the "Right Single Quotation Mark", in a section of the Unicode table called "General Punctuation". (I'm still trying to figure out why they are calling this "the preferred character to use for apostrophe".)

Since you appear to be using Perl 5.8, you could do the replacement as follows:

use Encode;  # (added this as an update -- you need it)
...
$_ = decode( 'utf8', $_ );  # make sure perl knows this is utf8 data
s/\x{2019}/'/g;  # put in the old-fashioned apostrophe

And similarly for other "preferred forms" of Unicode punctuation, I'd expect. If, when your replacements are all done, there are no non-ASCII characters left in the data, then printing it to a terminal or whatever should show you what you expect to see. But if any non-ASCII (UTF-8 multibyte) data remains, you need a UTF-8-aware display tool to see these characters as they were meant to be seen.
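
To make that a bit more concrete, here is a rough sketch of how the same idea could be extended to a few other punctuation marks, followed by a check for leftover non-ASCII characters. The sample string, the `%ascii_for` table and the choice of which characters to map are my own assumptions for illustration; adjust them to whatever actually shows up in your data:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: raw UTF-8 bytes containing U+2019, U+201C and U+201D.
my $bytes = "It\xE2\x80\x99s \xE2\x80\x9Cquoted\xE2\x80\x9D";

# Make sure perl knows this is utf8 data.
my $text = decode('utf8', $bytes);

# Assumed mapping from "preferred" Unicode punctuation to plain ASCII.
my %ascii_for = (
    "\x{2018}" => "'",   # left single quotation mark
    "\x{2019}" => "'",   # right single quotation mark
    "\x{201C}" => '"',   # left double quotation mark
    "\x{201D}" => '"',   # right double quotation mark
);
$text =~ s/([\x{2018}\x{2019}\x{201C}\x{201D}])/$ascii_for{$1}/g;

# Check whether anything outside the ASCII range is still in the data.
my $has_non_ascii = $text =~ /[^\x00-\x7F]/ ? 1 : 0;

print "$text\n";
print $has_non_ascii
    ? "non-ASCII characters remain\n"
    : "plain ASCII only\n";
```

If the check reports leftover non-ASCII characters, that is the case where you need the UTF-8-aware display tool.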

In reply to Re: weird character problems by graff
in thread weird character problems by MCS
