MCS has asked for the wisdom of the Perl Monks concerning the following question:

I am using antiword to parse a word document:

open ("ANTIWORD", "-|", "/usr/local/bin/antiword", "-mUTF-8.txt", "$filename") or die "Couldn't fork: $!\n";

Unfortunately, certain characters come up as "<E2><80><99>" for ' when I pipe the output to more. If I just output it, I get " '". I obviously don't want "<E2><80><99>" or " '" (I don't want the extra space) to appear in my database, so I'd like to replace it with '. I tried the following:

$line =~ s/<E2><80><99>/'/g;

But it doesn't work. Also if I redirect the output to a file, instead of "<E2><80><99>" I get "?~@~Y" but

$line =~ s/\?~@~Y/'/g;

Doesn't work either... any ideas as to what is causing it and how I can fix it?

Re: weird character problems
by graff (Chancellor) on Jan 29, 2004 at 05:38 UTC
    Is that supposed to be a Perl "open()" statement that you posted there? Why do you put quotation marks around the file handle? (I've just always seen a bareword in that position, not a quoted (literal) string. Personally, I'd get rid of the quotes around that first arg to open.)

    I'm not familiar with the command-line usage of "antiword". Do you know why you are specifying the option "-mUTF-8.txt"? It looks like a file name, but the intention is not clear.

    In any case, it's true that the 3-byte sequence expressed in hex notation as E2 80 99 is in fact the utf8 form of the unicode code point U+2019, the "Right Single Quotation Mark", in a section of the unicode table called "General Punctuation". (I'm still trying to figure out why they are calling this "the preferred character to use for apostrophe".)

    Since you appear to be using Perl 5.8, you could do the replacement as follows:

    use Encode; # (added this as an update -- you need it)
    ...
    $_ = decode( 'utf8', $_ ); # make sure perl knows this is utf8 data
    s/\x{2019}/'/g; # put in the old-fashioned apostrophe

    And similarly for other "preferred forms" of unicode punctuation, I'd expect. If, when your replacements are all done, there are no non-ASCII characters left in the data, then printing it to a terminal or whatever should show you what you expect to see. But if any non-ASCII (utf8 multibyte) data remains, you need a utf8-aware display tool to see these characters as they were meant to be seen.
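
For anyone who wants to try the decode-then-substitute approach on its own, here is a minimal sketch. The helper name `ascii_apostrophes` and the sample string are made up for illustration, and re-encoding to UTF-8 before storage is an assumption about what the database expects, not something stated in the thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the thread): takes raw UTF-8 bytes and
# returns UTF-8 bytes with U+2019 replaced by an ASCII apostrophe.
sub ascii_apostrophes {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # bytes -> Perl characters
    $text =~ s/\x{2019}/'/g;              # RIGHT SINGLE QUOTATION MARK -> '
    return encode('UTF-8', $text);        # characters -> bytes again
}

# Sample input (an assumption): the word "don’t" as raw UTF-8 bytes.
my $raw = "don\xE2\x80\x99t";
print ascii_apostrophes($raw), "\n";
```

Other multibyte characters pass through unchanged, since only U+2019 is touched between the decode and the encode.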
      I'm sure you mean binmode ANTIWORD, ":utf8"; :) instead of that Encode business
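
A rough sketch of the layer-based alternative: let the filehandle do the decoding, so every line read is already character data. To keep this runnable without antiword installed, an in-memory filehandle stands in for the pipe (that stand-in and the sample text are assumptions). The sketch also uses `:encoding(UTF-8)` rather than the `:utf8` layer named above; it is the stricter, validating layer, and `:utf8` could be swapped in.

```perl
use strict;
use warnings;

# Stand-in for antiword's output (an assumption): the UTF-8 bytes of
# "it’s" followed by a newline.
my $bytes = "it\xE2\x80\x99s\n";

open(my $fh, '<', \$bytes) or die "Couldn't open in-memory handle: $!\n";
binmode($fh, ':encoding(UTF-8)');   # decode UTF-8 as lines are read

my @lines;
while (my $line = <$fh>) {
    chomp $line;
    $line =~ s/\x{2019}/'/g;        # data is already decoded, so match U+2019
    push @lines, $line;
}
close $fh;

print "$_\n" for @lines;
```

With the real pipe, the same `binmode` call would presumably go right after the `open` shown in the question.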


      Re: I'm still trying to figure out why they are calling this "the preferred character to use for apostrophe"

      That’s what I'm using in the first contraction in this sentence. Zoom in the browser window and see the difference in detail.

      I specified the "-mUTF-8.txt" as a command line argument to use the UTF-8 mapping of keys. I'm using it because it makes other characters in the word document display properly before I parse it.

Re: weird character problems
by Roger (Parson) on Jan 29, 2004 at 04:57 UTC
    The "<E2><80><99>" etc you are getting look like hex numbers. So you could try this instead:
    $line =~ s/[\xE2\x80\x99]/'/g; # convert each of E2,80,99 into '
    # or
    $line =~ s/\xE2\x80\x99/'/g; # convert hex sequence into '
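
The two substitutions behave differently, which a small test makes visible. This is only a sketch with a made-up sample string, and it assumes `$line` still holds raw, undecoded bytes; once the data has been decoded, the pattern to match would be `\x{2019}` instead.

```perl
use strict;
use warnings;

# Sample input (an assumption): "don’t" as raw, undecoded UTF-8 bytes.
my $line = "don\xE2\x80\x99t";

# Character class: each of the three bytes matches on its own,
# so each one is replaced separately.
(my $by_class = $line) =~ s/[\xE2\x80\x99]/'/g;

# Byte sequence: the three bytes match as one unit and are replaced once.
(my $by_seq = $line) =~ s/\xE2\x80\x99/'/g;

print "$by_class\n";   # don'''t
print "$by_seq\n";     # don't
```

The character class would also rewrite any other multibyte character that happens to contain one of those bytes, so the sequence form is the safer of the two.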

      Thanks... the second one works great. I figured they were special characters, I just didn't know what to do with them.