merrymonk has asked for the wisdom of the Perl Monks concerning the following question:

I am developing an application that replaces character strings in a Word document.
The code below does what I need. The replacement text is set up in the lines $doll_str{xx<n>} = replacement string.
However, I would like the replacement string to include characters other than normal alphanumerics, such as
those found in the Wingdings font and in particular the ‘tick’.
More in hope than expectation, I tried copying the tick mark from a Word document into Perl, but this gave just a double quote.
How can this be done?

use strict;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';

# set $filepath to the full path to the directory where the file to be
# changed, test.doc, is stored
my $filepath = 'C:/tmp';
my $oldfile  = $filepath . '/test.doc';
my $newfile  = $filepath . '/test1.doc';
my ($oldtext, $newtext);
my (%doll_str, $doll_item, $search_res, $replace_res, $exec_res);

$doll_str{xx1} = 'fred 1';
$doll_str{xx2} = 'orange 2';
$doll_str{xx3} = 'green 3';
$doll_str{xx4} = 'blue 4';
$doll_str{xx5} = 'black 5';

my $word = Win32::OLE->GetActiveObject('Word.Application')
    || Win32::OLE->new('Word.Application', 'Quit');
my $doc = $word->Documents->Open("$oldfile");

# is application visible 0=no 1=yes
$word->{visible} = 0;

my $search  = $doc->Content->Find;
my $replace = $search->Replacement;
$search->{Text}  = $oldtext;
$replace->{Text} = $newtext;
$search->Execute({Replace => wdReplaceAll});

foreach $doll_item (sort {$a cmp $b} keys %doll_str) {
    $search->{Text}  = $doll_item;
    $replace->{Text} = $doll_str{$doll_item};
    $exec_res = $search->Execute({Replace => wdReplaceAll});
    print "item <$doll_item> exec <$exec_res>\n";
}

# save word file
$word->ActiveDocument->SaveAs($newfile);

# close word file
$doc->Close();
$word->Quit();

Re: Replacing none alpha/numeric characters in a Word document
by bart (Canon) on Dec 17, 2008 at 10:21 UTC
    I can't tell whether you do or don't want these special characters. Either way, they are in Microsoft's extension to ISO-Latin-1, known as "CODE PAGE 1252".

    If you do want these characters, you can look up their codes at the character codes list on unicode.org's website, and simply insert them with Perl. (These extension characters are all in the range 0x80-0x9F.)

    If you don't: you should realize that Word has a tendency to automatically replace simple ASCII quotes in documents with so-called "smart quotes", using a heuristic algorithm that considers the neighboring characters to decide whether a left or a right quote should be inserted. If you want plain ASCII, you have to first turn this off.

    HTH.

      I am sorry I was not clear!
      I DO want to use these special characters, so I have now looked in the reference you gave me.
      As I could not see a tick referenced, I tried the code for 1/2 as one character.
      This is 0x00BD.
      Using = 0x00BD for xx5, I got 189 in the modified Word document (the decimal value for hex BD).
      Using = '0x00BD' for xx, I got 0x00BD.
      I can see why this happened.
      Therefore what do I have to do to get the 1/2 as a single character?
      Also how do I get to characters in other 'fonts' such as Wingdings?
Re: Replacing none alpha/numeric characters in a Word document
by graff (Chancellor) on Dec 18, 2008 at 01:52 UTC
    Have you tried using Word manually to create a doc file with the desired characters in it? If you do that and inspect the contents of the doc file, you'll probably learn what you need to know. I'm no expert on this, but I have had occasion to inspect hex dumps of Word doc files that contained non-ASCII character data. It's very strange.

    If you are mixing "smart quotes" and "wingdings", you may actually end up with UTF-16LE encoding of Unicode in the doc file. Maybe the OLE module stuff will handle the gory details for you so you don't need to worry about it too much, and all you might need to do is provide the Unicode code-points for the characters you want (that is, specify them with hex notation, e.g.: "\x{201C}...\x{201D}" for a string surrounded by curly double quotes).

    I'm just guessing at that, but if nothing else has worked yet, it's worth a try. In case it helps, there's a handy little tool here that I copied from a posting on the perl-unicode mail list, to search for various characters in the unicode table.
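
The difference between a number, a string of digits, and a single character is what produced "189" and "0x00BD" earlier in the thread. The following is a minimal sketch of that difference in plain Perl; it does not touch Word, and it only assumes (as the suggestion above does) that Win32::OLE passes Perl character strings through to Word unchanged.

```perl
use strict;
use warnings;

# Three ways of writing "BD" that look similar but mean different things.
my $as_number = 0x00BD;        # a numeric literal: the number 189
my $as_string = '0x00BD';      # six ordinary characters: 0, x, 0, 0, B, D
my $as_char   = "\x{BD}";      # one character: U+00BD VULGAR FRACTION ONE HALF
my $via_chr   = chr(0x00BD);   # the same single character, built from the number

# Curly double quotes around a word, in the "\x{201C}...\x{201D}" style.
my $quoted = "\x{201C}fred\x{201D}";

binmode STDOUT, ':encoding(UTF-8)';   # avoid "Wide character" warnings
print  "number: $as_number\n";
print  "string: $as_string\n";
printf "char:   %s (length %d)\n", $as_char, length $as_char;
print  "quoted: $quoted\n";
```

Note that the escape only works inside double quotes and with a backslash; in single quotes Perl keeps the text literally.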

      Using "/x{201C}.../x{201D}" did work as I wanted – many thanks.
      In the latest version of Word I used the add symbol function to add a tick. I then highlighted the tick and Word suggested this was Times font character.
      As I did not know how to go via the hex dump route, I set up a Word file, using the Times font, with lines running from
      001 A001
      to
      256 A256
      The Perl code then had lines
      $doll_str{A001} = "/x{01}";
      to
      $doll_str{A256} = "/x{0100}";
      which I used to change the Word document.
      I then looked at the modified document but I could not find the tick symbol.
      I suspect that this means that the tick is not in the Times set.
      Can anyone tell me if this is true?
      If so I would appreciate help in moving forward in my quest to be able to modify a Word document to
      add ticks and any other characters that are available with the various fonts that can be used with Word.
        "Times" is a font, not a character encoding. The font says what the characters look like when displayed or printed; the encoding says which bit patterns (numeric values) are mapped to which characters.

        Your lines ranging from "001 A001" to "256 A256" don't make any sense to me. What was the point of that? Note that "\x{01}" is an ASCII control character, as is everything up through "\x{1f}", and also "\x{7f}" -- these will display nothing (or could make your display do weird things). Also, please be careful that you don't confuse "backslash" \ with "forward slash" / -- these are very different things.

        If using the "\x{....}" works in terms of allowing the perl script to insert any Unicode character you want into the doc file, then what more is there to worry about? That's the way to go. You just need to be able to find the hex-numeric unicode code-point value for the characters you want to insert. (That's why I pointed to that "handy tool", to provide a way to search the unicode character table.)

        For example, as you should have figured out by now, "\x{201C}" is the "left double quotation mark" and "\x{201D}" is the "right double quotation mark", regardless of whether you are using a Times font, or Courier, or Arial, or Helvetica, or ...
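
If the code point is not known, Perl's core charnames module can map between Unicode names and code points. The sketch below looks up U+2713 CHECK MARK, which is one Unicode character that looks like a tick. The `xx5` key is just borrowed from the original script for illustration, and whether Word actually displays the glyph depends on the font in use containing it, which is not checked here.

```perl
use strict;
use warnings;
use charnames ':full';

# Look up a code point from its official Unicode name ...
my $tick_code = charnames::vianame('CHECK MARK');
# ... and a name from a code point.
my $name = charnames::viacode(0x201C);

# Illustrative entry for a replacement table like %doll_str.
my %doll_str;
$doll_str{xx5} = chr($tick_code);   # same character as "\x{2713}"

printf "CHECK MARK is U+%04X\n", $tick_code;
print  "U+201C is $name\n";
```

The same character can also be written directly in a double-quoted string as "\N{CHECK MARK}" when charnames is loaded.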

        As for going "via the hex dump route", if you have "unix tools for windows", you can check out either "od" or "xxd" (though I am sure there are hexdump tools that are "native" to windows, as well). Or you can whip up something pretty easily in perl -- here's a basic/simple hexdump tool:

        #!/usr/bin/perl
        use strict;
        use warnings;

        die "Usage: $0 file_name\n" unless @ARGV == 1 and -f $ARGV[0];

        open( my $in, "<", $ARGV[0] ) or die "Cannot open $ARGV[0]: $!\n";
        binmode $in;

        my $offset = 0;
        my $buf;
        while ( my $n = read( $in, $buf, 16 ) ) {
            # show printable bytes as-is and everything else as "."
            ( my $text = $buf ) =~ s/[^[:print:]]/./g;
            printf( "%08x: %-47s %s\n",
                $offset,
                join( " ", map { sprintf( "%02x", $_ ) } unpack( "C*", $buf ) ),
                $text );
            $offset += $n;
        }

        (But really, learning to use tools like "od" or "xxd" is better.)
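
When reading such a dump, it helps to know which byte pattern to look for. As a rough sketch, assuming the doc file stores the text in question as UTF-16LE (suggested earlier in the thread as a possibility, not something guaranteed for every part of a .doc file), the core Encode module can show the bytes a given character would produce. The helper name `utf16le_hex` is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-16LE bytes of a string as space-separated hex pairs.
sub utf16le_hex {
    my ($text) = @_;
    my $bytes = encode( 'UTF-16LE', $text );
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

print utf16le_hex("\x{201C}"), "\n";   # left double quotation mark
print utf16le_hex("\x{2713}"), "\n";   # check mark
```

Because the encoding is little-endian, the low byte comes first, so U+201C would appear in a dump as the pair 1c 20.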