I have about 1000 Word documents that are semi-actively modified using Word 2003. They are all text-based (no pictures, graphs or the like) and contain, among other pieces of data, names of people. I need to find each name and the page number it appears on, and then create an index appended to the end of the concatenation of all the .doc files.

I have written a Perl script using Win32::OLE and have had some success, but Win32::OLE seems poorly documented and somewhat flaky. The script works, more or less, but I can't fix the last remaining bugs, which I believe are related to saving the file and to opening and closing the documents. Is there a better way of doing this?

I would have preferred to keep the finished document as a Word document, but perhaps this is problematic. Would it be better to extract the text from the Word docs and convert it to a PDF file? I can't determine whether it is possible to find some text in a PDF file and get the page number it is on using PDF::API2, CAM::PDF or PDF::Core. Alternatively, would plain text be a better choice?

I can provide the Perl script I have using Win32::OLE if it would help. Thanks.
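To make the plain-text option concrete, here is a minimal sketch of the indexing step only. It assumes the text has already been extracted and that pages are separated by form feeds (`\f`); that separator is a convention chosen for the example, not something Word is guaranteed to produce for automatic page breaks. It also assumes the names to look for are known in advance. The helper names `build_index` and `format_index` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: map each name to the list of pages it appears on.
# Assumes $text is plain text with pages separated by form feeds (\f)
# and @names is a known list of names to search for.
sub build_index {
    my ($text, @names) = @_;
    my %index;
    # A limit of -1 keeps trailing empty pages so page numbers stay aligned.
    my @pages = split /\f/, $text, -1;
    for my $i (0 .. $#pages) {
        for my $name (@names) {
            # \Q...\E quotes any regex metacharacters in the name;
            # \b only works for names that start and end with a word character.
            push @{ $index{$name} }, $i + 1
                if $pages[$i] =~ /\b\Q$name\E\b/;
        }
    }
    return \%index;
}

# Hypothetical helper: render the index as "Name: p1, p2" lines, sorted by name.
sub format_index {
    my ($index) = @_;
    return join '',
        map { "$_: " . join(', ', @{ $index->{$_} }) . "\n" }
        sort keys %$index;
}

my $text = "Meeting with John Smith.\fJane Doe replied.\fJohn Smith and Jane Doe met.";
print format_index(build_index($text, 'John Smith', 'Jane Doe'));
```

With the three-page sample string, this lists Jane Doe on pages 2 and 3 and John Smith on pages 1 and 3. The page numbers are only as good as the page separators in the extracted text, so they would match Word's pagination only if the extraction step preserves it.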

In reply to Indexing of Word documents by axiomcrs
