fionbarr has asked for the wisdom of the Perl Monks concerning the following question:

I've been using Win32::OLE to read MS Word docs, using regular expressions to look for words/phrases. I found that by opening the file directly with binmode and 'slurping' the whole file, I can still find the same words/phrases. Is there any downside to doing it this way? (I don't need to read the document, just find stuff.)
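
For reference, the approach described above might look roughly like the following. This is only a sketch of what is being asked about, not the poster's actual code; the helper names `slurp_raw` and `raw_contains` are made up for illustration.

```perl
use strict;
use warnings;

# Read a whole file as raw bytes: no newline translation, no decoding.
# (slurp_raw is a hypothetical helper name.)
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    local $/;               # undef record separator: read everything at once
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# True if the raw bytes contain the phrase as plain single-byte text.
# (raw_contains is a hypothetical helper name.)
sub raw_contains {
    my ($bytes, $phrase) = @_;
    return index($bytes, $phrase) >= 0;
}
```

Usage would be something like `print "found\n" if raw_contains(slurp_raw($file), 'some phrase');`. Note that this only sees the phrase if it is stored as contiguous single-byte characters.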

Replies are listed 'Best First'.
Re: using binmode and slurp with Winword *.doc
by MidLifeXis (Monsignor) on Nov 03, 2009 at 17:42 UTC

    Just off the top of my head - may be valid or total bunk.

    • If I remember correctly, Word used to have a feature called "fast save" (or "quick save", something like that) which appended changes instead of rewriting the entire file. Text that was deleted from the document could still be sitting in the file, which could give you some false positives.
    • UTF-16 issues?
    • docx format (any encryption or compression going on?)
    • Encrypted documents (which I would think you would have problems with anyway)
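
To illustrate the UTF-16 point: if a document stores some of its text as UTF-16LE (an assumption here, to be verified against your own files), a plain byte search for an ASCII phrase will miss it, because every character is followed by a zero byte. A minimal sketch that tries both encodings, using the core Encode module; `find_phrase_in_bytes` is a hypothetical name:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Look for $phrase in raw bytes, first as single-byte cp1252 text and then
# as UTF-16LE. Returns the name of the encoding that matched, or nothing.
# (Which encodings a given .doc really uses is an assumption to verify.)
sub find_phrase_in_bytes {
    my ($bytes, $phrase) = @_;
    for my $enc ('cp1252', 'UTF-16LE') {
        my $needle = encode($enc, $phrase);
        return $enc if index($bytes, $needle) >= 0;
    }
    return;
}
```

This does nothing about fast-saved or encrypted files; it only covers the encoding question.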

    --MidLifeXis

      Interesting point about docx. I don't think I've tried that. The procedure dies when I try to open password-protected files, but that is not a problem; I won't have them in production.

      Just a small note: docx, like all those other legacy+X extensions from the newer MS Office versions, is a ZIP file containing XML and some helper files. Perl can unzip, Perl can do really weird things with XML, so docx and friends are easier to handle than the classic binary garbage formats.
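
A rough sketch of what that could look like with only core modules. It assumes the main text lives in the archive member `word/document.xml` inside `<w:t>` elements, and it pulls the text out with a regex rather than a real XML parser, so treat it as a starting point only. `docx_xml` and `docx_text` are hypothetical names.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Pull the main document part out of a .docx (a ZIP archive).
sub docx_xml {
    my ($path) = @_;
    my $xml;
    unzip $path => \$xml, Name => 'word/document.xml'
        or die "unzip failed for $path: $UnzipError";
    return $xml;    # raw bytes, normally UTF-8 encoded XML
}

# Crude text extraction: keep the content of <w:t> elements and turn each
# paragraph end into a newline. Formatting runs are joined back together.
sub docx_text {
    my ($xml) = @_;
    my $text = '';
    while ($xml =~ m{(</w:p>)|<w:t(?:\s[^>]*)?>([^<]*)</w:t>}g) {
        $text .= defined $1 ? "\n" : $2;
    }
    # Decode the five predefined XML entities in a single pass.
    my %ent = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');
    $text =~ s/&(lt|gt|quot|apos|amp);/$ent{$1}/g;
    return $text;
}
```

Because the text runs are concatenated before searching, a phrase containing a bolded word still matches, e.g. `docx_text(docx_xml($file)) =~ /some phrase/`.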

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: using binmode and slurp with Winword *.doc
by ikegami (Patriarch) on Nov 03, 2009 at 18:02 UTC
    I imagine you'd face the following problems:
    • Character encoding issues
    • Document encoding issues
    • False positives from non-text components of the document (e.g. if I search for the word "bold", will I find matches that aren't in the document's text?)
    • False negatives from text interrupted by formatting codes (e.g. can you match a sentence containing a bolded word?)
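
One way to judge the false-positive risk on your own files is to look at the bytes around each hit instead of only counting matches. A small sketch, assuming the file has already been slurped into `$bytes`; `hits_with_context` is a hypothetical helper:

```perl
use strict;
use warnings;

# Return [offset, context] pairs for every occurrence of $phrase in $bytes.
# Non-printable bytes in the context are shown as dots so that hits sitting
# in binary or structural data are easy to spot by eye.
sub hits_with_context {
    my ($bytes, $phrase, $radius) = @_;
    $radius = 20 unless defined $radius;
    return unless length $phrase;       # avoid looping on an empty phrase
    my @hits;
    my $pos = 0;
    while (($pos = index($bytes, $phrase, $pos)) >= 0) {
        my $start = $pos > $radius ? $pos - $radius : 0;
        my $len   = ($pos - $start) + length($phrase) + $radius;
        my $ctx   = substr($bytes, $start, $len);
        $ctx =~ s/[^\x20-\x7e]/./g;
        push @hits, [$pos, $ctx];
        $pos += length $phrase;
    }
    return @hits;
}
```

Something like `printf "%8d  %s\n", @$_ for hits_with_context($bytes, 'bold');` then lists each offset with its surroundings. It cannot detect false negatives; for those you would need to compare against the text Word itself reports.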