Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I am looking to write a script in Perl that can parse an MS Word document and retrieve the built-in and/or custom document properties (Word Count, Title, Subject, etc.), so I can push all this info into a PostgreSQL db. I would also like to be able to change these properties in the database and write them back to the file.

I grabbed a copy of the binary file format (http://realm.progsoc.uts.edu.au/~subtle/wword8.html) but I seem to have a recollection of a module which might do this.

Hopefully somebody has been down this path already and could get me off to a head start.

Replies are listed 'Best First'.
Re: Parsing an MS Word document
by t0mas (Priest) on Dec 11, 2000 at 14:49 UTC
    You could use Win32::OLE to make Word tell you. I did something like this to list all properties of all documents in my documents tree. Performance isn't too good though, about 3 documents per second on my sturdy old P233 with Win2k. Here's my code anyway. Hope it will help.
    #!/usr/bin/perl -w

    # Uses
    use strict;
    use Win32::OLE qw(valof);   # valof is not exported by default
    use Win32::OLE::Variant;
    use Win32::OLE::Const;
    use File::Find;

    # We want to handle collections
    Win32::OLE->Option(_NewEnum => 1);

    # Variables
    use vars qw($MSWord $wd $startdir);

    # Where to start the doc search
    $startdir = 'd:/documents';

    # Create new MSWord object and load constants
    $MSWord = Win32::OLE->new('Word.Application', 'Quit')
        or die "Could not load MS Word";
    $wd = Win32::OLE::Const->Load($MSWord);

    # Find documents
    find(\&getProps, $startdir);

    ######################################
    sub getProps { # Find sub
        # We only want .doc files (match the extension at the end of the name)
        return unless /\.doc$/i && -f $_;

        # No OLE warnings please
        local $Win32::OLE::Warn = 0;

        # Open document
        my $doc = $MSWord->Documents->Open({FileName => $File::Find::name});

        # Exit nicely if we couldn't open doc
        return unless $doc;

        # Print header
        print "\n-----------------------------------------------\n";
        print "Document: $File::Find::name\n";
        print "-----------------------------------------------\n";

        # List document properties
        foreach my $prop (@{$doc->BuiltInDocumentProperties->{_NewEnum}},
                          @{$doc->CustomDocumentProperties->{_NewEnum}}) {
            if (defined $prop->{Name}) {
                print $prop->{Name};
                if (defined $prop->{Value}) {
                    # Variants...
                    if ($prop->{Value} =~ /^Win32::OLE::Variant/) {
                        print ": " . valof($prop->{Value});
                    }
                    else {
                        print ": " . $prop->{Value};
                    }
                }
                else {
                    print ": undefined";
                }
                print "\n";
            }
        }

        # Close document
        $doc->Close({SaveChanges => $wd->{wdDoNotSaveChanges}});
    }


    /brother t0mas
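
    The listing above only reads properties. For the write-back half of the original question, a minimal sketch might look like the following. It assumes that Word's BuiltInDocumentProperties accepts a property name, and that the document has to be flagged as unsaved before Save will write a property change. The helper name set_builtin_property and the command-line handling are made up for illustration; treat it as a starting point, not tested code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Set one built-in property on an open Word document object and save it.
# $doc is expected to be what Documents->Open returns under Win32::OLE.
# Returns 1 on success, 0 if the property could not be looked up.
sub set_builtin_property {
    my ($doc, $name, $value) = @_;

    my $prop = $doc->BuiltInDocumentProperties($name)
        or return 0;

    $prop->{Value} = $value;

    # Assumption: a property change alone may not mark the document
    # as modified, so flag it as unsaved before calling Save.
    $doc->{Saved} = 0;
    $doc->Save;
    return 1;
}

# Usage (Windows with Word installed only):
#   perl setprop.pl d:/documents/report.doc Title "New title"
if ($^O eq 'MSWin32' && @ARGV == 3) {
    require Win32::OLE;
    my ($file, $name, $value) = @ARGV;

    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Could not load MS Word";
    my $doc = $word->Documents->Open({ FileName => $file })
        or die "Could not open $file";

    set_builtin_property($doc, $name, $value)
        or warn "No built-in property named $name\n";

    $doc->Close;
}
```

    Keeping the property update in its own sub means the same routine could be called once per row when values come back out of the database.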
Re: Parsing an MS Word document
by clemburg (Curate) on Dec 11, 2000 at 14:56 UTC

    You might have seen LAOLA, but I doubt it will make you happy today.

    Go with the advice of brother t0mas, and use Win32::OLE. It will make you much happier. There is a tutorial on Win32::OLE online (by Jan Dubois), and there are also good articles on it in previous issues of The Perl Journal.

    Christian Lemburg
    Brainbench MVP for Perl
    http://www.brainbench.com

      ++ brother clemburg!

      Perl Journal Rules!
      To be honest, there don't seem to be many modules out there that interface with Windows apps. There are so many Windows apps that it's not really worth creating a module for each one when OLE automation is straightforward once you know where to start.

      SpreadSheet::Excel is great and it's obvious that the author went to a great deal of work to write the module without using OLE.

      I've seen a lot of nodes here recently requesting help automating Windows tasks that are very suited to Win32::OLE. Maybe we could post some easy examples to the Code Catacombs?

      On second thoughts - I will write a Module Review for Win32::OLE.

      Update: Drat! Rudif has already written a pretty good module review for Win32::OLE. Would it hurt to write another, or would you recommend that I add a follow-on to his node with some more examples?

        How could it hurt to have more informed opinions on such a useful module?

        In terms of priorities, I think examples abound in other references (like TPJ, books, module docs, online tutorials, etc.). Maybe you could add a summary of pointers to such information to your review? It is hard to come up with good examples. Chances are the work has already been done by others. Why not just collect it into one place?

        I think this would be a great service to all of us.

        Anyway, thanks for your offer!

        Christian Lemburg
        Brainbench MVP for Perl
        http://www.brainbench.com

        Win32::OLE will only work with Perl running on a Windows machine. This means it's useless to most Perl programmers, including me.
        I use Spreadsheet::ParseExcel all the time and wish there was an equivalent for MSWord docs.
Re: Parsing an MS Word document
by AgentM (Curate) on Dec 11, 2000 at 09:43 UTC
Re: Parsing an MS Word document
by Albannach (Monsignor) on Dec 11, 2000 at 09:44 UTC
    My but that structure doc looks like fun! I'll be happy to beta test your new module - let me know!

    Until then, and depending on the number of files you're talking about, you might be more quickly rewarded with results by trying to drive Word through OLE automation, which should at least give you access to read and write these fields. I've had good results automating Excel this way (I'll have to try it with Word some time too!).

    --
    I'd like to be able to assign to an luser

Re: Parsing an MS Word document
by hotyopa (Scribe) on Dec 11, 2000 at 07:18 UTC
    My apologies -> I didn't mean to log this one as Anonymous.
It's not going to be that hard to parse an MS Word document
by hotyopa (Scribe) on Dec 15, 2000 at 06:16 UTC
    It seems like the author of LAOLA has written a module called OLE::Storage, which can parse the Micro$oft OLE document structure (implemented as a directory-in-a-file).

    The upside to all this is that these property sets apply not only to Word, but to any OLE-compliant program. Given the support that GNOME are building for OLE, this might just end up being a very useful little module indeed.
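
    As a first step on a UNIX box, before bringing in any OLE module, it is possible to check in plain Perl whether a file is an OLE2 compound document at all. The sketch below reads the fixed header at the start of the file: an 8-byte signature, then at offset 24 the minor version, major version, byte-order mark, sector shift and mini-sector shift, all 16-bit little-endian. The sub names parse_ole_header and read_ole_header are invented for this example. It does not extract any properties; those live in streams inside the container (named "\005SummaryInformation" and "\005DocumentSummaryInformation"), which is the part a module like OLE::Storage would have to handle.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Signature that starts every OLE2 compound document.
my $OLE_MAGIC = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";

# Parse the leading header bytes. Returns a hash ref, or nothing if
# the data does not look like an OLE2 compound document.
sub parse_ole_header {
    my ($bytes) = @_;
    return if length($bytes) < 34;
    return if substr($bytes, 0, 8) ne $OLE_MAGIC;

    # Offsets 24..33: minor version, major version, byte-order mark,
    # sector shift, mini-sector shift (little-endian 16-bit values).
    my ($minor, $major, $bom, $shift, $mini_shift) =
        unpack 'v5', substr($bytes, 24, 10);

    return {
        major_version    => $major,
        minor_version    => $minor,
        byte_order       => $bom,
        sector_size      => 1 << $shift,
        mini_sector_size => 1 << $mini_shift,
    };
}

# Read the first 512 bytes of a file and hand them to the parser.
sub read_ole_header {
    my ($path) = @_;
    my $buf = '';
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    read($fh, $buf, 512);
    close $fh;
    return parse_ole_header($buf);
}

# Usage: perl olecheck.pl file1.doc file2.doc ...
for my $path (@ARGV) {
    my $h = read_ole_header($path);
    print "$path: ",
        ($h ? "OLE2, $h->{sector_size}-byte sectors" : "not an OLE2 file"),
        "\n";
}
```

    A check like this could be used in an overnight batch run to skip files that merely carry a .doc extension before handing the real ones to a proper structured-storage parser.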

Re: Parsing an MS Word document
by hotyopa (Scribe) on Dec 12, 2000 at 11:14 UTC
    There are only a couple of issues I have with the Win32::OLE path:-
    1. I've tried going down the OLE path from within an Access database. It's time for a coffee break every time you want to read a set of document properties.
    2. I want to batch process these files overnight on a UNIX server, logging the document properties to a database as I go.

    I am undecided now whether I should bother at all. Back to the Visual Basic maybe :(

Re: Parsing an MS Word document
by fongsaiyuk (Pilgrim) on Dec 11, 2000 at 09:33 UTC
    Is it absolutely essential that you parse MSWord documents? How about MS Excel? Spreadsheet::ParseExcel provides some essential capabilities.

    You might be able to munge the Word documents into Excel by doing a 'Save As' from Word to Excel... However, I'm not certain whether the scope of your project would allow such a thing.

    A quick trawl through CPAN yielded no hits for 'Word' or 'MSWord', so you might consider a life of fame and fortune and be the first lucky hacker to write such a parsing module!! :) :)

      It has to be MS Word I'm afraid. I'm writing a document manager for our Japanese translation department and they store a lot of info in these doc properties.

Re: Parsing an MS Word document
by hotyopa (Scribe) on Dec 14, 2000 at 08:57 UTC

    After much digging about, it would seem that I am getting a little bit closer to a solution.

    It seems that there are some people over at the GNOME project writing a C library to handle OLE structured storage. The library is called libole2.

    Since Perl and C get on so well together, it would seem that I might be able to do it after all, with much less work than I originally thought. I will keep posting as I get further into it.

    Anton