Re: Parsing an MS Word document
by t0mas (Priest) on Dec 11, 2000 at 14:49 UTC
You could use Win32::OLE to make Word tell you. I did something like this to list all properties of all documents in my documents tree. Performance isn't too good though: about 3 documents per second on my sturdy old P233 with Win2k. Here's my code anyway. Hope it helps.
#!/usr/bin/perl -w
# Uses
use strict;
use Win32::OLE qw(valof); # valof is not exported by default
use Win32::OLE::Variant;
use Win32::OLE::Const;
use File::Find;
# We want to handle collections
Win32::OLE->Option(_NewEnum => 1);
# Variables
use vars qw($MSWord $wd $startdir);
# Where to start the doc search
$startdir='d:/documents';
# Create new MSWord object and load constants
$MSWord=Win32::OLE->new('Word.Application','Quit') or
die "Could not load MS Word";
$wd=Win32::OLE::Const->Load($MSWord);
# Find documents
find(\&getProps,$startdir);
######################################
sub getProps {
# Find sub
# We only want .doc files
return unless /\.doc$/i && -f;
# No OLE warnings please
local $Win32::OLE::Warn = 0;
# Open document
my $doc = $MSWord->Documents->Open({FileName=>$File::Find::name});
# Exit nicely if we couldn't open doc
return unless $doc;
# Print header
print "\n-----------------------------------------------\n";
print "Document: $File::Find::name\n";
print "-----------------------------------------------\n";
# List document properties
foreach my $prop (@{$doc->BuiltInDocumentProperties->{_NewEnum}},
@{$doc->CustomDocumentProperties->{_NewEnum}}) {
if (defined $prop->{Name}) {
print $prop->{Name};
if (defined $prop->{Value}) {
# Variants...
if ($prop->{Value}=~/^Win32::OLE::Variant/) {
print ": ".valof $prop->{Value};
} else {
print ": ".$prop->{Value};
}
} else {
print ": undefined"
}
print "\n";
}
}
# Close document
$doc->Close({SaveChanges=>$wd->{wdDoNotSaveChanges}});
}
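A side note on the file filter in getProps: an unanchored pattern such as /\.doc/ also matches names that merely contain ".doc" somewhere in the middle. Below is a minimal sketch of an anchored, case-insensitive check using core Perl only. The helper name is_word_doc and the sample file names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true only if the name ends in ".doc" (any case).
sub is_word_doc {
    my ($name) = @_;
    return $name =~ /\.doc\z/i ? 1 : 0;
}

# Made-up sample names to show what the pattern accepts and rejects.
for my $name (qw(report.doc REPORT.DOC notes.doc.bak my.docs.txt readme.txt)) {
    printf "%-14s %s\n", $name, is_word_doc($name) ? 'match' : 'skip';
}
```

Inside a File::Find callback the same test would run against $_, which holds the bare file name.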
/brother t0mas
Re: Parsing an MS Word document
by clemburg (Curate) on Dec 11, 2000 at 14:56 UTC
You might have seen LAOLA,
but I doubt it will make you happy today.
Go with the advice of brother t0mas, and use Win32::OLE.
It will make you much happier. There is a tutorial on Win32::OLE
online (by Jan Dubois), and there are also good articles
on it in previous issues of The Perl Journal.
Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com
|
|
++ brother clemburg!
Perl Journal Rules!
To be honest, there don't seem to be many modules out there that interface with Windows apps. There are so many Windows apps that it's not really worth creating a module for each one when OLE automation is straightforward once you know where to start.
SpreadSheet::Excel is great and it's obvious that the author went to a great deal of work to write the module without using OLE.
I've seen a lot of nodes here recently requesting help automating Windows tasks that are very suited to Win32::OLE. Maybe we could post some easy examples to the Code Catacombs?
On second thoughts - I will write a Module Review for Win32::OLE.
Update: Drat! Rudif has already written a pretty good module review for Win32::OLE. Would it hurt to write another, or would you recommend that I add a follow-on to his node with some more examples?
|
|
How could it hurt to have more informed opinions
on such a useful module?
In terms of priorities, I think examples abound
in other references (like TPJ, books, module docs,
online tutorials, etc.). Maybe you could add
a summary of pointers to such information to your
review? It is hard to come up with good examples.
Chances are the work has already been done by others.
Why not just collect it into one place?
I think this would be a great service to all of us.
Anyway, thanks for your offer!
Christian Lemburg
Brainbench MVP for Perl
http://www.brainbench.com
|
|
Win32::OLE will only work with Perl running on a Windows machine. This means it's useless to most Perl programmers, including me.
I use Spreadsheet::ParseExcel all the time and wish there was an equivalent for MSWord docs.
Re: Parsing an MS Word document
by AgentM (Curate) on Dec 11, 2000 at 09:43 UTC
Re: Parsing an MS Word document
by Albannach (Monsignor) on Dec 11, 2000 at 09:44 UTC
My, but that structure doc looks like fun! I'll be happy to
beta test your new module - let me know!
Until then, and depending on the number of files you're
talking about, you might be more quickly rewarded with
results by trying to drive Word through OLE automation, which should at least
give you access to read and write these fields. I've had
good results automating Excel this way (I'll have to try it
with Word some time too!).
--
I'd like to be able to assign to an luser
Re: Parsing an MS Word document
by hotyopa (Scribe) on Dec 11, 2000 at 07:18 UTC
My apologies -> I didn't mean to log this one as Anonymous.
It's not going to be that hard to parse an MS Word document
by hotyopa (Scribe) on Dec 15, 2000 at 06:16 UTC
It seems like the author of LAOLA has written a module called OLE::Storage, which can parse the Micro$oft OLE document structure (implemented as a directory-in-a-file).
The upside to all this is that these property sets apply not only to Word, but to any OLE compliant program. Given the support that GNOME are building for OLE, this might just end up being a very useful little module indeed.
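As a rough illustration of the "directory-in-a-file" idea: every OLE compound file starts with a fixed 8-byte signature, and its header records the sector size as a power of two. The sketch below only inspects that header with core Perl. It does not use OLE::Storage, and the sub names ole_sector_size and file_sector_size are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# 8-byte signature at the start of every OLE compound (structured storage) file.
my $OLE_MAGIC = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";

# Hypothetical helper: returns the sector size in bytes if the data
# looks like an OLE compound file header, or undef otherwise.
sub ole_sector_size {
    my ($header) = @_;
    return unless length($header) >= 32;
    return unless substr($header, 0, 8) eq $OLE_MAGIC;
    # Sector shift: 16-bit little-endian value at offset 30.
    my $shift = unpack 'v', substr($header, 30, 2);
    return 1 << $shift;
}

# Hypothetical helper: read the first 512 bytes of a file and check them.
sub file_sector_size {
    my ($path) = @_;
    open my $fh, '<', $path or return;
    binmode $fh;
    my $got = read($fh, my $buf, 512);
    close $fh;
    return unless defined $got;
    return scalar ole_sector_size($buf);
}

# Usage: perl script.pl some.doc
if (@ARGV) {
    my $size = file_sector_size($ARGV[0]);
    print defined $size
        ? "OLE compound file, sector size $size\n"
        : "Not an OLE compound file\n";
}
```

This only tells you whether a file is worth handing to a real parser. Walking the directory entries and property sets is where a module like OLE::Storage would come in.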
Re: Parsing an MS Word document
by hotyopa (Scribe) on Dec 12, 2000 at 11:14 UTC
Re: Parsing an MS Word document
by fongsaiyuk (Pilgrim) on Dec 11, 2000 at 09:33 UTC
Is it absolutely essential that you parse MSWord documents?
How about MS Excel? Spreadsheet::ParseExcel provides some essential capabilities.
You might be able to munge the Word documents into Excel by doing a 'Save As' from
Word to Excel... However, depending on the scope of your project, I'm not certain
such a thing would be possible.
A quick trollop through CPAN yielded no hits for 'Word' or 'MSWord' so you might consider a life
of fame and fortune and be the first lucky hacker
to write such a Parsing Module!! :) :)
Re: Parsing an MS Word document
by hotyopa (Scribe) on Dec 14, 2000 at 08:57 UTC
After much digging about, it would seem that I am getting a little bit closer to a solution.
It seems that there are some people over at the GNOME project writing a C library to handle OLE structured storage. The library is called libole2.
Since Perl and C get on so well together, it would seem that I might be able to do it after all, with much less work than I originally thought. I will keep posting as I get further into it.
Anton