in reply to Extracting information from a MS WORD Document

What platform are you running Perl on? If it's on a Windows machine, and MS Word is installed, you can do this fairly easily by having Perl start Word and control it via OLE. I can show you some example code if it's helpful.

On the other hand, if you need to run the script on a UNIX/Linux box you are probably out of luck in native Perl. As far as I know there is no Perl module which can parse Word .docs. The format is *very* hairy.

You might have success calling word2x from Perl and parsing the output.
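
Here is a rough sketch of that route: run a converter as a child process and scan its output line by line. The helper name `grep_converted` is made up for illustration, and the converter command line is left to the caller because I have not checked which word2x options send plain text to standard output; consult its man page for that.

```perl
use strict;
use warnings;

# Hypothetical helper: run an external converter and return the lines of
# its standard output that match $pattern.
# $cmd is an array ref holding the full command line (program + arguments).
sub grep_converted {
    my ($cmd, $pattern) = @_;

    # List-form pipe open: the command is run without a shell, so file
    # names containing spaces need no extra quoting.
    open(my $fh, '-|', @$cmd) or die "Cannot run $cmd->[0]: $!";

    my @hits;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;                 # strip LF or CRLF line endings
        push @hits, $line if $line =~ $pattern;
    }

    # close() on a pipe returns false if the child exited with an error
    close($fh) or warn "$cmd->[0] exited with status $?\n";
    return @hits;
}
```

Pass the word2x invocation (program name, options, input file) as the first argument and a `qr//` pattern as the second; the matching lines come back as a list.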

Re^2: Extracting information from a MS WORD Document
by Saran2 (Novice) on Jan 14, 2008 at 20:13 UTC
    Hi, thanks for your reply. I am running it on a Windows machine. Please show or send me the example code. Thanks, Saran
      Here is some code that *might* work for you. I say *might* because I do not have win32 perl installed where I am, and my download is stuck at 1% ;) So take this for what it's worth; this is a starting point for your code. This is cribbed together from some win32 code I have in my home directory.
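      use strict;
      use warnings;
      use File::Spec;
      use Win32::OLE;
      use Win32::OLE::Enum;

      # $text will hold the text of all the Word docs
      my $text = "";

      # this is how we start Word from Perl ("Quit" closes Word when $word goes away)
      my $word = Win32::OLE->new("Word.Application", "Quit")
          or die "Cannot start Word: ", Win32::OLE->LastError();

      # for every .doc file in the current directory
      foreach my $filename (<*.doc>) {
          # tell Word to open the file we just found; Word resolves relative
          # names against its own working directory, so pass a full path
          my $doc = $word->Documents->Open(File::Spec->rel2abs($filename))
              or next;

          # get an object representing all the paragraphs in the doc
          my $paragraphs = Win32::OLE::Enum->new($doc->Paragraphs);

          # for every paragraph...
          while (defined(my $paragraph = $paragraphs->Next)) {
              # append the paragraph text to $text
              $text .= $paragraph->{Range}->{Text};
          }

          # close the document without saving changes
          $doc->Close(0);
      }

      # Word ends each paragraph with a bare \r, not \r\n
      foreach my $line (split /\r/, $text) {
          # match against $line explicitly; a bare /some pattern/ tests $_,
          # which this loop never sets
          if ($line =~ /some pattern/) {
              # ..do something..
          }
      }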