Saran2 has asked for the wisdom of the Perl Monks concerning the following question:

Hi All, I have to extract the following information from an MS Word document with the .doc extension: "Test Script Name :Verify Instrument Group Management Display Changes" and "Script Description: This script will verify the display changes on the Instrument Group Management page", and store it in a .csv file. These lines may occur in many places in the Word document, but always in the same format. Please help out. Thanks, Saran

Replies are listed 'Best First'.
Re: Extracting information from a MS WORD Document
by thparkth (Beadle) on Jan 14, 2008 at 20:03 UTC
    What platform are you running Perl on? If it's on a Windows machine, and MS Word is installed, you can do this fairly easily by having Perl start Word and control it via OLE. I can show you some example code if it's helpful.

    On the other hand, if you need to run the script on a UNIX/Linux box you are probably out of luck in native Perl. As far as I know there is no Perl module which can parse Word .docs. The format is *very* hairy.

    You might have success calling word2x from Perl and parsing the output.

      Hi, thanks for your reply. I am running it on a Windows machine. Please show/send me the example code. Thanks, Saran
        Here is some code that *might* work for you. I say *might* because I do not have win32 perl installed where I am, and my download is stuck at 1% ;) So take this for what it's worth; this is a starting point for your code. This is cribbed together from some win32 code I have in my home directory.
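        use strict;
        use warnings;
        use File::Spec;
        use Win32::OLE;
        use Win32::OLE::Enum;

        # this $text will hold all the text of all the word docs
        my $text = "";

        # this is how we start Word from Perl ("Quit" closes Word when we are done)
        my $word = Win32::OLE->new("Word.Application", "Quit")
            or die "Cannot start Word: " . Win32::OLE->LastError();

        # for every file in the current directory
        foreach my $filename (<*.doc>) {
            # tell Word to open the file we just found
            # (Word wants a full path, not one relative to Perl's directory)
            my $doc = $word->Documents->Open(File::Spec->rel2abs($filename))
                or next;

            # get an object representing all the paragraphs in the doc
            my $paragraphs = Win32::OLE::Enum->new($doc->Paragraphs());

            # for every paragraph...
            while (defined(my $paragraph = $paragraphs->Next())) {
                # append the paragraph text to $text
                $text .= $paragraph->{Range}->{Text};
            }

            # close the document without saving (0 = do not save changes)
            $doc->Close(0);
        }

        # Word ends each paragraph with \r, not \r\n
        foreach my $line (split(/\r/, $text)) {
            # placeholder pattern: replace it with the text you are looking for
            if ($line =~ /some pattern/) {
                # ..do something..
            }
        }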
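        For the matching and .csv step that the snippet leaves open, here is a minimal sketch. It assumes $text already holds the document text with one paragraph per line, and that every "Test Script Name" line is followed, sooner or later, by its "Script Description" line, as in the question. The helper names extract_records, csv_field, and write_csv, and the file name scripts.csv, are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: pull [name, description] pairs out of the text.
# Assumes each "Test Script Name" paragraph comes before its description.
sub extract_records {
    my ($text) = @_;
    my (@records, $name);
    foreach my $line (split /[\r\n]+/, $text) {
        if ($line =~ /Test Script Name\s*:\s*(.*\S)/) {
            # remember the name until the matching description shows up
            $name = $1;
        }
        elsif (defined $name && $line =~ /Script Description\s*:\s*(.*\S)/) {
            push @records, [$name, $1];
            undef $name;
        }
    }
    return @records;
}

# Hypothetical helper: quote one CSV field, doubling embedded quotes.
sub csv_field {
    my ($value) = @_;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Hypothetical helper: write one CSV row per record.
sub write_csv {
    my ($path, @records) = @_;
    open(my $out, '>', $path) or die "Cannot write $path: $!";
    foreach my $record (@records) {
        print {$out} join(',', map { csv_field($_) } @$record), "\n";
    }
    close($out) or die "Cannot close $path: $!";
}
```

        With $text filled in by the OLE loop, calling write_csv('scripts.csv', extract_records($text)) would produce one row per script. Quoting every field keeps commas inside a description from splitting the row.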
Re: Extracting information from a MS WORD Document
by rreck (Initiate) on Jan 16, 2008 at 16:46 UTC
    If it were me, I would probably extract text using Antiword. http://www.winfield.demon.nl/
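    A minimal sketch of that approach, assuming the antiword binary is installed and on the PATH; run with just a file name, it prints the document's plain text to standard output. The helper name run_converter and the file name report.doc are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: run an external command and return what it prints.
# The list form of open avoids passing the file name through a shell.
sub run_converter {
    my (@command) = @_;
    open(my $pipe, '-|', @command) or die "Cannot run $command[0]: $!";
    my $output = do { local $/; <$pipe> };   # slurp everything at once
    close($pipe) or die "$command[0] failed (exit status $?)";
    return defined $output ? $output : '';
}

# Usage (report.doc is a placeholder file name):
# my $text = run_converter('antiword', 'report.doc');
```

    The returned text can then be searched line by line for the "Test Script Name" and "Script Description" lines, with no need for Word on the machine.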