DVCHAL has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I want to read a .docx file on Windows 7. I searched Google for this but could not find a solution. Many posts point to Win32::OLE. Is there a CPAN module for this, similar to Spreadsheet::WriteExcel for writing Excel files? If so, please share how to use it. Any help would be much appreciated.

Re: To Read and Edit docx files in Windows 7
by Laurent_R (Canon) on Dec 10, 2014 at 07:20 UTC
    Hmm, rather than the version of Windows, you should tell us which MS Office version you are using. Modern versions of MS Office produce files that are actually zip archives containing XML files, so you have to be prepared to edit XML. I haven't tried it personally, but it seems that simply unzipping a .docx file gives you access to the individual XML files.

    Otherwise, you could also try to use MSWord::ToHTML and see if it fits your purposes.

      Thanks for the reply, Laurent. I am using MS Office 2010. I need to read and edit .docx files. You mentioned editing XML files, but I am not familiar with that. Any example or reference for working with .docx files would be a great help.
        I've just created a short Word document called December_12.docx and copied it to a Unix machine. I then made a copy of it called December_12.zip. Unzipping that copy shows this:
        $ cp December_12.docx December_12.zip
        $ unzip December_12.zip
        Archive:  December_12.zip
          inflating: [Content_Types].xml
          inflating: _rels/.rels
          inflating: word/_rels/document.xml.rels
          inflating: word/document.xml
          inflating: word/theme/theme1.xml
          inflating: word/settings.xml
          inflating: word/webSettings.xml
          inflating: word/stylesWithEffects.xml
          inflating: docProps/core.xml
          inflating: word/styles.xml
          inflating: word/fontTable.xml
          inflating: docProps/app.xml
        Now you could in principle edit the word/document.xml file, except that the XML looks quite messy, even though the content of the Word document was only these two lines:
        December 12, 2014.
        The quick brown fox jumps over the lazy dog.
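        To make the unzip approach concrete, here is a minimal sketch for reading the text back out of word/document.xml from Perl. It assumes the core module IO::Uncompress::Unzip is recent enough to support the Name option for picking one archive member, and it uses regular expressions instead of a real XML parser, so treat it as a starting point only. The sub names docx_xml_to_text and read_docx_paragraphs are made up for this example. It relies on paragraphs being w:p elements and text runs being w:t elements, and it ignores nested paragraphs (text boxes), tabs and line breaks.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw($UnzipError);

# Extract paragraph text from a WordprocessingML string (word/document.xml).
# Hypothetical helper: regex based, not a full XML parser.
sub docx_xml_to_text {
    my ($xml) = @_;
    my %entity = ( lt => '<', gt => '>', quot => '"', apos => "'", amp => '&' );
    my @paragraphs;

    # <w:p> or <w:p ...> opens a paragraph; <w:pPr> must not match.
    while ( $xml =~ m{<w:p[ >](.*?)</w:p>}gs ) {
        my $body = $1;

        # Concatenate every <w:t> text run inside this paragraph.
        my $text = join '', $body =~ m{<w:t(?:\s[^>]*)?>([^<]*)</w:t>}g;

        # Decode the five predefined XML entities in a single pass.
        $text =~ s/&(lt|gt|quot|apos|amp);/$entity{$1}/g;
        push @paragraphs, $text if length $text;
    }
    return @paragraphs;
}

# Read word/document.xml straight out of the .docx archive.
sub read_docx_paragraphs {
    my ($path) = @_;
    my $member = IO::Uncompress::Unzip->new( $path, Name => 'word/document.xml' )
        or die "Cannot open word/document.xml in $path: $UnzipError\n";
    my $xml = do { local $/; <$member> };    # slurp the whole member
    $member->close;
    utf8::decode($xml);                      # document.xml is normally UTF-8
    return docx_xml_to_text($xml);
}

if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for read_docx_paragraphs( $ARGV[0] );
}
```

        Run it as perl docx_text.pl December_12.docx (the script name is arbitrary). Writing changes back means editing the XML and re-zipping every member, which is more involved and is not shown here.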
Re: To Read and Edit docx files in Windows 7
by RonW (Parson) on Dec 10, 2014 at 22:53 UTC

    Didn't find much in the way of Perl modules for working with MS Word files.

    Depending on just what needs to be done, maybe you could extract text from the document with Text::Extract::Word, edit the resulting text, then convert the text to HTML using HTML::FromText, saving the resulting HTML with a .doc extension. MS Word will read that without complaint.* Of course, the original formatting will be lost. (Though any formatting your Perl program puts in the HTML will be accepted by MS Word.)
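    As a rough illustration of the text-to-HTML step, the sketch below wraps plain text in a minimal HTML page and writes it to whatever file name you give it, for example one ending in .doc. It is a hand-rolled stand-in using core Perl only, not the HTML::FromText API, and the sub names escape_html and text_to_html are invented for this example. It assumes blank lines separate paragraphs.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    my %entity = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' );
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Turn plain text into a minimal HTML page, one <p> per paragraph.
sub text_to_html {
    my ($text) = @_;
    my @paragraphs = grep { /\S/ } split /\n\s*\n/, $text;
    s/^\s+|\s+$//g for @paragraphs;    # trim each paragraph in place
    my $body = join "\n", map { '<p>' . escape_html($_) . '</p>' } @paragraphs;
    return qq{<html><head><meta charset="utf-8"></head>\n<body>\n$body\n</body></html>\n};
}

# Usage: perl text2doc.pl edited.txt result.doc   (file names are examples)
if ( @ARGV == 2 ) {
    my ( $in, $out ) = @ARGV;
    open my $ifh, '<:encoding(UTF-8)', $in or die "Cannot read $in: $!\n";
    my $text = do { local $/; <$ifh> };
    close $ifh;

    open my $ofh, '>:encoding(UTF-8)', $out or die "Cannot write $out: $!\n";
    print {$ofh} text_to_html($text);
    close $ofh or die "Cannot close $out: $!\n";
}
```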

    (There is Win32::Word::Writer, but it would be harder to use than what I suggested, above.)

    Alternately, there are tools for converting docx files to ODF files and several Perl modules for working with ODF files. Then your program can modify it and MSWord can read the result. (Supposedly, MSWord can also export to ODF - assuming you can convince your users to do that.)

    ---

    * I know this because, when I had a website, resume.doc was just a symbolic link to resume.html

      The Text::Extract::Word module you mentioned only reads .doc files; it does not work for .docx files. Is there a similar module to convert docx to text?
Re: To Read and Edit docx files in Windows 7
by SimonPratt (Friar) on Dec 11, 2014 at 09:25 UTC

    There is no need for a module, as you can use the OLE interface to interact directly with Word. Any module that wraps (or replaces) this functionality would have to be updated for each new version of Office that comes out.

    This code should get you started on the right track, though it is written for Word 2013, as that is what I have. You should be able to just change the object library version number to get it to work with your version of Word:

    use 5.16.2;
    use Win32::OLE;
    use Win32::OLE::Enum;
    use Win32::OLE::Const 'Microsoft Office 15.0 Object Library';

    # 'Quit' is registered as the destructor for the Word application object
    my $word = Win32::OLE->new( 'Word.Application', 'Quit' );
    my $doc = $word->Documents->Open( 'C:\Temp\OLE\Word\test.docx' )
        || die 'Unable to open document: ', Win32::OLE->LastError;

    my $paragraphs = Win32::OLE::Enum->new( $doc->Paragraphs );
    while ( defined( my $paragraph = $paragraphs->Next ) ) {
        my $words = Win32::OLE::Enum->new( $paragraph->{Range}->{Words} );
        # $item rather than $word, so the application object is not shadowed
        while ( defined( my $item = $words->Next ) ) {
            $item->{Text} =~ s/([Hh])i/$1ello/;
        }
    }

    $doc->Save;
    $doc->Close;

    You may also find this helpful: http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word%28v=office.14%29.aspx

      Hi Simon! Thanks for the reply. My code works, but it throws the error shown below:

      Win32::OLE(0.1709) error 0x80010108: "The object invoked has disconnected from its clients"
          in METHOD/PROPERTYGET "Quit" at Author_doc_read_new.pl line 0
          eval {...} called at Author_doc_read_new.pl line 0
          eval {...} called at Author_doc_read_new.pl line 0

      What may be the cause of it? Below is my code snippet:
      use Win32::OLE;
      use Win32::OLE::Enum;
      use Win32::OLE::Const 'Microsoft Office 15.0 Object Library';
      use Win32::OLE::Const 'Microsoft Word';

      #$tm = localtime;
      #print "$tm\n";

      #Create and Open the Text file to Write
      open(OUTFILE2,">Author_name_extract.txt") or die("Cant open Output file\n");

      ### open Word application and add an empty document
      ### (will die if Word not installed on your machine)
      my $word = Win32::OLE->new('Word.Application', 'Quit') or die;
      $word->{Visible} = 0;

      @filesnames = glob '*.docx';
      #@filesnames = "AR765_Maint_Code_repositoryUINT32.docx";

      foreach $count (@filesnames) #Loop till the end is reached
      {
          if($count !~ /^~\$/)
          {
              print "$count\n";
              $filename = "D:\\MRJ_BCU\\Perl\\From thejaswini\\doc_read\\$count";
              #my $document = $word->Documents->open($filename) || die 'Unable to open document: ', Win32::OLE->LastError;
              my $document = $word->Documents->open($filename)|| die 'Unable to open document:';
              open(OUTFILE1,">File_under_Review.txt") or die("Cant open Output file\n");
              print "Extracting Text from $filename...\n";
              $paragraphs = $document->Paragraphs();
              $enumerate = new Win32::OLE::Enum($paragraphs);
              while(defined($paragraph = $enumerate->Next()))
              {
                  $a = $paragraph->{Range}->{Text};
                  print OUTFILE1 "$a\n";
              }
              close(OUTFILE1);
              $document->Save;
              $document->Close;

              # Open the Converted Text file to read the Pattern.
              open(INFILE,"<File_under_Review.txt") or die("Can't open file specified\n");
              while($a = <INFILE>)
              {
                  if($a !~ /\S/)
                  {
                      ;
                  }
                  else
                  {
                      $b = $a;
                      if($a =~ /Date:/)
                      {
                          $a =~ /\s*\S*\s*Date:\s*(\d*\/\d*\/\d*)\s*/;
                          $a= $1;
                          $a =~ s/\s*//g;
                          $a =~ s/_*//g;
                          print OUTFILE2 "$count\t";
                          print OUTFILE2 "$a\t";
                      }
                      if($b =~ /Review Moderator:/)
                      {
                          $b =~ /\s*\S*\s*Review Moderator:\s*(\w+\s?.?\w*)\s*Date:/;
                          $b=$1;
                          $b =~ s/\s{2,}//g;
                          $b =~ s/_*//g;
                          print OUTFILE2 "$b\n";
                      }
                  }
              }
              close(INFILE);

              #To Delete the Temp converted text File
              unlink("File_under_Review.txt");
          }
          else
          {
              print "Corrupted File: $count\n";
          }
      }

      #To Quit the Word Application
      $word->Quit();

      #close the Output text file used to write
      close(OUTFILE2);

      $tm = localtime;
      print "$tm\n";

        Firstly, I highly recommend using use strict; and use warnings; or use version; (replacing version with your Perl version number). Especially when using Win32::OLE.

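        Next, I highly suspect the explicit $word->Quit(); is not necessary. The second argument in Win32::OLE->new('Word.Application', 'Quit') already registers Quit as a destructor that runs when $word is destroyed, so after your own call to Quit the destructor ends up calling it again on an application that has already gone away. This is (I think) the most likely source of the error you are experiencing.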
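        For what it's worth, here is an untested sketch of the extraction loop without the explicit Quit call. It assumes the 'Quit' string passed to Win32::OLE->new is treated as a destructor method, as the Win32::OLE documentation describes, and that Paragraphs->Count and Paragraphs->Item behave as in the Word object model. The sub name extract_paragraphs is made up for this example, and Win32::OLE is loaded at run time so that the script still compiles on machines without it.

```perl
use strict;
use warnings;

# Win32::OLE exists only on Windows, so load it at run time.
my $have_ole = eval { require Win32::OLE; 1 };

# Return the text of every paragraph in one document (hypothetical helper).
sub extract_paragraphs {
    my ( $word, $path ) = @_;
    my $doc = $word->Documents->Open($path)
        or die "Unable to open $path: ", Win32::OLE->LastError, "\n";

    my $paragraphs = $doc->Paragraphs;
    my @text = map { $paragraphs->Item($_)->Range->Text } 1 .. $paragraphs->Count;

    $doc->Close(0);    # 0 is wdDoNotSaveChanges: the file is only read
    return @text;
}

if ( $have_ole && @ARGV ) {
    # 'Quit' here is the destructor: Word is asked to quit when $word is destroyed.
    my $word = Win32::OLE->new( 'Word.Application', 'Quit' )
        or die 'Cannot start Word: ', Win32::OLE->LastError, "\n";
    $word->{Visible} = 0;

    for my $path (@ARGV) {    # pass full paths to the .docx files
        print "$_\n" for extract_paragraphs( $word, $path );
    }

    # No explicit $word->Quit: the destructor above takes care of it.
}
```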