Re: To Read and Edit docx files in Windows 7
by Laurent_R (Canon) on Dec 10, 2014 at 07:20 UTC
Hmm, rather than the version of Windows, you should tell us which MS Office version you are using. Since Office 2007, Word saves .docx files, which are actually ZIP archives containing XML files, so you have to be prepared to edit XML. I haven't tried it personally, but it seems that simply unzipping a .docx file gives you access to the individual XML files.
Otherwise, you could also try to use MSWord::ToHTML and see if it fits your purposes.
Thanks for the reply, Laurent. I am using MS Office 2010. I need to read and edit .docx files. You mentioned editing XML files, which I am not familiar with. Any example or reference for working with .docx files would be of great help.
I've just created a short Word document called December_12.docx and copied it to a Unix platform. I then made a copy of it called December_12.zip and unzipped it, which shows this:
$ cp December_12.docx December_12.zip
$ unzip December_12.zip
Archive: December_12.zip
inflating: [Content_Types].xml
inflating: _rels/.rels
inflating: word/_rels/document.xml.rels
inflating: word/document.xml
inflating: word/theme/theme1.xml
inflating: word/settings.xml
inflating: word/webSettings.xml
inflating: word/stylesWithEffects.xml
inflating: docProps/core.xml
inflating: word/styles.xml
inflating: word/fontTable.xml
inflating: docProps/app.xml
Now you could in principle edit the word/document.xml file, except that the XML looks quite messy, even though the content of the Word document was only these two lines:
December 12, 2014.
The quick brown fox jumps over the lazy dog.
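To give an idea of what reading that file from Perl could look like, here is a minimal sketch. It assumes the main text lives in word/document.xml (as in the listing above) and uses the core module IO::Uncompress::Unzip, so no manual copy to .zip is needed. The helper names read_docx_member and docx_xml_to_text are my own, and the regex-based extraction is a shortcut, not a real XML parser: it only collects the text of w:t elements inside each w:p paragraph and ignores tables, text boxes and the like.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Return the raw content of one member of a .docx (ZIP) archive.
# $docx may be a file name or a reference to a scalar holding the archive.
sub read_docx_member {
    my ($docx, $member) = @_;
    my $content;
    unzip($docx => \$content, Name => $member)
        or die "Cannot read $member: $UnzipError\n";
    return $content;
}

# Collect the text of each <w:p> paragraph, one paragraph per line.
# Hypothetical helper: a regex shortcut, not a full XML parser.
sub docx_xml_to_text {
    my ($xml) = @_;
    my %entity = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');
    my @paragraphs;
    # Match <w:p> or <w:p attr="...">, but not <w:pPr> or self-closing <w:p/>.
    while ($xml =~ m{<w:p(?:\s[^>/]*)?>(.*?)</w:p>}gs) {
        my $body = $1;
        # Text runs are <w:t> or <w:t xml:space="preserve">.
        my $text = join '', $body =~ m{<w:t(?:\s[^>]*)?>([^<]*)</w:t>}g;
        # Decode the five predefined XML entities in a single pass.
        $text =~ s/&(lt|gt|quot|apos|amp);/$entity{$1}/g;
        push @paragraphs, $text;
    }
    return join "\n", @paragraphs;
}

# Usage: perl docx2text.pl December_12.docx
if (@ARGV) {
    my $xml = read_docx_member($ARGV[0], 'word/document.xml');
    print docx_xml_to_text($xml), "\n";
}
```

For anything beyond plain paragraphs, a proper XML parser would be a safer choice than regexes, and writing changes back means re-creating the ZIP archive with all of its members.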
Re: To Read and Edit docx files in Windows 7
by RonW (Parson) on Dec 10, 2014 at 22:53 UTC
I didn't find much in the way of Perl modules for working with MS Word files.
Depending on just what needs to be done, maybe you could extract text from the document with Text::Extract::Word, edit the resulting text, then convert the text to HTML using HTML::FromText, saving the resulting HTML with a .doc extension. MS Word will read that without complaint.* Of course, the original formatting will be lost. (Though any formatting your Perl program puts in the HTML will be accepted by MS Word.)
(There is Win32::Word::Writer, but it would be harder to use than what I suggested above.)
Alternatively, there are tools for converting docx files to ODF files, and several Perl modules for working with ODF files. Your program can then modify the ODF file, and MS Word can read the result. (Supposedly, MS Word can also export to ODF, assuming you can convince your users to do that.)
---
* I know this because, when I had a website, resume.doc was just a symbolic link to resume.html
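For the "save HTML with a .doc extension" step, a minimal sketch might look like the following. Note that it does not use HTML::FromText: the escaping and paragraph splitting are hand-rolled with core Perl so the example is self-contained, and the function names and file names are made up for illustration.

```perl
use strict;
use warnings;

# Escape the three characters that are special in HTML text content.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Turn plain text into a minimal HTML page; blank lines separate paragraphs.
sub text_to_html {
    my ($text, $title) = @_;
    my @paragraphs;
    for my $chunk (split /\n\s*\n/, $text) {
        $chunk =~ s/^\s+|\s+$//g;          # trim surrounding whitespace
        next unless length $chunk;
        push @paragraphs, '<p>' . escape_html($chunk) . '</p>';
    }
    return join "\n",
        '<html><head><title>' . escape_html($title) . '</title></head>',
        '<body>', @paragraphs, '</body></html>', '';
}

# Usage: perl text2doc.pl edited.txt edited.doc   (both names are hypothetical)
if (@ARGV == 2) {
    my ($in, $out) = @ARGV;
    open my $src, '<', $in or die "Cannot read $in: $!\n";
    my $text = do { local $/; <$src> };    # slurp the whole file
    close $src;
    open my $dst, '>', $out or die "Cannot write $out: $!\n";
    print {$dst} text_to_html($text, $in);
    close $dst or die "Cannot close $out: $!\n";
}
```

Any markup your program adds inside the paragraphs would be where the formatting comes from, since the original Word formatting is gone at this point.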
The module you referred to, Text::Extract::Word, only reads .doc files and does not work for .docx files. Are there any similar modules to convert docx to text?
Re: To Read and Edit docx files in Windows 7
by SimonPratt (Friar) on Dec 11, 2014 at 09:25 UTC
There is no need for a module, as you can use the OLE interface to interact directly with Word. Any module that wraps (or replaces) this functionality would have to be updated for each new version of Office that comes out.
This code should get you started on the right track (though it is written for Word 2013, as that is what I have; you should be able to just change the object library version number to get it to work with your version of Word):
use 5.16.2;
use Win32::OLE;
use Win32::OLE::Enum;
use Win32::OLE::Const 'Microsoft Office 15.0 Object Library';
my $word = Win32::OLE->new( 'Word.Application', 'Quit' );
my $doc = $word->Documents->Open( 'C:\Temp\OLE\Word\test.docx' ) || die 'Unable to open document: ', Win32::OLE->LastError;
my $paragraphs = Win32::OLE::Enum->new( $doc->Paragraphs );
while ( defined( my $paragraph = $paragraphs->Next ) ) {
my $words = Win32::OLE::Enum->new( $paragraph->{Range}->{Words} );
while ( defined( my $word = $words->Next ) ) {
$word->{Text} =~ s/([Hh])i/$1ello/;
}
}
$doc->Save;
$doc->Close;
You may also find this helpful: http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word%28v=office.14%29.aspx
use Win32::OLE;
use Win32::OLE::Enum;
use Win32::OLE::Const 'Microsoft Office 15.0 Object Library';
use Win32::OLE::Const 'Microsoft Word';
#$tm = localtime;
#print "$tm\n";
#Create and Open the Text file to Write
open(OUTFILE2,">Author_name_extract.txt") or die("Cant open Output file\n");
### open Word application and add an empty document
### (will die if Word not installed on your machine)
my $word = Win32::OLE->new('Word.Application', 'Quit') or die;
$word->{Visible} = 0;
@filesnames = glob '*.docx';
#@filesnames = "AR765_Maint_Code_repositoryUINT32.docx";
foreach $count (@filesnames) #Loop till the end is reached
{
if($count !~ /^~\$/)
{
print "$count\n";
$filename = "D:\\MRJ_BCU\\Perl\\From thejaswini\\doc_read\\$count";
#my $document = $word->Documents->open($filename) || die 'Unable to open document: ', Win32::OLE->LastError;
my $document = $word->Documents->open($filename)|| die 'Unable to open document:';
open(OUTFILE1,">File_under_Review.txt") or die("Cant open Output file\n");
print "Extracting Text from $filename...\n";
$paragraphs = $document->Paragraphs();
$enumerate = new Win32::OLE::Enum($paragraphs);
while(defined($paragraph = $enumerate->Next()))
{
$a = $paragraph->{Range}->{Text};
print OUTFILE1 "$a\n";
}
close(OUTFILE1);
$document->Save;
$document->Close;
# Open the Converted Text file to read the Pattern.
open(INFILE,"<File_under_Review.txt") or die("Can't open file specified\n");
while($a = <INFILE>)
{
if($a !~ /\S/)
{
;
}
else
{
$b = $a;
if($a =~ /Date:/)
{
$a =~ /\s*\S*\s*Date:\s*(\d*\/\d*\/\d*)\s*/;
$a= $1;
$a =~ s/\s*//g;
$a =~ s/_*//g;
print OUTFILE2 "$count\t";
print OUTFILE2 "$a\t";
}
if($b =~ /Review Moderator:/)
{
$b =~ /\s*\S*\s*Review Moderator:\s*(\w+\s?.?\w*)\s*Date:/;
$b=$1;
$b =~ s/\s{2,}//g;
$b =~ s/_*//g;
print OUTFILE2 "$b\n";
}
}
}
close(INFILE);
#To Delete the Temp converted text File
unlink("File_under_Review.txt");
}
else
{
print "Corrupted File: $count\n";
}
}
#To Quit the Word Application
$word->Quit();
#close the Output text file used to write
close(OUTFILE2);
$tm = localtime;
print "$tm\n";
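Firstly, I highly recommend adding use strict; and use warnings; (or use VERSION;, replacing VERSION with your Perl version number, which enables strict automatically from 5.12 onward). This matters especially when using Win32::OLE.
Next, I strongly suspect the explicit $word->Quit(); is not necessary: the 'Quit' argument you pass to Win32::OLE->new already arranges for Quit to be called when the object is destroyed, and Word should already close when you close the last document you have open. This is (I think) the most likely source of the error you are experiencing.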
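To illustrate the first point, here is a rough sketch of how the field-extraction part of your script could look under strict and warnings, with lexical variables and lexical filehandles instead of $a, $b and bareword handles. The sub name parse_review_line is made up, and the line layout ("Review Moderator: NAME  Date: dd/mm/yyyy") is only my guess from your regexes, so adjust the patterns to your real documents.

```perl
use strict;
use warnings;

# Extract the moderator name and the review date from one line of text.
# Assumed layout (a guess): "Review Moderator: NAME ____  Date: 12/10/2014"
sub parse_review_line {
    my ($line) = @_;
    my %field;
    if ($line =~ /Review Moderator:\s*(.*)/) {
        my $name = $1;
        $name =~ s/Date:.*//;       # the date, if present, follows the name
        $name =~ s/_+//g;           # drop fill-in underscores
        $name =~ s/\s+/ /g;         # collapse runs of whitespace
        $name =~ s/^ | $//g;        # trim
        $field{moderator} = $name if length $name;
    }
    if ($line =~ m{Date:\s*(\d+/\d+/\d+)}) {
        $field{date} = $1;
    }
    return \%field;
}

# Usage: perl parse_review.pl File_under_Review.txt
if (@ARGV) {
    my ($file) = @ARGV;
    open my $in, '<', $file or die "Cannot open $file: $!\n";
    while (my $line = <$in>) {
        my $f = parse_review_line($line);
        next unless %$f;            # skip lines with neither field
        print join("\t", $file, $f->{date} // '', $f->{moderator} // ''), "\n";
    }
    close $in;
}
```

With strict in place, a misspelled variable name becomes a compile-time error instead of a silently empty value, which is the main reason I recommend it.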