Re^3: To Read and Edit docx files in Windows 7

I've just created a short Word document called December_12.docx and copied it to a Unix platform. I then made a copy of it called December_12.zip; unzipping that copy shows this:

$ cp December_12.docx December_12.zip
$ unzip December_12.zip
Archive:  December_12.zip
  inflating: [Content_Types].xml
  inflating: _rels/.rels
  inflating: word/_rels/document.xml.rels
  inflating: word/document.xml
  inflating: word/theme/theme1.xml
  inflating: word/settings.xml
  inflating: word/webSettings.xml
  inflating: word/stylesWithEffects.xml
  inflating: docProps/core.xml
  inflating: word/styles.xml
  inflating: word/fontTable.xml
  inflating: docProps/app.xml
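The copy to a .zip name is a convenience only: a .docx file is already a ZIP archive, so it can be opened under its original name. As a minimal sketch of the same listing done from Perl, assuming the CPAN module Archive::Zip is installed (the post itself only uses cp and unzip; the helper names docx_members and docx_document_xml are made up for this example):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);

# Return the names of all members of a .docx (which is a plain ZIP archive).
sub docx_members {
    my ($file) = @_;
    my $zip = Archive::Zip->new();
    $zip->read($file) == AZ_OK
        or die "Cannot read '$file' as a ZIP archive\n";
    return $zip->memberNames();
}

# Return the raw XML of the main document part, word/document.xml.
sub docx_document_xml {
    my ($file) = @_;
    my $zip = Archive::Zip->new();
    $zip->read($file) == AZ_OK
        or die "Cannot read '$file' as a ZIP archive\n";
    my $xml = $zip->contents('word/document.xml');
    defined $xml or die "No word/document.xml in '$file'\n";
    return $xml;
}

# File name taken from the post; pass a different one on the command line.
my $file = shift @ARGV // 'December_12.docx';
if (-e $file) {
    print "$_\n" for docx_members($file);
    printf "word/document.xml is %d bytes\n", length docx_document_xml($file);
}
```

Run against the document from the post, this should list the same member names that unzip reports, without creating any files on disk.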

Now you could in principle edit the word/document.xml document, except that the XML looks quite messy:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 wp14"><w:body><w:p w:rsidR="006C10F5" w:rsidRDefault="00C4263D"/><w:p w:rsidR="00C4263D" w:rsidRPr="00C4263D" w:rsidRDefault="00C4263D"><w:pPr><w:rPr><w:lang w:val="en-US"/></w:rPr></w:pPr><w:r w:rsidRPr="00C4263D"><w:rPr><w:lang w:val="en-US"/></w:rPr><w:t>December 12, 2014.</w:t></w:r></w:p><w:p w:rsidR="00C4263D" w:rsidRPr="00C4263D" w:rsidRDefault="00C4263D"><w:pPr><w:rPr><w:lang w:val="en-US"/></w:rPr></w:pPr><w:proofErr w:type="gramStart"/><w:r w:rsidRPr="00C4263D"><w:rPr><w:lang w:val="en-US"/></w:rPr><w:t xml:space="preserve">The quick brown </w:t></w:r><w:proofErr w:type="spellStart"/><w:r w:rsidRPr="00C4263D"><w:rPr><w:lang w:val="en-US"/></w:rPr><w:t>fox</w:t></w:r><w:proofErr w:type="spellEnd"/><w:r w:rsidRPr="00C4263D"><w:rPr><w:lang w:val="en-US"/></w:rPr><w:t xml:space="preserve"> jumps over the lazy dog.</w:t></w:r><w:proofErr w:type="gramEnd"/></w:p><w:p w:rsidR="00C4263D" w:rsidRDefault="00C4263D"><w:pPr><w:rPr><w:lang w:val="en-US"/></w:rPr></w:pPr></w:p><w:p w:rsidR="00C4263D" w:rsidRPr="00C4263D" w:rsidRDefault="00C4263D"><w:pPr><w:rPr><w:lang w:val="en-US"/></w:rPr></w:pPr><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p><w:sectPr w:rsidR="00C4263D" w:rsidRPr="00C4263D"><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1417" w:right="1417" w:bottom="1417" w:left="1417" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>
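As for what "editing in principle" might look like: the visible text sits in w:t elements, so one option is to parse the XML and change those text nodes rather than edit the file by hand. The following is only a sketch under assumptions: it uses the CPAN module XML::LibXML, replace_run_text is a made-up helper name, and the sample string is a cut-down stand-in for the document.xml above that keeps just the "fox" run.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Namespace URI bound to the w: prefix in the document.xml shown above.
my $W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

# Replace the text of every <w:t> run whose content is exactly $old and
# return the modified XML as a string. (Hypothetical helper, not a library call.)
sub replace_run_text {
    my ($xml, $old, $new) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(w => $W_NS);
    for my $t ($xpc->findnodes('//w:t')) {
        next unless $t->textContent eq $old;
        $t->removeChildNodes();   # drop the old text node
        $t->appendText($new);     # add the replacement text
    }
    return $doc->toString();
}

# Cut-down stand-in for word/document.xml, holding only the "fox" run.
my $sample = <<"XML";
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="$W_NS"><w:body><w:p><w:r><w:t>fox</w:t></w:r></w:p></w:body></w:document>
XML

print replace_run_text($sample, 'fox', 'cat');
```

Note the limitation: Word has split the sentence into three runs ("The quick brown ", "fox", " jumps over the lazy dog."), so a phrase that spans several runs will not be matched by a per-run comparison like this one. To get the result back into a .docx, Archive::Zip's contents('word/document.xml', $new_xml) followed by writeToFileNamed() under a new file name should do it; that step is left out here to keep the example self-contained.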

The content of the Word document was only these two lines:

December 12, 2014.


The quick brown fox jumps over the lazy dog.

Re^4: To Read and Edit docx files in Windows 7 by DVCHAL (Novice) on Dec 11, 2014 at 07:05 UTC
Thanks for the sample, Laurent. Is there any way to extract the content from the XML file with a Perl script? In your example, how would I extract only "The quick brown fox jumps over the lazy dog" from the messy XML? And if the document contains a table, can that also be read from the XML?
Re^5: To Read and Edit docx files in Windows 7 by Anonymous Monk on Dec 11, 2014 at 07:34 UTC
Use XML::LibXML, with tools like xpather.pl/htmltreexpather.pl, which can give you XPath expressions to start with. All the links in Re: Retrieve select information from HTML are examples (for tree-xpath and others), walkthroughs, or tutorials. See also Re: How to grab a portion of file with regex (don't) (parsing html/xml with xpath/twig/dom, because html::parser is low level), Re: How to grab a portion of file with regex (parsing html/xml with xpath/twig/dom, because xml::parser is low level), and Re^4: How to grab a portion of file with regex (parsing html/xml with xpath/twig/dom, because ::parser is low level).
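A minimal sketch of that approach for the sample document, assuming XML::LibXML is installed. The helper name paragraphs_text is made up for this example, and the embedded XML is a cut-down version of the document.xml from the parent post (same runs and text, with the rsid attributes, language properties and section settings removed). The namespace URI is the one bound to the w: prefix in that file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Namespace URI bound to the w: prefix in word/document.xml.
my $W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

# Return one string per non-empty paragraph (<w:p>), built by joining the
# text of all its <w:t> runs in document order.
sub paragraphs_text {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(w => $W_NS);
    my @paragraphs;
    for my $p ($xpc->findnodes('//w:p')) {
        my $text = join '', map { $_->textContent } $xpc->findnodes('.//w:t', $p);
        push @paragraphs, $text if length $text;   # skip empty paragraphs
    }
    return @paragraphs;
}

# Cut-down sample; a real script would read word/document.xml from the .docx.
my $sample = <<"XML";
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="$W_NS"><w:body>
<w:p/>
<w:p><w:r><w:t>December 12, 2014.</w:t></w:r></w:p>
<w:p><w:proofErr w:type="gramStart"/><w:r><w:t xml:space="preserve">The quick brown </w:t></w:r><w:proofErr w:type="spellStart"/><w:r><w:t>fox</w:t></w:r><w:proofErr w:type="spellEnd"/><w:r><w:t xml:space="preserve"> jumps over the lazy dog.</w:t></w:r><w:proofErr w:type="gramEnd"/></w:p>
<w:p/>
</w:body></w:document>
XML

print "$_\n" for paragraphs_text($sample);
# prints:
# December 12, 2014.
# The quick brown fox jumps over the lazy dog.
```

On the table question: WordprocessingML stores a table as a w:tbl element containing w:tr rows and w:tc cells, and each cell holds ordinary w:p paragraphs, so the same //w:p query also returns the cell text. To keep the row and column structure, iterate over //w:tbl/w:tr and then over the w:tc children of each row instead.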