in reply to How can I read the .docx file in perl?

prabuvos:

Unzip it ... it's just a collection of XML files. Then you can dig through them with your favorite XML parser. For many documents, you'll need only the items in the word directory.

$ unzip hack.docx Archive: hack.docx inflating: [Content_Types].xml inflating: _rels/.rels inflating: word/_rels/document.xml.rels inflating: word/document.xml inflating: word/_rels/header2.xml.rels inflating: word/footer2.xml inflating: word/footer1.xml inflating: word/footer3.xml inflating: word/header2.xml inflating: word/header1.xml inflating: word/endnotes.xml inflating: word/footnotes.xml inflating: word/header3.xml extracting: word/media/image1.jpeg inflating: word/theme/theme1.xml inflating: word/_rels/settings.xml.rels inflating: word/settings.xml inflating: word/styles.xml inflating: word/webSettings.xml inflating: word/numbering.xml inflating: docProps/app.xml inflating: docProps/core.xml inflating: word/fontTable.xml inflating: docProps/custom.xml

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^2: How can I read the .docx file in perl?
by locked_user sundialsvc4 (Abbot) on Apr 17, 2013 at 13:35 UTC

    In addition, Microsoft provides XML schemas, e.g. here, by which the contents of the file can be validated – also used in some forms of extraction.

    If you use an “industrial strength” package such as XML::LibXML, which is based on the industry-standard libxml2 library, you will get all the goodies that you need.

    IIRC, Microsoft was told a few years ago by several governments that a “closed” format was no longer acceptable for government documents ... a very sensible concern, of course.   Of course, ODF is also an XML-based format.   See http://oasis-open.org.