Hi Monks. My apologies if I'm offending anybody by asking for help too early in the development of the solution for this issue.

I'm working with corrupt MS Word .docx (Office Open XML) files. In these files the document.xml part, where the document's text resides, is corrupt, most often because the file was truncated at an arbitrary point.

By rough experimentation I have found a procedure that gets MS Word to open the file. First I extract whatever is recoverable from the zip structure using a corruption-tolerant unzipper like CakeCMD or no-frills unzipper. Then I clean out the partial tag at the end of the document.xml file, add </w:t></w:r></w:p></w:body></w:document>, and rezip the recovered files.
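The manual clean-up step described above can be sketched in a few lines of Perl. This is only a minimal sketch: trim_and_close is a hypothetical helper name, and it assumes the only damage is a tag cut off at the very end of the text (a '<' with no '>' after it).

```perl
use strict;
use warnings;

# Hypothetical helper: drop an incomplete tag at the end of the XML text,
# then append a fixed run of closing tags supplied by the caller.
sub trim_and_close {
    my ($xml, $closers) = @_;

    # Positions of the last '<' and the last '>' (-1 if not present).
    my $last_lt = rindex($xml, '<');
    my $last_gt = rindex($xml, '>');

    # A '<' with no '>' after it means the final tag was cut off.
    $xml = substr($xml, 0, $last_lt) if $last_lt > $last_gt;

    return $xml . $closers;
}

# Made-up sample: truncated in the middle of a closing tag.
my $broken = '<w:document><w:body><w:p><w:r><w:t>more te</w:';
my $fixed  = trim_and_close($broken,
    '</w:t></w:r></w:p></w:body></w:document>');
print "$fixed\n";
```

The fixed list of closing tags only fits when the truncation lands inside a plain paragraph's text run, which is why other cases need a different list.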

With one file, where the truncation occurred in the middle of a table, I found that I needed to add </w:t></w:r></w:p></w:tc></w:tr></w:tbl></w:body></w:document> instead. After I rezipped the files, Word could again open the document.

So basically my holy grail now is a generalized Perl script that can process document.xml files, truncate the file just before the first XML error, and then add the appropriate XML closing tags so as not to offend MS Word 2007 or 2010. One way I thought of was to make a list of all non-self-closing tags in document.xml, then step backwards from the end looking for the first instances of tags that are still unclosed, adding their respective closing tags in the order in which the unclosed tags are encountered.
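An equivalent way to get the same list of unclosed tags is to scan forward and keep a stack of open elements. Here is a rough sketch of that idea using only core Perl. The name close_open_tags is made up for illustration, and the code assumes the only damage is truncation at the end; it does not detect errors in the middle of the file, and I have not verified that Word accepts every result.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a partial tag at the end of the text, then
# close every element that is still open, innermost first.
sub close_open_tags {
    my ($xml) = @_;

    # Drop a trailing partial tag (a '<' with no '>' after it).
    my $lt = rindex($xml, '<');
    $xml = substr($xml, 0, $lt) if $lt > rindex($xml, '>');

    my @stack;    # names of elements opened but not yet closed
    # Match element tags only; <?xml ...?> and <!-- ... --> are skipped
    # because a tag name must start with a letter or underscore.
    while ($xml =~ m{<(/?)([A-Za-z_][\w:.-]*)([^<>]*)>}g) {
        my ($closing, $name, $rest) = ($1, $2, $3);
        next if $rest =~ m{/\s*$};    # self-closing, e.g. <w:br/>
        if ($closing) {
            pop @stack if @stack && $stack[-1] eq $name;
        }
        else {
            push @stack, $name;
        }
    }

    # Closing tags in reverse order of opening.
    return $xml . join '', map { "</$_>" } reverse @stack;
}

# Made-up sample: truncated inside a table cell.
my $broken = '<w:document><w:body><w:tbl><w:tr><w:tc><w:p>'
           . '<w:pPr><w:jc w:val="left"/></w:pPr><w:r><w:t>cell te</w:';
my $repaired = close_open_tags($broken);
print "$repaired\n";
```

For this sample the appended tags come out as </w:t></w:r></w:p></w:tc></w:tr></w:tbl></w:body></w:document>, the same list that was needed by hand for the table case.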

I had a look at CPAN and nothing jumped out at me about how to find the first error in an XML file or how to truncate an XML file, not to speak of walking back through an XML file looking for unclosed tags and closing them in the order found. I did see elsewhere a regular expression, <[^<>]+[^/]>, that is said to return the opening tags that are not self-closing. I know I'm fishing for some code from you all, but just some help as to which CPAN module to use, and maybe a better overall idea about how to approach this programmatically, would be nice.
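A quick check of that regular expression on a made-up fragment shows what it actually picks up. It does skip self-closing tags, but it also matches closing tags and the <?xml ...?> declaration. The stricter pattern below is my own assumption, not something from the place I found the original, and is only a sketch.

```perl
use strict;
use warnings;

# Made-up fragment of WordprocessingML for testing the patterns.
my $sample = '<?xml version="1.0"?><w:p><w:r><w:br/><w:t>Hi</w:t></w:r>';

# The pattern as found: skips <w:br/>, but also returns </w:t>, </w:r>
# and the XML declaration.
my @loose = $sample =~ m{(<[^<>]+[^/]>)}g;

# Stricter variant (an assumption): the tag must not start with '/', '?'
# or '!', and the character just before '>' must not be '/'.
my @strict = $sample =~ m{(<(?![/?!])[^<>]*(?<!/)>)}g;

print "loose:  @loose\n";
print "strict: @strict\n";
```

On this fragment the first pattern returns six matches and the stricter one returns only <w:p>, <w:r> and <w:t>.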

I was looking at PHP too; however, I think I want to do this in Perl and then compile it for use both in my free MS Office service (which now just extracts text and doesn't return full Word files) and in a planned open source VB.NET Word recovery program.


In reply to How to Truncate Corrupt Document.xml Files? by socrtwo
