How to Truncate Corrupt Document.xml Files?

socrtwo has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks. My apologies if I'm offending anybody by asking for help too early in the development of the solution for this issue.

I'm working with corrupt MS Word .docx (Office Open XML) files. These are files in which the document.xml part, where the document's text resides, is corrupt, often because the file was truncated at a random point.

By rough experimentation I have found that if I clean out a partial tag at the end of the XML file, append `</w:t></w:r></w:p></w:body></w:document>`, and then rezip whatever was recoverable from the zip structure (extracted with a corruption-tolerant unzipper like CakeCMD or no-frills unzipper), I can get MS Word to open the file.

With one file, where the truncation occurred in the middle of a table, I found that I needed to add `</w:t></w:r></w:p></w:tc></w:tr></w:tbl></w:body></w:document>` instead. I then rezipped the files and Word could again open the document.

So my holy grail now is a generalized Perl script that can process document.xml files, truncate the file just before the first XML error, and then add the appropriate closing tags so as not to offend MS Word 2007 or 2010. One way, I thought, would be to make a list of all non-self-closing tags in document.xml, then step backwards through the file looking for the first instances of unclosed tags and add their respective closing tags in the order in which the unclosed tags were encountered.
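
A minimal sketch of that idea, using core Perl only. It scans forward with a stack rather than stepping backwards, which should yield the same closing order. The helper name `close_truncated_xml` is hypothetical, and the tag-matching regex is an assumption that ignores comments, CDATA sections, and `>` inside attribute values, so it is a starting point rather than a robust repair tool.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): drop a trailing
# partial tag, then close whatever elements are still open.
sub close_truncated_xml {
    my ($xml) = @_;

    # 1. Drop an incomplete tag at the very end, e.g. '<w:t xml:sp'.
    $xml =~ s/<[^<>]*\z//;

    # 2. Walk the complete tags in order, keeping a stack of open elements.
    #    Declarations such as <?xml ...?> do not match and are skipped.
    my @open;
    while ($xml =~ m{<(/?)([A-Za-z_][\w:.-]*)[^<>]*?(/?)>}g) {
        my ($closing, $name, $self_closing) = ($1, $2, $3);
        next if $self_closing;    # <w:br/> opens nothing
        if ($closing) {
            # A mismatched closing tag is simply ignored in this sketch.
            pop @open if @open && $open[-1] eq $name;
        }
        else {
            push @open, $name;
        }
    }

    # 3. Append closing tags, innermost element first.
    return $xml . join '', map { "</$_>" } reverse @open;
}

print close_truncated_xml('<w:document><w:body><w:p><w:r><w:t>Hello wor'), "\n";
# prints: <w:document><w:body><w:p><w:r><w:t>Hello wor</w:t></w:r></w:p></w:body></w:document>
```

Because the stack records whatever happens to be open at the cut, the same routine would also produce the `</w:tc></w:tr></w:tbl>` sequence when the truncation falls inside a table. A cut-off entity such as `&am` at the end of the text is not handled here.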

I had a look at CPAN and nothing jumped out at me about how to find the first error in an XML file or how to truncate an XML file, not to speak of walking back through an XML file looking for unclosed tags and closing them in the order found. I did see elsewhere that the regular expression `<[^<>]+[^/]>` would return those opening tags that are not self-closing. I know I'm fishing for some code from you all, but just some help as to which CPAN module to use, and maybe a better overall idea of how to approach this programmatically, would be nice.
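
A quick check of that pattern on a small sample suggests it is looser than described: it also returns closing tags and the XML declaration, and it cannot match a one-character tag name such as `<p>`. The stricter variant below is an assumption of mine, not something from the source of the original pattern, and the sample string is made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample: a declaration, nested opening tags, one self-closing
# tag, and two closing tags.
my $sample = '<?xml version="1.0"?><w:p><w:r><w:br/><w:t>Hi</w:t></w:r>';

# The pattern quoted above, unchanged.
my @loose = $sample =~ m{<[^<>]+[^/]>}g;

# Assumed stricter variant: the tag must not start with '/', '?' or '!',
# and the character just before '>' must not be '/'.
my @strict = $sample =~ m{<(?![/?!])[^<>]+(?<!/)>}g;

print scalar(@loose), " loose matches: @loose\n";
print scalar(@strict), " strict matches: @strict\n";
```

On this sample the quoted pattern returns six matches, including `</w:t>`, `</w:r>` and the declaration, while the stricter variant returns only `<w:p>`, `<w:r>` and `<w:t>`.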

I was looking at PHP too, but I think I want to do this in Perl and then compile it for use both in my free MS Office service (which currently just extracts text and doesn't return full Word files) and in a planned open source VB.NET Word recovery program.


Replies are listed 'Best First'.
Re: How to Truncate Corrupt Document.xml Files? by educated_foo (Vicar) on Feb 16, 2012 at 01:27 UTC
I would start by using a streaming (SAX) parser and maintaining a stack of unclosed tags. Have you tried that yet?
Re^2: How to Truncate Corrupt Document.xml Files? by socrtwo (Sexton) on Feb 16, 2012 at 02:11 UTC
I haven't tried that yet. Thanks for the heads-up. I'm looking at streaming SAX parsing now. I see the Ruby gem Nokogiri may be well suited for this, but there are a lot of SAX modules in Perl, and I don't know anything about Ruby at the moment, whereas I know a little Perl.
Re^3: How to Truncate Corrupt Document.xml Files? by educated_foo (Vicar) on Feb 16, 2012 at 02:28 UTC
I don't parse much XML (thank God), but XML::Parser (originally written by Larry Wall) has always been pretty straightforward to use -- just define `Start()` and `End()` handlers for a start.
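
A rough sketch of what that could look like, assuming XML::Parser is installed from CPAN. The helper name `unclosed_elements` and the sample input are hypothetical. The handlers keep a stack of open elements, and since `parse` dies at the first well-formedness error, wrapping it in `eval` leaves the stack describing what was still open at that point.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: return the elements still open when parsing
# stopped, plus the parser's error message (empty if the input was fine).
sub unclosed_elements {
    my ($xml) = @_;
    my @stack;

    my $parser = XML::Parser->new(
        Handlers => {
            # Handlers receive the Expat object first, then the element name.
            Start => sub { my ($expat, $element) = @_; push @stack, $element },
            End   => sub { pop @stack },
        },
    );

    # parse() dies on malformed input; the handlers have already run for
    # every complete tag before the error.
    my $ok = eval { $parser->parse($xml); 1 };
    my $error = $ok ? '' : $@;

    return (\@stack, $error);
}

my ($open, $error) =
    unclosed_elements('<w:document><w:body><w:p><w:r><w:t>Hello wor');
print "still open: @$open\n";
print 'closing tags: ', join('', map { "</$_>" } reverse @$open), "\n";
```

The error text that XML::Parser dies with reports a line, column, and byte position, which might serve as the point at which to truncate the file before appending the closing tags.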
Re^4: How to Truncate Corrupt Document.xml Files? by socrtwo (Sexton) on Feb 16, 2012 at 04:16 UTC
Re^5: How to Truncate Corrupt Document.xml Files? by educated_foo (Vicar) on Feb 16, 2012 at 04:36 UTC