I don't think your algorithm works. Yes, it will create a well-formed XML document, but that's not the same as repairing the document. Consider the following piece of (X)HTML:
<P> foo <SPAN> bar baz <EM> qux </EM> <EM> quux </EM> </P>
The </SPAN> tag is missing. Your algorithm will place it right in front of the </P>. That repairs the document to well-formedness (and, in the case of (X)HTML, even to validity). But you don't know whether the </SPAN> really belongs there. Perhaps only the 'bar' was supposed to be inside the SPAN. Or maybe the first EM element belonged inside it, but not the second. Or perhaps the document uses a special DTD that doesn't allow EM to appear inside SPAN, in which case placing </SPAN> before </P> would be very wrong.
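To make the ambiguity concrete, here is a minimal sketch in Python (not the algorithm under discussion). It assumes three hand-written candidate repairs of the snippet above; the helper names is_well_formed and em_inside_span are made up for this illustration. Every candidate passes a well-formedness check, yet each one puts a different number of EM elements inside the SPAN, so the parser cannot tell you which repair was intended.

```python
import xml.etree.ElementTree as ET

# The malformed snippet from above: the </SPAN> tag is missing.
BROKEN = "<P> foo <SPAN> bar baz <EM> qux </EM> <EM> quux </EM> </P>"

# Three hypothetical repairs, written by hand for illustration.
CANDIDATES = [
    # Only 'bar' belongs to the SPAN.
    "<P> foo <SPAN> bar </SPAN> baz <EM> qux </EM> <EM> quux </EM> </P>",
    # The first EM belongs to the SPAN, the second does not.
    "<P> foo <SPAN> bar baz <EM> qux </EM> </SPAN> <EM> quux </EM> </P>",
    # </SPAN> placed right in front of </P>.
    "<P> foo <SPAN> bar baz <EM> qux </EM> <EM> quux </EM> </SPAN> </P>",
]


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def em_inside_span(text):
    """Count the EM elements that are direct children of the SPAN."""
    return len(ET.fromstring(text).find("SPAN").findall("EM"))


if __name__ == "__main__":
    print("broken is well-formed:", is_well_formed(BROKEN))
    for candidate in CANDIDATES:
        print(is_well_formed(candidate), em_inside_span(candidate), candidate)
```

A well-formedness check accepts all three candidates equally, which is the point: it says nothing about which structure the author meant, and without the DTD it cannot say whether any of them is valid.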

If you have no way of verifying that the result is correct - heck, you can't even verify whether the resulting document is syntactically valid - I'd advise you to leave the document as is. Then even the most basic check (for well-formedness) will flag the document as incorrect. Otherwise, you end up with a document that appears to be correct, but you've no way of knowing whether it is. Of course, that raises the question: if you don't have the DTD, how useful is the document, and why is it being considered for "repair"?


In reply to Re^2: Repair malformed XML by Anonymous Monk
in thread Repair malformed XML by spoulson
