in reply to Repair malformed XML

Well, reporting missing closing tags is trivial. For each type of element, just count the number of opening tags and the number of closing tags. If they are equal, no closing tags are missing (assuming no opening tags are missing). Otherwise, the difference is the number of closing tags missing.
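
The counting approach can be sketched in a few lines. This is only an illustration in Python, not code from the thread: the `TAG_RE` pattern and the `tag_imbalance` name are assumptions, and the regex deliberately ignores comments, CDATA sections and processing instructions.

```python
import re
from collections import Counter

# Rough tag matcher (an assumption, not a real XML tokenizer):
# group 1 = "/" for a closing tag, group 2 = element name,
# group 3 = "/" for a self-closing tag such as <br/>.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w:.\-]*)[^<>]*?(/?)>")

def tag_imbalance(text):
    """Return {element name: opening count - closing count} for unbalanced names."""
    opened, closed = Counter(), Counter()
    for m in TAG_RE.finditer(text):
        is_close, name, self_closing = m.groups()
        if self_closing:
            continue  # <x/> opens and closes itself
        if is_close:
            closed[name] += 1
        else:
            opened[name] += 1
    return {name: opened[name] - closed[name]
            for name in set(opened) | set(closed)
            if opened[name] != closed[name]}

print(tag_imbalance("<document><text>Some text</text><text>more</document>"))
```

A positive number means that many closing tags are missing for that element. Note that the count says nothing about where they are missing, and a mis-nested document can still come out balanced.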

As for repairing -- without a DTD, it's going to be heuristics. And I'm not going to suggest any heuristics based on a tiny sample (643 bytes out of 80 MB, about 0.00077%) of the file.

Re^2: Repair malformed XML
by legato (Monk) on Feb 03, 2005 at 16:23 UTC

    I agree that this is largely a guess, but there is one relatively simple heuristic that might actually help this case. Well-formed XML documents may nest tags, but can't have an inner tag close after the enclosing tag. For example:

    <document><text>Some text</text></document> <!-- Valid -->
    <document><text>Some text</document></text> <!-- INVALID -->

    So, an algorithm that makes sure nested tags are closed before the enclosing tags is a good step, and if the sample above is representative such a step will likely go a long way toward solving the problem.
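
A check for that nesting rule might look like the following Python sketch. It only detects problems and repairs nothing; `TAG_RE` and `find_nesting_errors` are made-up names, and the regex is a rough stand-in for a real tokenizer.

```python
import re

# Rough tag matcher (an assumption): "/" prefix, element name, "/" suffix.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w:.\-]*)[^<>]*?(/?)>")

def find_nesting_errors(text):
    """Return (errors, still_open).

    errors is a list of (offset, closing name, name expected on top of the stack);
    still_open is the list of elements left unclosed at the end of the input.
    """
    stack, errors = [], []
    for m in TAG_RE.finditer(text):
        is_close, name, self_closing = m.groups()
        if self_closing:
            continue
        if not is_close:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()  # properly nested close
        else:
            expected = stack[-1] if stack else None
            errors.append((m.start(), name, expected))
    return errors, stack

errors, still_open = find_nesting_errors("<document><text>Some text</document></text>")
print(errors, still_open)
```

For the second example above, this reports that `</document>` arrived while `text` was still the innermost open element.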

    Anima Legato
    .oO all things connect through the motion of the mind

      Yeah, but with that heuristic, one could immediately close any opening tag that doesn't have a corresponding closing tag (and hence promote them to empty elements). Or, by the same token, simply remove opening tags that don't have a corresponding closing tag (eliminating the element). Or you keep a stack of elements (push on open; pop on close), and if you encounter a closing tag that doesn't belong to the element on top of your stack, keep popping and closing till you find the correct one (implicitly closing elements, like HTML's P, LI and TD elements).
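
The third strategy (the stack with implicit closing) could be sketched roughly like this in Python. Two choices here are assumptions that the description above does not settle: a closing tag with no matching open element anywhere on the stack is simply dropped, and anything still open at the end of the input is closed there. `TAG_RE` and `close_implicitly` are hypothetical names.

```python
import re

# Rough tag matcher (an assumption): "/" prefix, element name, "/" suffix.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w:.\-]*)[^<>]*?(/?)>")

def close_implicitly(text):
    """Pop and close open elements until a closing tag matches the stack top."""
    out, stack, pos = [], [], 0
    for m in TAG_RE.finditer(text):
        out.append(text[pos:m.start()])  # copy the text between tags
        pos = m.end()
        is_close, name, self_closing = m.groups()
        if self_closing:
            out.append(m.group(0))
        elif not is_close:
            stack.append(name)
            out.append(m.group(0))
        elif name in stack:
            while stack[-1] != name:
                out.append("</%s>" % stack.pop())  # implicit close
            stack.pop()
            out.append(m.group(0))
        # else: stray closing tag, dropped (assumption)
    out.append(text[pos:])
    while stack:
        out.append("</%s>" % stack.pop())  # close whatever is still open
    return "".join(out)

print(close_implicitly("<document><text>Some text</document></text>"))
print(close_implicitly("<ul><li>one<li>two</ul>"))
```

The second call produces well-formed output in which the second `li` ends up nested inside the first, which is probably not what the author of that list meant.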

      Any one could be right. Or wrong. Or right sometimes, and wrong at other times. You end up with a document that is "well-formed". It may be correct, but it may not. You don't know. If you leave the document unmodified, any parser will tell you it's incorrect. That might even be a better situation.

        Not quite. If the DTD tells you, for example, that element a may contain elements b, c, or d, and that b can contain e and f, then if it looks like element a contains one b, and two e's, you can be pretty sure that the b was closed improperly (if at all), and the e's should be in b.

        There are still many possibilities for confusion. But a heuristic that started with the DTD could do quite a good job. I'm not going to pretend it would be easy and/or fun ... but in theory the information may be there that could do a good job - and, if the DTD does not allow overlaps (such as a and b both allowing d's, so that the d can either be a child or a grandchild of a), you may even be able to do a perfect job.
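
To make the idea concrete, here is a toy Python sketch that rebuilds nesting from a content model alone. It is only an illustration under strong assumptions: the `allowed` table is hand-written rather than read from a real DTD, the input is just the sequence of element names in document order, closing tags are ignored entirely, and each element is attached to the innermost open element that allows it. The name `nest_by_content_model` is made up.

```python
def nest_by_content_model(names, allowed):
    """Build a (name, children) tree from element names and an allowed-children table."""
    root = None
    stack = []  # currently open nodes, innermost last
    for name in names:
        node = (name, [])
        # Close elements until the innermost open one may contain `name`.
        while stack and name not in allowed.get(stack[-1][0], set()):
            stack.pop()
        if stack:
            stack[-1][1].append(node)
        elif root is None:
            root = node
        else:
            raise ValueError("no open element may contain %r" % name)
        stack.append(node)
    return root

# Content model from the example above: a may contain b, c, d; b may contain e, f.
allowed = {"a": {"b", "c", "d"}, "b": {"e", "f"}}
print(nest_by_content_model(["a", "b", "e", "e", "c"], allowed))
```

With this table both e's land inside b. If a and b both allowed d, the sketch would silently pick the innermost candidate, which is exactly the ambiguity described above.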

Re^2: Repair malformed XML
by spoulson (Beadle) on Feb 03, 2005 at 15:58 UTC
    If I reverse engineered a DTD, would my chances of earning my XML repair badge be better? What module is capable of validating against DTD to identify a dropped tag like this?
      If I reverse engineered a DTD, would my chances of earning my XML repair badge be better?
      Maybe. That will depend on the DTD. But how do you know that what you reverse engineer is correct? Or perhaps you reverse engineer a DTD (which may or may not be correct) that still allows ambiguous repairs. (That's not so far-fetched. Consider an HTML or XHTML document with some of the </EM> tags missing. It will not always be clear where to insert the missing tags, even if you assume they belong just before or after some other tag.)
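
The ambiguity is easy to demonstrate by brute force. The Python sketch below tries a missing closing tag at every position and keeps each result that parses as well-formed XML; the sample document and the `candidate_repairs` name are made up for the illustration, and it checks well-formedness only, not validity against any DTD.

```python
import xml.etree.ElementTree as ET

def candidate_repairs(broken, missing_tag):
    """Return every single-insertion repair of `broken` that is well-formed."""
    candidates = []
    for i in range(len(broken) + 1):
        attempt = broken[:i] + missing_tag + broken[i:]
        try:
            ET.fromstring(attempt)
        except ET.ParseError:
            continue  # this position does not give well-formed XML
        candidates.append(attempt)
    return candidates

repairs = candidate_repairs("<p><em>big news</p>", "</em>")
print(len(repairs))
for r in repairs:
    print(r)
```

Every position inside the text yields a well-formed document, and the parser cannot say which one the author intended.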

      One disadvantage of attempting a repair without knowing how to recognize a correct document is that, even if you end up with a document that is well-formed, or that even conforms to the DTD you have or reverse engineered, you do not know whether you ended up with the right document.

      Consider a Perl program from which a quote is missing. You could write a "repair" program that notices a quote is missing, and puts a quote back into the program. Now, if you just randomly inserted the quote in the program, you're likely to end up with a program that still doesn't compile. But for most programs that are missing a quote, there will be more than one place where the quote can be inserted that still gives you a compilable program. Which one should your repair program pick? How does it know it's right?