in reply to Re: Repair malformed XML
in thread Repair malformed XML

I agree that this is largely a guess, but there is one relatively simple heuristic that might actually help this case. Well-formed XML documents may nest tags, but can't have an inner tag close after the enclosing tag. For example:

<document><text>Some text</text></document> <!-- Valid --> <document><text>Some text</document></text> <!-- INVALID -->

So, an algorithm that makes sure nested tags are closed before the enclosing tags is a good step, and if the sample above is representative such a step will likely go a long way toward solving the problem.

Anima Legato
.oO all things connect through the motion of the mind

Replies are listed 'Best First'.
Re^3: Repair malformed XML
by Anonymous Monk on Feb 03, 2005 at 16:49 UTC
    Yeah, but with that heuristics, one could immediately close any open tag that doesn't have a corresponding opening tag (and hence promoting them to empty elements). Or, by the same token, simply remove openings tag that don't have a corresponding closing tag (eliminating the element). Or you keep a stack of elements (push on open; pop on close), and if you encounter a closing tag that doesn't belong to the element on top of your stack, keep popping and closing till you find a correct one (implicite closing elements, like HTML's P, LI and TD elements).

    Any one could be right. Or wrong. Or right sometimes, and wrong at other times. You end up with a document that is "well-formed". It may be correct, but it may not. You don't know. If you leave the document unmodified, any parser will tell you it's incorrect. That might even be a better situation.

      Not quite. If the DTD tells you, for example, that element a may contain elements b, c, or d, and that b can contain e and f, then if it looks like element a contains one b, and two e's, you can be pretty sure that the b was close improperly (if at all), and the e's should be in b.

      There are still many possibilities for confusion. But a heuristic that started with the DTD could do quite a good job. I'm not going to pretend it would be easy and/or fun ... but in theory the information may be there that could do a good job - and, if the DTD does not allow overlaps (such as a and b both allowing d's, so that the d can either be a child or a grandchild of a), you may even be able to do a perfect job.

        Ah, I see. You mean (X)HTML, and 'a' is P, 'b' is SPAN, 'c' is CODE, 'd' is SMALL, 'e' is EM and 'f' is STRONG. So, now you encounter:
        <P>foo <SPAN> bar baz <EM> qux </EM> <EM> quux </EM> </P>
        Now, assuming tags aren't to be inserted inside words, I still can find five places to put in </SPAN>: before 'bar', between 'bar' and 'baz', after 'baz', between the two EM elements, and before the </P> tag.

        Now, if you have a DTD that says that the only possible content of a 'b' is exactly two 'e's, you know where the missing closing tag should have been.

        Note also that if you have a DTD where you can always unambigiously deduce where a missing closing tag should have gone, the closing tag is redundant - and if it were an SGML DTD instead of an XML DTD, the closing tag would have been optional. (And that would have solved the problem instantly - the document would be conforming).