Re^2: Repair malformed XML

I agree that this is largely a guess, but there is one relatively simple heuristic that might actually help this case. Well-formed XML documents may nest tags, but can't have an inner tag close after the enclosing tag. For example:

<document><text>Some text</text></document>  <!-- Valid   -->
<document><text>Some text</document></text>  <!-- INVALID -->
[download]

So, an algorithm that makes sure nested tags are closed before the enclosing tags is a good step, and if the sample above is representative such a step will likely go a long way toward solving the problem.

Anima Legato
.oO all things connect through the motion of the mind

Comment on Re^2: Repair malformed XML Download Code

Replies are listed 'Best First'.
Re^3: Repair malformed XML by Anonymous Monk on Feb 03, 2005 at 16:49 UTC
Yeah, but with that heuristics, one could immediately close any open tag that doesn't have a corresponding opening tag (and hence promoting them to empty elements). Or, by the same token, simply remove openings tag that don't have a corresponding closing tag (eliminating the element). Or you keep a stack of elements (push on open; pop on close), and if you encounter a closing tag that doesn't belong to the element on top of your stack, keep popping and closing till you find a correct one (implicite closing elements, like HTML's `P`, `LI` and `TD` elements). Any one could be right. Or wrong. Or right sometimes, and wrong at other times. You end up with a document that is "well-formed". It may be correct, but it may not. You don't know. If you leave the document unmodified, any parser will tell you it's incorrect. That might even be a better situation.	[reply]
Re^4: Repair malformed XML by Tanktalus (Canon) on Feb 03, 2005 at 17:22 UTC
Not quite. If the DTD tells you, for example, that element a may contain elements b, c, or d, and that b can contain e and f, then if it looks like element a contains one b, and two e's, you can be pretty sure that the b was close improperly (if at all), and the e's should be in b. There are still many possibilities for confusion. But a heuristic that started with the DTD could do quite a good job. I'm not going to pretend it would be easy and/or fun ... but in theory the information may be there that could do a good job - and, if the DTD does not allow overlaps (such as a and b both allowing d's, so that the d can either be a child or a grandchild of a), you may even be able to do a perfect job.	[reply]
Re^5: Repair malformed XML by Anonymous Monk on Feb 03, 2005 at 17:38 UTC
Ah, I see. You mean (X)HTML, and 'a' is `P`, 'b' is `SPAN`, 'c' is `CODE`, 'd' is `SMALL`, 'e' is `EM` and 'f' is `STRONG`. So, now you encounter: `<P>foo <SPAN> bar baz <EM> qux </EM> <EM> quux </EM> </P>` [download] Now, assuming tags aren't to be inserted inside words, I still can find five places to put in `</SPAN>`: before 'bar', between 'bar' and 'baz', after 'baz', between the two `EM` elements, and before the `</P>` tag. Now, if you have a DTD that says that the only possible content of a 'b' is exactly two 'e's, you know where the missing closing tag should have been. Note also that if you have a DTD where you can always unambigiously deduce where a missing closing tag should have gone, the closing tag is redundant - and if it were an SGML DTD instead of an XML DTD, the closing tag would have been optional. (And that would have solved the problem instantly - the document would be conforming).	[reply] [d/l]