in reply to Repair malformed XML

Given the evidence you've shown (some examples where the markup comes out right, as well as a case where it's wrong because of a missing close tag), I think there's sufficient cause to assume that, even without an "official" DTD, you can figure out where to put the close tag. (The nature of the XML generation bug appears to be constrained in a way that allows a "heuristic, speculative" solution to do the right thing.)

So here's how I would do it: read through the file one tag at a time (set the input-record-separator to ">"), maintain a stack of the open tags as they occur, pop them off the stack as their corresponding close-tags appear, and fill in missing close-tags where necessary.

(This way of reading might be noticeably inefficient on large data sets, and there are ways to chunk through the data with fewer iterations on the diamond operator; but see if it works well enough before thinking about optimizing it.)

#!/usr/bin/perl
use strict;
use warnings;

$/ = '>';    # read up to the end of one tag at a time

my $lineno = 0;
my @tagstack;

while (<>) {
    $lineno += tr/\n//;
    unless ( />$/ ) {    # must be at eof
        print;
        next;
    }
    my ( $pre, $tag ) = ( /([^<]*)(<[^>]+?)>/ );
    if ( !defined( $tag )) {
        warn "'>' without prior '<' at or near line $lineno\n";
        print;
    }
    elsif ( $tag =~ m{^</} ) {    # close tag: look for its open tag on the stack
        unless ( @tagstack ) {
            warn "extra close tag '$tag' at or near line $lineno\n";
            print;
            next;
        }
        $tag = substr $tag, 2;
        my $stackindx = $#tagstack;
        while ( $stackindx >= 0 and $tagstack[$stackindx] ne $tag ) {
            $stackindx--;
        }
        if ( $stackindx < 0 ) {
            warn "close tag '$tag' lacks open tag at or near line $lineno\n";
            print;
            next;
        }
        print $pre;
        if ( $stackindx != $#tagstack ) {    # add close tags as needed
            while ( $stackindx < $#tagstack ) {
                warn "added '</$tagstack[$#tagstack]>' at line $lineno\n";
                printf "</%s>\n", pop @tagstack;
                $lineno++;
            }
        }
        print "</$tag>";
        pop @tagstack;
    }
    elsif ( $tag =~ m{^<!} or $tag =~ m{/$} ) {    # "comment" or empty tag
        print;
    }
    else {    # this must be an open tag -- push it on the stack
        $tag =~ s/<(\S+).*/$1/s;
        push @tagstack, $tag;
        print;
    }
}

I tested this on the examples you gave -- the good ones came out unaltered, and the bad one had the close-tag added where needed.

Re^2: Repair malformed XML
by mirod (Canon) on Feb 04, 2005 at 09:11 UTC

    So really you want to write a quasi-XML parser. The problem is that it doesn't parse enough of XML: if you look at the data, you will see a lot of CDATA sections. This means that when you use > as the input record separator, you are likely to hit one in the middle of a CDATA section if you come across a filename that includes a '>'. A filename like /Documents/some file><.pdf will trip your code.

    So if you want your hand-rolled parser to really work, you will have to take that case into account. This can be done, of course: you will have to take the string you have read, remove complete CDATA sections from it, and then figure out whether you are still in a CDATA section.
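
    A minimal sketch of that CDATA check, to make the idea concrete. The helper names read_chunk and in_open_cdata are hypothetical (they are not part of the script above), and the sketch assumes that the literal string <![CDATA[ never appears inside a comment or processing instruction:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $s ends inside an unterminated CDATA section.
sub in_open_cdata {
    my ($s) = @_;
    $s =~ s/<!\[CDATA\[.*?\]\]>//gs;         # remove complete CDATA sections
    return index( $s, '<![CDATA[' ) >= 0;    # is an opener still left over?
}

# Hypothetical helper: read one '>'-terminated chunk from $fh, and keep
# appending further chunks while the buffer ends inside a CDATA section.
sub read_chunk {
    my ($fh) = @_;
    local $/ = '>';
    my $buf = <$fh>;
    return unless defined $buf;
    while ( in_open_cdata($buf) ) {
        my $more = <$fh>;
        last unless defined $more;    # EOF inside CDATA: return what we have
        $buf .= $more;
    }
    return $buf;
}

# Demo on an in-memory string that uses the troublesome filename.
my $xml = '<file><![CDATA[/Documents/some file><.pdf]]></file>';
open my $fh, '<', \$xml or die "cannot open in-memory handle: $!";
while ( defined( my $chunk = read_chunk($fh) ) ) {
    print "[$chunk]\n";
}
close $fh;
```

    With this in place the whole CDATA section arrives as a single chunk instead of being split at the '>' inside the filename. The tag-stack loop would still have to recognize such a chunk and print it unchanged, and this does nothing for a '>' inside a comment or an attribute value.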

    My point is that it is not easy to deal with even that rather simple case. You end up having to write something that is closer to a real XML parser. Actually, something trickier than a real XML parser, as the XML spec clearly states that parsers can die after they find any error in the XML. So you are now trying to write a recovering XML parser... or you could just use libxml's one; I am sure Daniel Veillard has spent more time working on this than anyone here would ;--)

      No, he wants to write an XML tokenizer. Which would do the trick - that is, it would implement his algorithm. (An algorithm that cannot be guaranteed to be correct.)
Re^2: Repair malformed XML
by Anonymous Monk on Feb 04, 2005 at 11:38 UTC
    I don't think your algorithm works. Yes, it will create a well-formed XML document, but that's not the same as repairing the document. Consider the following piece of (X)HTML:
    <P> foo <SPAN> bar baz <EM> qux </EM> <EM> quux </EM> </P>
    The </SPAN> tag is missing. Your algorithm will place it right in front of the </P>. It will repair the document to well-formedness (and in the case of (X)HTML, even to a valid document). But you don't know whether the </SPAN> really belongs there. Perhaps only the 'bar' was supposed to be inside the SPAN. Or maybe the first, but not the second, EM element belonged. Or perhaps it was a special DTD, that doesn't allow EM to appear inside SPAN. Then placing </SPAN> before </P> would be very wrong.
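
    To make the ambiguity concrete, here is a small sketch that checks three different placements of </SPAN> with a naive tag-balance test. The is_balanced helper is hypothetical and deliberately crude: it assumes plain open and close tags only (no empty tags, comments, or CDATA), and it checks nesting, not validity against any DTD.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if every open tag is closed in proper nesting order.
sub is_balanced {
    my ($doc) = @_;
    my @stack;
    while ( $doc =~ m{<(/?)(\w+)[^>]*>}g ) {
        my ( $close, $name ) = ( $1, $2 );
        if ($close) {
            # a close tag must match the most recently opened tag
            return 0 unless @stack and pop(@stack) eq $name;
        }
        else {
            push @stack, $name;
        }
    }
    return @stack ? 0 : 1;    # leftover open tags mean "not balanced"
}

# Three plausible "repairs" of the same broken fragment.
my @candidates = (
    '<P> foo <SPAN> bar </SPAN> baz <EM> qux </EM> <EM> quux </EM> </P>',
    '<P> foo <SPAN> bar baz <EM> qux </EM> </SPAN> <EM> quux </EM> </P>',
    '<P> foo <SPAN> bar baz <EM> qux </EM> <EM> quux </EM> </SPAN> </P>',
);

for my $doc (@candidates) {
    print is_balanced($doc) ? "balanced:     " : "NOT balanced: ", "$doc\n";
}
```

    All three candidates pass the balance test, while the original fragment fails it. Well-formedness alone therefore cannot tell you which repair the author intended; the stack-based script always produces the third one.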

    If you have no way of verifying that the result is correct - heck, you can't even verify whether the resulting document is syntactically valid - I'd advise you to leave the document as is. Then even the most basic check (for well-formedness) will flag the document as incorrect. Otherwise, you end up with a document that appears to be correct, but you have no way of knowing. Of course, that raises the question: if you don't have the DTD, how useful is the document, and why is it being considered for "repair"?

      Hi, I found this conversation very interesting. Did you have any further thoughts on the repair problem (without having a DTD)?