in reply to Repair malformed XML

I would definitely give XML::LibXML a try. The underlying libxml2 library comes with a nice command-line tool, xmllint, which can work wonders if used correctly. On the other hand, if you want Perl, you should experiment with setting the recover flag of the XML::LibXML::Parser object to true. Although the manual states that it is for parsing HTML, it serves, as far as I can tell, just as well for parsing ill-formed XML.

The quick and dirty hack below could repair your badly formatted XML snippet (after adding the missing namespace declarations):

use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();
$parser->recover(1);                       # keep parsing after well-formedness errors
my $doc = $parser->parse_file($ARGV[0]);   # file name is taken from the command line
print $doc->toString(1);                   # 1 = indented output

Note, however, that I am not entirely sure it always guesses right when adding the missing closing tags back, so I would not rely on this feature...

rg0now

Re^2: Repair malformed XML
by spoulson (Beadle) on Feb 03, 2005 at 20:54 UTC
    I stand corrected about the size limitation. Upon further testing, the problem is not the size but the encoding. The XML file is Unicode, declared with encoding="iso-10646-ucs-2". If I convert it to ASCII and set encoding="UTF-8", LibXML parses it fine.
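A minimal sketch of that conversion step, for illustration only. The helper name to_utf8_xml is hypothetical, and it assumes the input bytes start with a byte-order mark so that Encode's 'UTF-16' decoder can detect the byte order (it dies otherwise). It re-encodes to UTF-8 rather than strict ASCII and rewrites the encoding pseudo-attribute to match.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes the raw bytes of a UCS-2/UTF-16 XML file
# (with BOM) and returns UTF-8 bytes with a matching XML declaration.
sub to_utf8_xml {
    my ($bytes) = @_;

    # 'UTF-16' reads the BOM to pick the byte order and strips it.
    my $text = decode('UTF-16', $bytes);

    # Rewrite only the encoding pseudo-attribute of the XML declaration.
    $text =~ s/(<\?xml[^>]*?encoding\s*=\s*)(["'])[^"']*\2/${1}${2}UTF-8${2}/;

    return encode('UTF-8', $text);
}
```

The returned bytes can then be written to a temporary file, or handed to the parser as a string, before attempting recovery.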

    Unfortunately, the output of your script becomes mangled after a few thousand lines. It begins to output only the Text nodes, with no tags, CDATA sections, etc. Strange.

    While I haven't discovered a generalized and automated method that works, I've managed to get by with a simple procedural rule of inserting </o:version> tags before </cc:files> if not already present. Then I convert back to Unicode and the XML can be parsed.
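A rough sketch of such a rule, not the exact code I used. The helper name close_versions is made up for illustration, and it assumes that o:version elements never nest, are never self-closing, and that the document has already been decoded into a Perl string.

```perl
use strict;
use warnings;

# Hypothetical helper: insert a missing </o:version> before </cc:files>.
# Assumes o:version elements do not nest and are never self-closing.
sub close_versions {
    my ($xml) = @_;
    my $open = 0;    # true while an <o:version> still awaits its end tag

    # Visit only the three tags of interest, in document order.
    $xml =~ s{(<o:version\b[^>]*>|</o:version>|</cc:files>)}{
        my $tag = $1;
        my $out = $tag;
        if    ($tag =~ /^<o:version/)  { $open = 1 }
        elsif ($tag eq '</o:version>') { $open = 0 }
        elsif ($open)                  { $out = "</o:version>$tag"; $open = 0 }
        $out;
    }ge;

    return $xml;
}
```

Because the state flag is reset on every explicit </o:version>, input that is already well-formed passes through unchanged.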

Re^2: Repair malformed XML
by spoulson (Beadle) on Feb 03, 2005 at 16:55 UTC
    I was unaware of the recover property. Your code example worked great on a test XML file with a missing tag.

    However, it appears I've reached a size limitation on the LibXML library. Both xmllint and the Perl code indicate problems parsing corrupt data:

    my.xml:85: parser error : expected '>'
    tentclasses>True</s:closedexpectæ"?æ??æO?æ¼?æ,?ç??æ"?æ,?ç??æO?æ°?æ,?çO +?çO?æ"?çO?

    I've toyed with XML::Parser some more. I've given it simple handlers to print the tags that are parsed, but XML::Parser croaks when it detects the missing tag, without first allowing a handler to override it.
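For reference, a minimal sketch of the kind of handler setup described here. The helper name tags_until_error is hypothetical. It wraps the parse in eval so the script survives the failure, but that only captures the error: expat stops at the first well-formedness error and does not resume.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns a reference to the list of element names
# seen before the parser gave up, plus the error message (undef on success).
sub tags_until_error {
    my ($xml) = @_;
    my @seen;

    my $parser = XML::Parser->new(
        Handlers => {
            # Start handlers receive the Expat object, then the element name.
            Start => sub { my ($expat, $name) = @_; push @seen, $name },
        },
    );

    # parse() dies on the first well-formedness error; trap it with eval.
    my $ok    = eval { $parser->parse($xml); 1 };
    my $error = $ok ? undef : $@;

    return (\@seen, $error);
}
```

The error message includes the line and column where expat stopped, which at least helps locate the missing tag.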

    Is there something I'm missing?

      You don't say what version of perl you're using. My first attempt to use XML::Twig was with perl 5.6, and it died a horrible death ... simply upgrading perl to 5.8.1 was sufficient to handle the reading/writing of XML that I was doing with no other changes (same level of XML::Twig, my code unchanged). If you're not using 5.8 for XML handling, I highly suggest it.

        Thanks for the suggestion. Yes, I am running ActiveState 5.8.6 on WinXP. I'll have a look at XML::Twig, as well.
      I am a little lost here. You told us that the only problem with your XML is that it has some unclosed tags. XML::LibXML::Parser's recover flag will handle that, as the manual says:

      "The recover mode helps to recover documents that are almost wellformed very efficiently. That is for example a document that forgets to close the document tag (or any other tag inside the document)."

      Now you seem to indicate that some tags in your XML are corrupt. Well, I do not really know how to handle that one...

      Also, I do not think that you hit some obscure size limitations of XML::LibXML (you seem to get the error at the 85th input line).

        Sorry if I was unclear. The XML data is not corrupt. It appears that LibXML cannot load an 80MB XML without corrupting its own data. When I search within the XML, I do not find the offending parser error on line 85, or anywhere in the file.

        So, I believe it to be a size limitation that causes internal memory management issues. Why 85th line? Maybe a pointer wrapped and happened to clobber the 85th line. Who knows. :/