in reply to Re: Repair malformed XML
in thread Repair malformed XML

I was unaware of the recover property. Your code example worked great on a test XML file with a missing tag.

However, it appears I've hit a size limitation in the LibXML library. Both xmllint and the Perl code report a parse error and show what looks like corrupt data:

my.xml:85: parser error : expected '>'
tentclasses>True</s:closedexpectæ"?æ??æO?æ¼?æ,?ç??æ"?æ,?ç??æO?æ°?æ,?çO +?çO?æ"?çO?

I've toyed with XML::Parser some more. I gave it simple handlers that print each tag as it is parsed, but XML::Parser croaks when it detects the missing tag, without first giving a handler the chance to override the error.
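
For reference, a minimal sketch of the behaviour described above, using a made-up two-element sample rather than the real file. XML::Parser is built on expat, which treats well-formedness errors as fatal, so as far as I know the most a script can do is trap the die with eval and keep whatever the handlers collected before the error. The sample string and the @seen array are illustrative assumptions.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample: <item> is never closed.
my $broken = '<root><item>text</root>';

# Collect the names of the start tags that the parser reports.
my @seen;
my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $tag) = @_;
            push @seen, $tag;
        },
    },
);

# parse() dies on the first well-formedness error, so trap it with eval.
my $ok = eval { $parser->parse($broken); 1 };

print "start tags seen before the error: @seen\n";
print "parser error: $@" unless $ok;
```

The handlers still run for everything before the error, so this can at least show how far the parser got, but it does not repair anything.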

Is there something I'm missing?

Re^3: Repair malformed XML
by Tanktalus (Canon) on Feb 03, 2005 at 17:25 UTC

    You don't say what version of perl you're using. My first attempt to use XML::Twig was with perl 5.6, and it died a horrible death ... simply upgrading perl to 5.8.1 was enough to handle the reading/writing of XML that I was doing, with no other changes (same level of XML::Twig, my code unchanged). If you're not using 5.8 for XML handling, I strongly recommend upgrading.

      Thanks for the suggestion. Yes, I am running ActiveState 5.8.6 on WinXP. I'll have a look at XML::Twig, as well.
Re^3: Repair malformed XML
by rg0now (Chaplain) on Feb 03, 2005 at 17:38 UTC
    I am a little lost here. You told us that the only problem with your XML is that it has some unclosed tags. XML::LibXML::Parser's recover flag will handle that, as the manual says:

    "The recover mode helps to recover documents that are almost wellformed very efficiently. That is for example a document that forgets to close the document tag (or any other tag inside the document)."
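
A minimal sketch of what recover mode looks like in use, assuming XML::LibXML is installed. The sample string is made up for illustration, and the exact shape of the repaired tree depends on the underlying libxml2, so treat the printed output as something to inspect rather than rely on.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample: <item> is never closed.
my $broken = '<root><item>text</root>';

my $parser = XML::LibXML->new;
# 1 turns recovery on; the parser may still print the errors it
# recovered from to STDERR.
$parser->recover(1);

# Even in recover mode the parse can fail outright, so trap that case.
my $doc = eval { $parser->parse_string($broken) };

if ($doc) {
    # Serialize whatever tree the parser managed to build.
    print $doc->toString, "\n";
}
else {
    warn "could not recover: $@";
}
```

For a real file, parse_file would take the place of parse_string.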

    Now, you seem to indicate that some tags in your XML are corrupt. Well, I do not really know how to handle that one...

    Also, I do not think that you hit some obscure size limitations of XML::LibXML (you seem to get the error at the 85th input line).

      Sorry if I was unclear. The XML data is not corrupt. It appears that LibXML cannot load an 80 MB XML file without corrupting its own data. When I search within the XML, I do not find the text shown in the parser error on line 85, or anywhere else in the file.

      So, I believe it to be a size limitation that causes internal memory management issues. Why 85th line? Maybe a pointer wrapped and happened to clobber the 85th line. Who knows. :/
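
One way to double-check whether the text in the error message is really absent from the file is to look at the raw bytes of the reported line rather than searching for the garbled characters, which may not survive a terminal or editor round trip. The sketch below uses only core Perl; raw_line and hex_preview are hypothetical helper names, and it assumes the parser counts lines by newline characters.

```perl
use strict;
use warnings;

# Return the raw bytes of line $want (1-based) from $path,
# or undef if the file has fewer lines than that.
sub raw_line {
    my ($path, $want) = @_;
    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    while (my $line = <$fh>) {
        return $line if $. == $want;
    }
    return;
}

# Render the first $max bytes as hex, followed by an ASCII view in
# which every non-printable byte is shown as a dot.
sub hex_preview {
    my ($bytes, $max) = @_;
    my $chunk = substr $bytes, 0, $max;
    my $hex   = join ' ', map { sprintf('%02x', ord($_)) } split //, $chunk;
    (my $ascii = $chunk) =~ tr/\x20-\x7e/./c;
    return "$hex  |$ascii|";
}

# Usage: perl peek.pl my.xml 85
my ($file, $line_no) = @ARGV;
if (defined $file and defined $line_no) {
    my $line = raw_line($file, $line_no);
    if (defined $line) {
        printf "line %d is %d bytes long\n", $line_no, length $line;
        print hex_preview($line, 64), "\n";
    }
    else {
        print "file has fewer than $line_no lines\n";
    }
}
```

If the line turns out to be very long, the error position could be far into it, so the first 64 bytes are only a starting point.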