XML::LibXML is actually a better choice than XML::Checker::Parser: it is faster, better maintained and SAX-compliant. It also has an HTML parser, which might help if the malformed XML you receive happens to be some sort of HTML.
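To make that concrete, here is a minimal sketch of trying a strict XML parse first and falling back to the HTML parser only when that fails. The sub name `parse_xml_or_html` and the sample input are made up for illustration; `load_xml`, `load_html` and the `recover` option are XML::LibXML's own.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns (DOM, 'xml') if $markup is well-formed XML,
# (DOM, 'html') if only the forgiving HTML parser could cope with it,
# and (undef, 'unparseable') if both attempts fail.
sub parse_xml_or_html {
    my ($markup) = @_;

    # Strict XML parse: load_xml dies on a well-formedness error.
    my $dom = eval { XML::LibXML->load_xml(string => $markup) };
    return ($dom, 'xml') if $dom;

    # HTML fallback: recover => 2 asks libxml2 to recover silently.
    $dom = eval { XML::LibXML->load_html(string => $markup, recover => 2) };
    return ($dom, 'html') if $dom;

    return (undef, 'unparseable');
}

# Sample input (an assumption): tag soup that is not well-formed XML.
my ($dom, $how) = parse_xml_or_html('<p>unclosed <b>bold<br></p>');
print "parsed as $how\n";
```

Whatever comes out of the HTML branch is libxml2's best guess at the structure, so treat it as data to inspect, not as something you can trust blindly.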

In general though, you are going down a dangerous path. There is a reason why the XML spec requires that, once a conforming XML processor detects a fatal error, it "must not continue normal processing (i.e., it must not continue to pass character data and information about the document's logical structure to the application in the normal way)". See 1.2 Terminology in the annotated XML spec; Tim Bray's comment about it is also instructive.

By accepting non-conformant XML into the system you will create all sorts of problems down the line, most of which are impossible to fix programmatically. I know it is not always easy to tell customers, or other departments of your company, that you can't accept what they send you, but the XML spec is there to back you up, and to get them (and you!) to do The Right Thing (tm).

If you really have to accept non-conformant XML, you should not expect an XML parser to deal with it (it won't!). Try to code a pre-processing step, which won't rely on XML tools, to convert the data to well-formed XML. From there you can then use XML tools to convert it to valid (i.e. conformant to your DTD) XML. Check the data after this pre-processing and build the rest of your process with XML tools. Writing the pre-processing step will be Hell, but it will at least isolate the, pardon my French, crap they send you from your XML process.
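A rough sketch of what such a pre-processing step could look like, under heavy assumptions: the two repairs shown (escaping bare ampersands, dropping forbidden control characters) are only examples of typical breakage, and the sub names `preprocess` and `to_dom` are invented here. The repairs use plain regexes, and XML::LibXML only comes in afterwards as the gate that checks the result.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical repairs; the real list depends on what the senders get wrong.
sub preprocess {
    my ($raw) = @_;

    # Escape ampersands that do not start an entity or character reference.
    $raw =~ s/&(?!(?:[A-Za-z_][\w.-]*|#\d+|#x[0-9A-Fa-f]+);)/&amp;/g;

    # Drop control characters that XML 1.0 forbids (tab, LF, CR are allowed).
    $raw =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]//g;

    return $raw;
}

# Gate: only data that parses as well-formed XML moves on to the XML tools.
sub to_dom {
    my ($raw) = @_;
    my $dom = eval { XML::LibXML->load_xml(string => preprocess($raw)) };
    die "still not well-formed after pre-processing: $@" unless $dom;
    return $dom;
}

# Sample input (an assumption): a bare ampersand and a stray control character.
my $dom = to_dom("<item>Fish & Chips\x01</item>");
print $dom->documentElement->textContent, "\n";
```

Anything that still fails the gate gets rejected or set aside for a human to look at; it never reaches the rest of the process. Once you have a DOM, checking it against your DTD is a job for the XML tools again (XML::LibXML documents have `validate` and `is_valid` methods for that).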

Good luck!


In reply to Re: Re: Re: XML::Checker::Parser by mirod
in thread XML::Checker::Parser by rubric
