clinton has asked for the wisdom of the Perl Monks concerning the following question:

Hi All

I receive XML files from my client with a list of orders. The XML (over which I have no control) may or may not contain some dodgy characters, which would cause XML parsing to fail.

However, I would like to process all the orders that I can, and report errors on the dodgy ones.

My understanding is that, if I use

$parser = XML::LibXML->new();
$doc    = $parser->parse_file( $xmlfilename );
then the file will be parsed quickly, but the parse will either succeed or fail in its entirety.
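
For reference, a minimal sketch (the variable names and filename are assumptions, not from the post) of wrapping that strict parse in an eval so the all-or-nothing failure can be caught and handed to a fallback:

use strict;
use warnings;
use XML::LibXML;

my $xmlfilename = 'orders.xml';    # assumed filename
my $parser      = XML::LibXML->new();
my $doc         = eval { $parser->parse_file($xmlfilename) };
if ($@) {
    warn "strict parse failed: $@";
    # fall back to per-order parsing or a more forgiving parser here
}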

I was thinking about using this as the first method, for speed, and, if it fails, resorting to something like:

$parser = XML::LibXML->new();
local $/ = '</order>';
open( FH, '<:utf8', $filename ) or die $!;
while ( my $order = <FH> ) {
    # strip everything before the opening <order> tag
    $order =~ s/^.*?<order>/<order>/gs;
    my $xml = <<XML;
<?xml version="1.0" encoding="UTF-8"?>
<orders>
$order
</orders>
XML
    my $doc = eval { $parser->parse_string($xml) };
    if ($@) {
        warn("error : $@");
        next;
    }
    process_orders($doc);
}
Or should I be creating one master document and importing/adopting nodes (roughly as sketched below)? Or a different approach entirely? Thanks
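
A rough sketch of that master-document idea (an illustration, not code from the thread): each <order> chunk is parsed on its own and only the good ones are adopted into a single document. The read_order_chunks() helper is hypothetical (it would split the file on </order> much as the loop above does), and process_orders() is the routine named in the post.

use strict;
use warnings;
use XML::LibXML;

my $filename = 'orders.xml';    # assumed filename
my $parser   = XML::LibXML->new();

# build an empty master document with an <orders> root
my $master = XML::LibXML::Document->new( '1.0', 'UTF-8' );
my $root   = $master->createElement('orders');
$master->setDocumentElement($root);

# read_order_chunks() is a hypothetical helper returning one
# <order>...</order> string per order in the file
for my $chunk ( read_order_chunks($filename) ) {
    my $doc = eval { $parser->parse_string($chunk) };
    if ($@) {
        warn "skipping bad order: $@";
        next;
    }
    # adoptNode() moves the parsed <order> element into the master document
    $root->appendChild( $master->adoptNode( $doc->documentElement ) );
}

process_orders($master);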

Re: Parsing dodgy XML
by mirod (Canon) on Sep 20, 2006 at 14:54 UTC

    Did you look at XML::Liberal? It looks like it would do what you want in a transparent way (don't know about the performance hit though).
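
    A sketch of how XML::Liberal is typically used, following its documented synopsis (the file name here is a placeholder): it wraps XML::LibXML and tries to repair common well-formedness problems before parsing.

        use strict;
        use warnings;
        use XML::Liberal;

        # slurp the whole (possibly dodgy) file; 'orders.xml' is a placeholder
        my $xml = do {
            open my $fh, '<', 'orders.xml' or die $!;
            local $/;
            <$fh>;
        };

        my $parser = XML::Liberal->new('LibXML');
        my $doc    = $parser->parse_string($xml);   # returns an XML::LibXML::Document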

      XML::Liberal looks sweet, and it correctly interpreted my dodgy file.

      It uses XML::LibXML as a base, and seems fast enough. How much speed can you ask for when you have to correct dumb avoidable errors?

      My only concern is that it is alpha and warns that it is liable to change, but I reckon it is usable, and the interface to the underlying XML::LibXML methods is the same, so it is probably a safe bet.

      Also, if I try the strict parsing first and then fall back to XML::Liberal (roughly as sketched below), I'll probably be OK.
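
      That fallback might look something like this (a sketch, not code from the thread): parse strictly with XML::LibXML, and only pay for XML::Liberal's repair pass when the strict parse dies.

          use strict;
          use warnings;
          use XML::LibXML;
          use XML::Liberal;

          sub parse_orders {
              my ($xml) = @_;
              # fast path: strict parse
              my $doc = eval { XML::LibXML->new->parse_string($xml) };
              return $doc if $doc;
              warn "strict parse failed, retrying liberally: $@";
              # slow path: let XML::Liberal repair and re-parse
              return XML::Liberal->new('LibXML')->parse_string($xml);
          }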

      many thanks

Re: Parsing dodgy XML
by shmem (Chancellor) on Sep 20, 2006 at 14:55 UTC
    As a different approach, merlyn describes his use of HTML::Parser for parsing XML with dodgy characters in his column The Wrong Parser for the Right Reasons (Jun 03). Maybe it's useful for you.
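
    As a rough illustration of that approach (not merlyn's code from the column), HTML::Parser can be run in xml_mode, where it tolerates input a strict XML parser would reject, and the orders can then be rebuilt from the event stream; handle_order() is a hypothetical callback and the file name is a placeholder.

        use strict;
        use warnings;
        use HTML::Parser;

        my @current;    # accumulates text for the <order> being read

        my $p = HTML::Parser->new(
            api_version => 3,
            xml_mode    => 1,    # case-sensitive tags, XML-style empty elements
            start_h     => [ sub { @current = () if $_[0] eq 'order' }, 'tagname' ],
            text_h      => [ sub { push @current, $_[0] },              'dtext'   ],
            # handle_order() is hypothetical: do whatever per-order processing you need
            end_h       => [ sub { handle_order(@current) if $_[0] eq 'order' }, 'tagname' ],
        );

        $p->parse_file('orders.xml');    # placeholder file name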

    --shmem

Re: Parsing dodgy XML
by merlyn (Sage) on Sep 20, 2006 at 15:26 UTC
    Although you can work around this, the proper answer is to keep pushing back to the client to say "I will need an XML file from you... please let me know when you have an XML file".

    Any file that is "mostly XML" is not XML. It's like being "mostly pregnant": there's no such thing.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      Likelihood of refusing bad XML tends towards zero as desperation to win new business tends towards infinity....

      I will be giving them an XSD and a small script to test validation, but if they (and 'they' are a national newspaper) use the current XML, then getting them to change it may prove difficult.
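
      Such a validation script could be as small as the sketch below (an illustration, not the script actually sent; the schema and file names come from the command line): XML::LibXML::Schema dies with the first violation it finds.

          use strict;
          use warnings;
          use XML::LibXML;

          my ( $xsd, $file ) = @ARGV;
          die "usage: $0 schema.xsd orders.xml\n" unless $xsd && $file;

          my $schema = XML::LibXML::Schema->new( location => $xsd );
          my $doc    = XML::LibXML->new->parse_file($file);

          # validate() dies with a description of the first problem it hits
          eval { $schema->validate($doc) };
          die "validation FAILED: $@" if $@;
          print "$file validates against $xsd\n";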

OT: Parsing dodgy XML
by astroboy (Chaplain) on Sep 21, 2006 at 09:45 UTC

    I was on a project with another supplier who kept on giving me dodgy XML. It got to the point that I wrote the XSD and said to them "don't send me the XML unless it validates against this." Well, after a year-and-a-half they still haven't been able to sort their XML out.

    The bright side is that last week they were dumped by our mutual customer (and it looks like I'm going to get awarded their project). I can only assume that these companies are building XML with print statements. In my case, the other supplier defined the XML structure for the project, but they couldn't even follow their own definitions.