in reply to Dealing with Malformed XML

Hey, thanks for the comments on XML::Twig, it's always nice to see a happy user!

As for your problem, this is what the XML spec has to say about what an XML processor should do when encountering a fatal error:

fatal error

And I can't resist giving you Tim Bray's comments in The Annotated XML Specification:

This innocent-looking definition embodies one of the most important and unprecedented aspects of XML: "Draconian" error-handling. Dracon (c.659-c.601 B.C.E.) introduced the first written legislation to Athens. His code was consistent in that it decreed the death penalty for crimes both low and high. Similarly, a conforming XML processor must "not continue normal processing" once it detects a fatal error. Phrases used to amplify this wording have included "halt and catch fire", "barf", "flush the document down the toilet", and "penalize innocent end-users".

And if you think the sentence "the processor may continue processing the data to search for further errors and may report such errors to the application" gives you a glimmer of hope, well, XML::Parser (and thus XML::Twig) chooses to just die as soon as an error is encountered.

Of course you can use eval to catch the death (as you seem to have done your homework ;--) you have most likely read the XML::Parser docs), and in the latest development version of XML::Twig I added the safe_parse and safe_parsefile methods to take care of this for you. But in any case parsing will _not_ resume after the first error. That's the XML way.
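Here is a minimal sketch of both approaches, assuming a version of XML::Twig recent enough to provide safe_parse. The sample document and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input: the bare & makes this document not well-formed.
my $bad_xml = '<doc><title>Johnson & Johnson</title></doc>';

# Option 1: wrap parse() in eval to catch the die yourself.
my $twig = XML::Twig->new;
my $ok   = eval { $twig->parse($bad_xml); 1 };
my $eval_error = $ok ? '' : $@;
print "eval caught: $eval_error\n" unless $ok;

# Option 2: safe_parse does the eval for you. It returns a false value
# on failure and leaves the error message in $@.
my $safe_twig = XML::Twig->new;
my $safe_ok   = $safe_twig->safe_parse($bad_xml);
print "safe_parse failed: $@\n" unless $safe_ok;
```

Either way you only learn about the first error; the rest of the document is not parsed.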

So yes, you should write a pre-filter to handle those entities.

The main problems are & and <. > rarely needs to be escaped (strictly speaking, only when it appears in the sequence ]]> in content), so it should not be a huge problem.

& is fairly easy: it should be replaced by &amp; unless it already starts an entity or character reference. So the following regexp will do:

s{&
  (?!(?: \w+               # regular text entity
       | \#\d+             # decimal character entity
       | \#x[a-fA-F0-9]+   # hexa character entity
     );
  )
 }{&amp;}gx
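To see how such a substitution behaves, here is a small self-contained sketch. The helper name escape_stray_amps and the sample string are made up for illustration. Note that under the /x modifier a literal # has to be written \#, otherwise the rest of the line is treated as a comment.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every & that is not followed by something
# that looks like an entity or character reference.
sub escape_stray_amps {
    my ($text) = @_;
    $text =~ s{&
               (?!                      # not followed by...
                 (?: \w+                #   a named entity
                   | \#\d+              #   a decimal character reference
                   | \#x[a-fA-F0-9]+    #   a hexadecimal character reference
                 )
                 ;                      # ...and its closing semicolon
               )
              }{&amp;}gx;
    return $text;
}

print escape_stray_amps('AT&T &amp; R&D, &#38; &#x26;'), "\n";
# prints: AT&amp;T &amp; R&amp;D, &#38; &#x26;
```

The negative lookahead only inspects what follows the &, so existing references such as &amp; are left untouched and running the filter twice does no harm.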

This should get rid of most of the unwanted &s, except the occasional "I like Johnson&Johnson; I just wish they did a better job with the Jets" where the & will not get replaced and you will get an error on the unknown entity &Johnson; when parsing.

< can be a nastier problem. I should know, as I've had to deal with half-assed conversions which left a bunch of them lying around ;--( In my documents, all the tags were written <tag, with no space between < and the tag name, and most of the time a stray < was followed either by a space or by a number (this is actually valid in SGML, so I could not even blame the conversion!). So I ended up with the following substitution:

s{<(?=[\s\d])}{&lt;}g;
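A quick sketch of that heuristic on a sample line; the helper name escape_stray_lts and the input are made up for illustration. It uses a lookahead, (?=...), so the space or digit after the < is only checked and stays in the text.

```perl
use strict;
use warnings;

# Hypothetical helper: escape a < only when it is followed by whitespace
# or a digit, on the assumption that real tags are always written <tag.
sub escape_stray_lts {
    my ($text) = @_;
    $text =~ s{<(?=[\s\d])}{&lt;}g;
    return $text;
}

print escape_stray_lts('<p>if a < b and x <5</p>'), "\n";
# prints: <p>if a &lt; b and x &lt;5</p>
```

This only works if your data follows the same convention mine did; a stray < directly followed by a letter would still be taken for the start of a tag.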

Normally this should do, but let us know if you have problems with >, " and '.

Re^2: Dealing with Malformed XML
by vili (Monk) on Mar 02, 2005 at 23:42 UTC
    I am having major problems with '. And while I agree that the problems should be handled by the originator of the XML, it is impeding my progress, and if there is a way to handle the single smurfin' quote while using XML::Parser, I'd love to know it.