Re: Dealing with Malformed XML

Hey, thanks for the comments on XML::Twig, it's always nice to see a happy user!

As for your problem, this is what the XML spec has to say about what an XML processor should do when encountering a fatal error:

fatal error

Definition: An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document's logical structure to the application in the normal way).

And I can't resist giving you Tim Bray's comments in The Annotated XML Specification:

This innocent-looking definition embodies one of the most important and unprecedented aspects of XML: "Draconian" error-handling. Dracon (c.659-c.601 B.C.E.) introduced the first written legislation to Athens. His code was consistent in that it decreed the death penalty for crimes both low and high. Similarly, a conforming XML processor must "not continue normal processing" once it detects a fatal error. Phrases used to amplify this wording have included "halt and catch fire", "barf", "flush the document down the toilet", and "penalize innocent end-users".

And if you think the sentence "the processor may continue processing the data to search for further errors and may report such errors to the application" gives you a glimmer of hope, well XML::Parser (and thus XML::Twig) chooses to just die as soon as an error is encountered.

Of course you can use eval to catch the death (as you seem to have done your homework ;--) you have most likely read about it in the XML::Parser docs), and in the latest development version of XML::Twig I added the safe_parse and safe_parsefile methods to take care of this for you. But in any case parsing will _not_ resume after the first error. That's the XML way.
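Here is a minimal sketch of the eval approach, assuming XML::Twig is installed. The sample document and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample: the bare & makes this document not well-formed.
my $bad_xml = '<doc><p>Johnson & Johnson</p></doc>';

my $twig = XML::Twig->new;

# parse() dies on the first fatal error, so wrap it in eval.
my $ok = eval { $twig->parse($bad_xml); 1 };

if ($ok) {
    print "parsed fine\n";
}
else {
    # $@ holds the parser's message, including where it gave up.
    print "parse failed: $@";
}
```

With a version of XML::Twig that has it, $twig->safe_parse($bad_xml) should do the same thing: it wraps the eval for you and returns false on failure, leaving the message in $@. Either way you only ever get the first error.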

So yes, you should write a pre-filter to handle those entities.

The main problems are & and <. > is much less of a concern: the only place it has to be escaped is the sequence ]]> in content, so it should not be a huge problem.

& is fairly easy: it should be replaced by &amp; except when it already starts an entity. So the following regexp will do:

s{&
   (?!(?:\w+               # regular text entity
        |\#\d+             # decimal character entity (# is escaped because of /x)
        |\#x[a-fA-F0-9]+)  # hexa character entity
      ;)
  }{&amp;}gx;

This should get rid of most of the unwanted &s, except the occasional "I like Johnson&Johnson; I just wish they did a better job with the Jets" where the & will not get replaced and you will get an error on the unknown entity &Johnson; when parsing.
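To make the behaviour concrete, here is a small sketch of that filter wrapped in a sub. The name escape_bare_amps is made up for this example. It uses a negative lookahead, and # is written \# because under /x a bare # would start a comment.

```perl
use strict;
use warnings;

# Escape every & that does not already start an entity or character reference.
# escape_bare_amps is a name invented for this example.
sub escape_bare_amps {
    my ($text) = @_;
    $text =~ s{&
               (?!(?:\w+               # named entity
                    |\#\d+             # decimal character reference
                    |\#x[a-fA-F0-9]+)  # hexadecimal character reference
                  ;)
              }{&amp;}gx;
    return $text;
}

# The bare &s get escaped, the three existing references are left alone.
print escape_bare_amps('AT&T, R&D &amp; &#169; &#xA9;'), "\n";

# Unchanged: &Johnson; looks like an entity, so the parser will still choke.
print escape_bare_amps('I like Johnson&Johnson; I just wish'), "\n";
```

The second call shows the limitation: the filter cannot tell a real entity from text that merely looks like one, so those cases have to be caught some other way (for instance by checking names against the list of entities you actually declare).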

< can be a nastier problem. I should know, as I've had to deal with half-assed conversions which left a bunch of them lying around ;--( In my documents all the tags were written <tag with no space between < and tag, and most of the time a stray < was followed either by a space or by a number (actually this is valid in SGML so I could not even blame the conversion!), so I ended up with the following substitution:

s{<(?=[\s\d])}{&lt;}g;
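As a quick check, here is that idea as a sub, written with a lookahead (?=...) so the following character is not consumed, and with &lt; as the replacement. The name escape_stray_lt is hypothetical, and the rule only holds under the assumption above that real tags never have a space or a digit right after the <.

```perl
use strict;
use warnings;

# Escape a < only when a space or a digit follows it; real tags are left alone.
# escape_stray_lt is a name invented for this example.
sub escape_stray_lt {
    my ($text) = @_;
    $text =~ s{<(?=[\s\d])}{&lt;}g;
    return $text;
}

print escape_stray_lt('<p>if a < b and c <5 then</p>'), "\n";
# prints: <p>if a &lt; b and c &lt;5 then</p>
```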

Normally this should do, but let us know if you have problems with >, " and '.


Re^2: Dealing with Malformed XML by vili (Monk) on Mar 02, 2005 at 23:42 UTC
I am having major problems with '. And while I agree that the problems should be handled by the originator of the XML, it is impeding my progress, and if there is a way to handle the single smurfin' quote while using XML::Parser, I'd love to know it.