in reply to Dealing with Malformed XML
Hey, thanks for the comments on XML::Twig, it's always nice to see a happy user!
As for your problem, this is what the XML spec has to say about what an XML processor should do when encountering a fatal error:
fatal error
And I can't resist giving you Tim Bray's comments (in The Annotated XML Specification):
This innocent-looking definition embodies one of the most important and unprecedented aspects of XML: "Draconian" error-handling. Dracon (c.659-c.601 B.C.E.) introduced the first written legislation to Athens. His code was consistent in that it decreed the death penalty for crimes both low and high. Similarly, a conforming XML processor must "not continue normal processing" once it detects a fatal error. Phrases used to amplify this wording have included "halt and catch fire", "barf", "flush the document down the toilet", and "penalize innocent end-users".
And if you think the sentence "the processor may continue processing the data to search for further errors and may report such errors to the application" gives you a glimmer of hope, well XML::Parser (and thus XML::Twig) chooses to just die as soon as an error is encountered.
Of course you can use eval to catch the death (as you seem to have done your homework ;--) you have most likely read XML::Parser), and in the latest development version of XML::Twig I added the safe_parse and safe_parsefile methods to take care of this for you. But in any case parsing will _not_ resume after the first error. That's the XML way.
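For what it's worth, here is a minimal sketch of the eval approach. The helper name xml_error is made up for this example, and it assumes XML::Twig is installed; the only XML::Twig calls used are new and parse. It only tells you whether the document is well-formed and what the first error was; it does not resume parsing.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Twig): returns nothing (undef in
# scalar context) when $xml is well-formed, otherwise the message the
# parser died with.
sub xml_error {
    my ($xml) = @_;
    require XML::Twig;    # loaded lazily; assumes the module is installed
    my $twig = XML::Twig->new;
    # parse dies on the first fatal error, so trap the death with eval
    return if eval { $twig->parse($xml); 1 };
    return $@;
}
```

Calling xml_error('<doc>Johnson & Johnson</doc>') should hand back the parser's complaint about the bare &, while a well-formed string gives undef.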
So yes, you should write a pre-filter to handle those entities.
The main problems are & and <; > only has to be escaped when it appears in the sequence ]]> in content, which should not be a huge problem.
& is fairly easy: it should be replaced by &amp; except when it already starts an entity. So the following regexp will do (note that # has to be escaped under /x, or it starts a comment):
    s{&(?!(?:\w+               # regular text entity
          |\#\d+               # decimal character entity
          |\#x[a-fA-F0-9]+     # hexa character entity
          );)
     }{&amp;}gx;
This should get rid of most of the unwanted &s, except the occasional "I like Johnson&Johnson; I just wish they did a better job with the Jets", where the & will not get replaced and you will get an error on the unknown entity &Johnson; when parsing.
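As a rough sketch, here is that substitution wrapped in a helper (the name escape_bare_ampersands is made up for this example). It replaces & by &amp; unless the & already starts something that looks like an entity, and the second call shows the Johnson&Johnson; case slipping through:

```perl
use strict;
use warnings;

# Hypothetical helper: escape every & that does not already start
# something shaped like an entity or a character reference.
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s{&(?!(?:\w+               # regular text entity
                    |\#\d+              # decimal character entity
                    |\#x[a-fA-F0-9]+    # hexa character entity
                    );)
              }{&amp;}gx;
    return $text;
}

# Bare ampersands are escaped, existing entities are left alone.
print escape_bare_ampersands('AT&T &amp; co &#233; &#xE9;'), "\n";

# The false negative: &Johnson; looks like an entity, so it is not touched.
print escape_bare_ampersands('I like Johnson&Johnson; really'), "\n";
```

Note that \w+ is only an approximation of what a legal entity name can contain, so treat this as a heuristic and not as a validator.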
< can be a nastier problem. I should know, as I've had to deal with half-assed conversions which left a bunch of them lying around ;--( What helped is that in my documents all tags were written <tag, with no space between < and tag, and most of the time a stray < was followed either by a space or by a number (actually this is valid in SGML, so I could not even blame the conversion!), so I ended up with the following substitution:
    s{<(?=[\s\d])}{&lt;}g;
Normally this should do, but let us know if you have problems with >, " and '.
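Here is a small sketch of the < filter in action (escape_stray_lt is a made-up helper name). It assumes, as in my documents, that real tags never have whitespace or a digit right after the <, and it uses a lookahead so the following character is kept:

```perl
use strict;
use warnings;

# Hypothetical helper: escape a < only when it is followed by whitespace
# or a digit, i.e. when it cannot be the start of a tag in these documents.
sub escape_stray_lt {
    my ($text) = @_;
    # (?=...) is a lookahead: it checks the next character without consuming it
    $text =~ s{<(?=[\s\d])}{&lt;}g;
    return $text;
}

print escape_stray_lt('<p>if a < b and x <5</p>'), "\n";
```

The tags <p> and </p> are left alone because neither p nor / is whitespace or a digit. A < at the very end of the text, or one followed by a letter, would still get through.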
Re^2: Dealing with Malformed XML
by vili (Monk) on Mar 02, 2005 at 23:42 UTC