Coyote has asked for the wisdom of the Perl Monks concerning the following question:

Venerable Denizens of the Monastery --

I am working on a project to take scanned articles from an academic journal which have been marked up in XML by hand, and do nifty things with the documents (e.g., produce HTML, TeX, and PDF output; pull abstracts from the articles; cross-reference the articles; and so on). The articles have APA-style references such as (author1 & author2, 1999) embedded in the text, and the occasional article has a bare > or < as part of the text. I am using mirod's excellent XML::Twig module to parse and process the data. As you would expect, the parser dies when it comes across an &, >, or <. For example, say I have the following code, with an XML sample embedded as a here-doc --

use strict;
use warnings;
use XML::Twig;

my $xml = <<XML;
<reference>
    <authors>
        <author>Lindsley, O.R.</author>
    </authors>
    <year>1992</year>
    <title>Precision teaching: Discoveries & effects.</title>
    <source>Journal of Applied Behavior Analysis</source>
    <volume>25</volume>
    <pages>51-57</pages>
</reference>
XML

my $twig = new XML::Twig();
$twig->parse($xml);
The parser dies with this error message --

not well-formed at line 8, column 50, byte 198 at C:/Perl/site/lib/XML/Parser.pm line 168

Now I know where the parser died, and a quick examination of the XML turns up a bare &, which is a malformed entity. I would like to be able to correct this error while the document is parsing; however, I can't seem to find an option in either the XML::Parser or XML::Twig man pages that would let me handle the error, or at least report it and continue parsing, so that I could gather all the problems with the document in one pass. Am I using the right tool for the job with XML::Twig? Should I write some sort of preprocessor or filter to fix these problems before passing the data to XML::Twig? Has someone already written a module to do this? Any advice will be appreciated. I would rather not get into the XML parser writing business if I don't have to (see On XML parsing).

--
Coyote

Re: Dealing with Malformed XML
by mirod (Canon) on Jan 09, 2001 at 12:35 UTC

    Hey, thanks for the comments on XML::Twig, it's always nice to see a happy user!

    As for your problem, this is what the XML spec has to say about what an XML processor should do when encountering a fatal error:

    fatal error

    • Definition: An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document's logical structure to the application in the normal way).

    And I can't resist giving you Tim Bray's comments in The Annotated XML Specification:

    This innocent-looking definition embodies one of the most important and unprecedented aspects of XML: "Draconian" error-handling. Dracon (c.659-c.601 B.C.E.) introduced the first written legislation to Athens. His code was consistent in that it decreed the death penalty for crimes both low and high. Similarly, a conforming XML processor must "not continue normal processing" once it detects a fatal error. Phrases used to amplify this wording have included "halt and catch fire", "barf", "flush the document down the toilet", and "penalize innocent end-users".

    And if you think the sentence "the processor may continue processing the data to search for further errors and may report such errors to the application" gives you a glimmer of hope, well, XML::Parser (and thus XML::Twig) chooses to just die as soon as an error is encountered.

    Of course you can use eval to catch the death (as you seem to have done your homework ;--) you have most likely read the XML::Parser docs), and in the latest development version of XML::Twig I added the safe_parse and safe_parsefile methods to take care of this for you. But in any case parsing will _not_ resume after the first error. That's the XML way.
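    For completeness, here is a minimal sketch of how that looks in practice (it assumes a version of XML::Twig that has safe_parse; the sample document is made up):

    use strict;
    use warnings;
    use XML::Twig;

    # a deliberately malformed sample: the bare & is not a valid entity
    my $xml = '<doc><title>Discoveries & effects.</title></doc>';

    my $twig = XML::Twig->new();
    if ($twig->safe_parse($xml)) {
        print "parsed OK\n";
    }
    else {
        # safe_parse traps the die; the parser's message lands in $@
        warn "parse failed: $@";
    }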

    So yes, you should write a pre-filter to handle those entities.

    The main problems are & and <; > only needs to be escaped in attribute values, which should not be a huge problem.

    & is fairly easy: it should be replaced by &amp; except when it is already used in an entity. So the following regexp will do:

    s{&(?!(  \w+              # regular text entity
           | \#\d+            # decimal character entity
           | \#x[a-fA-F0-9]+  # hexa character entity
           );)
     }{&amp;}gx;

    (The #s inside the pattern are escaped so that the /x modifier doesn't treat the rest of each alternative as a comment.)

    This should get rid of most of the unwanted &s, except the occasional "I like Johnson&Johnson; I just wish they did a better job with the Jets", where the & will not get replaced and you will get an error on the unknown entity &Johnson; when parsing.
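    A quick demo of the regexp's behavior on a made-up string:

    my $text = 'Discoveries & effects. A&amp;B stays, &#38; stays, Johnson&Johnson; slips through.';
    $text =~ s{&(?!(  \w+              # regular text entity
                    | \#\d+            # decimal character entity
                    | \#x[a-fA-F0-9]+  # hexa character entity
                    );)
              }{&amp;}gx;
    print $text, "\n";
    # prints: Discoveries &amp; effects. A&amp;B stays, &#38; stays, Johnson&Johnson; slips through.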

    < can be a nastier problem. I should know, as I've had to deal with half-ass conversions which left a bunch of them lying around ;--( In my documents all the tags were written <tag, with no space between the < and the tag name, and most of the time a bare < was followed either by a space or by a number (actually this is valid in SGML, so I could not even blame the conversion!), so I ended up with the following substitution:

    s{<(?=[\s\d])}{&lt;}g;
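    For instance, on a made-up line of text (the lookahead leaves the space or digit after the < in place, so nothing is lost):

    my $text = 'scores of < 5 are <em>low</em>';
    $text =~ s{<(?=[\s\d])}{&lt;}g;
    print $text, "\n";   # prints: scores of &lt; 5 are <em>low</em>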

    Normally this should do, but let us know if you have problems with >, " and '.

      I am having major problems with '. And while I agree that the problems should be handled by the originator of the XML, it is impeding my progress, and if there is a way to handle the single smurfin' quote while using XML::Parser, I'd love to know it.
(Ovid) Re: Dealing with Malformed XML
by Ovid (Cardinal) on Jan 09, 2001 at 05:25 UTC
    No offense, but you're approaching it the wrong way. If you get bad data into a system, it's almost always preferable to go to the source of the data and correct the error there. If 'A' produces garbage and 'B' has to correct for that garbage, someone is going to come behind you eventually and have to maintain 'B'. If 'A' continuously puts out more garbage through human error, bad data into 'A', or whatever, then your method would be to continuously hack 'B' when 'B' is not the source of the problem.

    This is a Bad Thing. Fix the problems where they occur, not later down the road. Who knows? Maybe 'A' will eventually pass data to 'C' as well. Then you have garbage being spread to multiple places, and the garbage filters will have to be maintained independently of one another (unless some pointy-haired boss decides on a central garbage management system rather than clean up the mess). Code reuse then becomes impeded because the situation wasn't resolved properly the first time. But isn't that part of what XML was designed to avoid?

    Cheers,
    Ovid


      I agree completely. The responsibility for making sure that the data is correct and well-formed falls upon the people generating the data. I've already addressed this issue with user training and a filter to encode entities before the data entry people mark up the articles.

      Unfortunately, I inherited this project after about 400 articles had already been scanned and marked up.

      ---- Coyote

(tye)Re: Dealing with Malformed XML
by tye (Sage) on Jan 09, 2001 at 05:33 UTC

    Do something like this before passing your text to the XML parser?

    s/&(\W|$)/&amp;$1/g;
    s/<([^\/\w]|$)/&lt;$1/g;
    s/(^|\W)>/$1&gt;/g;

    (The / inside the character class is escaped so that it doesn't end the pattern.)
    But I think Re: Maximum parsing depth with XML::Parser? probably does a better job of this and implies that the greater-thans aren't a problem.

    I recall a module like this for HTML. It would find common mistakes (like unquoted attributes) and fix them. Something like that would be even more useful as a module for XML since the spec says to reject invalid input.

            - tye (but my friends call me "Tye")
      Thanks for the pointers. I don't think the <![CDATA[ ... ]]> solution detailed in Re: Maximum parsing depth with XML::Parser? is the right approach for this task. The &, <, and > characters should be entities in this instance.

      ---- Coyote (aka: Rich Anderson)

Re: Dealing with Malformed XML
by myocom (Deacon) on Jan 09, 2001 at 05:19 UTC

    Always make sure you're passing a well-formed document to an XML parser, since the spec requires parsers to reject malformed documents. So in your case, you'll want to use something that escapes your &'s, for example.

Re: Dealing with Malformed XML
by Coyote (Deacon) on Jan 10, 2001 at 00:21 UTC
    Thanks to everyone who replied to this message.

    After giving the problem a bit more thought, it occurred to me that allowing the XML parser to ignore errors and continue processing makes no more sense than allowing the Perl interpreter to continue when it finds a syntax error. Moreover, allowing the XML parser to continue would lead to many of the same problems that we currently have with HTML. Permissive HTML parsers, such as the one used by IE, that allow improperly nested tags, incomplete documents, unclosed tags, and so on lead HTML designers to create documents that are usable only by the broken parser and foster bad programming/design habits. I would hate to see that happen with XML, so I will not contribute to the problem by either writing an XML::Preprocessor module or adding this functionality to any sort of production system I create.

    As far as the solution to my problem goes, I wrote a small filter to take care of the & characters before passing the XML doc to the parser. I decided to ignore the bare > and < characters since their presence may indicate either a problem with tags in the document or a legitimate part of the document text.
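    For reference, a minimal sketch of such a pre-filter (assuming the documents are small enough to slurp whole, and reusing mirod's entity-aware regexp from above) might look like this:

    use strict;
    use warnings;

    # sketch of a pre-filter: read an XML file, escape bare &s,
    # and write the cleaned document to STDOUT
    my $file = shift @ARGV or die "usage: $0 file.xml\n";
    open my $fh, '<', $file or die "can't open $file: $!";
    my $xml = do { local $/; <$fh> };   # slurp the whole document
    close $fh;

    $xml =~ s{&(?!(  \w+              # regular text entity
                   | \#\d+            # decimal character entity
                   | \#x[a-fA-F0-9]+  # hexa character entity
                   );)
             }{&amp;}gx;

    print $xml;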

    Once again, thanks for the insight.

    ----
    Coyote

(jptxs)Re: Dealing with Malformed XML
by jptxs (Curate) on Jan 09, 2001 at 20:28 UTC
    You could eval all your calls to the XML parser and then, if there are errors, examine the source more closely with regexen and the like. That way, you only end up parsing a document twice when it actually has problems, and you don't have to bend XML laws to get what you need.
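    A rough sketch of that approach (it assumes XML::Parser's usual "not well-formed at line L, column C" message format; the sample document is made up):

    use strict;
    use warnings;
    use XML::Twig;

    my $xml = "<doc>\n<title>Q&A about teaching</title>\n</doc>\n";   # bad: bare &

    my $twig = XML::Twig->new();
    eval { $twig->parse($xml) };
    if (my $err = $@) {
        # XML::Parser errors look like "not well-formed at line 2, column 8 ..."
        if ($err =~ /at line (\d+), column (\d+)/) {
            my ($line, $col) = ($1, $2);
            my $bad = (split /\n/, $xml)[$line - 1];
            warn "problem near line $line, column $col:\n$bad\n";
            # ...examine $bad with regexen, patch it up, and reparse
        }
        else {
            die $err;   # some other failure; re-throw it
        }
    }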
    "A man's maturity -- consists in having found again the seriousness one had as a child, at play." --Nietzsche