Nocturnus has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am seeking advice on how to parse XML / HTML files when the following special requirements are given:

So far, I have tried four different parsers (XML::SAX, XML::LibXML::SAX, XML::Parser, XML::LibXML::Reader) and read a lot about other parsers I possibly could use, but all failed or seem inappropriate in one respect or another.

I hope that somebody can give me a hint on how to achieve the goals described above. To make the most important issues clear, here is a short example. Suppose there is the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<documentroottag>
<tagonly attrib1="foo" attrib2="bar" />
<tagwithtext> (note the space) This is normal character data with one defined (&lt;) and one undefined (&undefinedentity;).</tagwithtext>
<!-- I am a comment -->
</documentroottag>

Given that document:

At first, the parser must throw an event telling me that there is non-character data and providing access to the respective unmodified, unparsed string. Alternatively, the parser could be more precise and tell me that it has found an XML declaration; but in this case as well, the parser must provide access to the respective unmodified, unparsed string.

The next event I am interested in is after the closing > of the XML declaration; the state changes from non-character data to character data here (the character data in this case is only a newline char, though). Once again, the parser must throw an event which tells me about that change of state, and must provide access to the unmodified, unparsed original character data.

The same applies analogously to every line of the document.

Regarding the bytes enclosed by <tagwithtext> and </tagwithtext>, the parser does not need to resolve internal or external entities and must not throw errors about undefined entities; in every case, even if it successfully has resolved the entities, it must provide access to the original, unparsed characters, including all entities in unresolved (original) form.

The same applies analogously to all possible syntactic blocks of XML documents (attribute declarations, processing instructions, CDATA, ...): The parser must inform me about any change from character data to non-character data and vice versa; it does not need to, but is allowed to, inform me about exactly what sort of non-character data is being parsed at the moment. In every case, I need access to the original, unmodified data.

I hope I have described precisely what I would like to do, and I hope somebody could help me with that problem.

Thank you very much,

Nocturnus


Replies are listed 'Best First'.
Re: Seeking for advice: XML parsing with special requirements (regex)
by tye (Sage) on Apr 22, 2012 at 18:18 UTC

    I'm not sure, but it sounds like you might have invalid XML which means a "real" XML parser isn't going to work.

    But, in any case, since you've already spent a ton of time fighting against a variety of full-fledged parsers, I'd just write my own parser. It actually is quite easy to write a real parser for whatever subset of XML one has to deal with. And that makes it trivial to deal with unusual things (that might not even strictly be valid XML) and trivial to get access to whatever matters to you.

    Re^2: parsing XML fragments (xml log files) with... a regex shows how easy it was for me to deal with the types of XML I ran into. And the code is trivial to extend to cover more parts of XML to meet your needs.

    Note that, as written, my code expects the full XML string. But it would be easy to modify it to read just a reasonably large chunk of text and, when pos gets halfway through (or when an unclosed < is encountered), to strip what has been parsed so far and append more.
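
To illustrate the hand-rolled approach discussed above, here is a minimal sketch of a regex tokenizer built on \G and /gc. It is not the code from the linked node; the function name tokenize_xml and the event kinds are made up for this illustration. It assumes the whole document is in one string, does no well-formedness checking, never resolves entities, and handles an internal DTD subset only crudely (up to the first closing bracket).

```perl
use strict;
use warnings;

# Split an XML string into events of the form [ kind => raw_string ].
# Every event carries the original, unmodified substring, so joining
# all raw strings in order reproduces the input exactly.
sub tokenize_xml {
    my ($xml) = @_;
    my @events;
    pos($xml) = 0;
    while (pos($xml) < length($xml)) {
        if ($xml =~ /\G(<!--.*?-->)/gcs) {
            push @events, [ comment => $1 ];
        }
        elsif ($xml =~ /\G(<!\[CDATA\[.*?\]\]>)/gcs) {
            push @events, [ cdata => $1 ];
        }
        elsif ($xml =~ /\G(<\?.*?\?>)/gcs) {
            # Also matches the XML declaration.
            push @events, [ pi => $1 ];
        }
        elsif ($xml =~ /\G(<!(?:[^<>"'\[]|"[^"]*"|'[^']*'|\[.*?\])*>)/gcs) {
            # DOCTYPE and similar; the [...] part is matched crudely.
            push @events, [ declaration => $1 ];
        }
        elsif ($xml =~ /\G(<(?:[^<>"']|"[^"]*"|'[^']*')*>)/gcs) {
            # Start tags, end tags and empty-element tags.
            push @events, [ tag => $1 ];
        }
        elsif ($xml =~ /\G([^<]+)/gc) {
            # Character data; entity references stay unresolved.
            push @events, [ text => $1 ];
        }
        else {
            # A '<' that starts nothing we recognize: hand back the rest.
            push @events, [ unparsed => substr($xml, pos($xml)) ];
            last;
        }
    }
    return @events;
}

# Usage: print each event with newlines made visible.
my $doc = qq{<?xml version="1.0"?>\n<r a="1">x &lt; y &undefinedentity;</r>\n};
for my $event (tokenize_xml($doc)) {
    my ($kind, $raw) = @$event;
    (my $shown = $raw) =~ s/\n/\\n/g;
    printf "%-12s|%s|\n", $kind, $shown;
}
```

Each change between a "text" event and any other kind corresponds to a change between character data and non-character data. Encodings, namespaces and the full grammar of DTD declarations are deliberately left out.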

    - tye        

      Thank you very much for your reply and helpful comments.

      You are right in that my XML might be invalid in the rigid sense of the specification. Nevertheless, it's well-balanced and does not have any problems except the ones which are related to the entities.

      Regarding writing my own parser: I have thought about that, since I wrote some simple parsers for other tasks some time ago. But after looking into the XML specification, I have come to the conclusion that it is not possible to write a full XML parser (even excluding entity resolution) in reasonable time.

      In fact, to achieve what I need, I have to use a full XML parser with all bells and whistles: Think of encoding, namespaces, the various sorts of declarations (attribute, element, ...), CDATA, PIs, and so on.

      Probably it would be easier to modify one of the existing reliable parsers.

      Thanks again,

      Nocturnus

Re: Seeking for advice: XML parsing with special requirements
by Anonymous Monk on Apr 22, 2012 at 10:14 UTC

    So far, I have tried four different parsers (XML::SAX, XML::LibXML::SAX, XML::Parser, XML::LibXML::Reader) and read a lot about other parsers I possibly could use, but all failed or seem inappropriate in one respect or another.

    AFAIK, XML::Parser fits your requirements for sure

      Thank you very much for bothering!

      I had some problems with XML::Parser:

      If it sees unresolvable entities (which I admit is formally an error in the XML document), it calls the default handler regardless of what handlers you have installed. This makes things more difficult, but I could live with it (I already had changed my code accordingly).

      The disqualifier is: In a handler, you get the original (unparsed) string by invoking the underlying expat instance via

      $_[0] -> original_string

      or

      $_[0] -> recognized_string

      That would be nice and easy in the first place, but in some cases there is only rubbish in the respective string; this is true for nearly all of the declaration blocks (for example doctype declarations and attribute declarations). The expat documentation explicitly confirms this observation; unfortunately, it's something I can't live with.

      As far as I know, XML::Parser is always based on expat, but perhaps I have misunderstood something. If so, I would be grateful if somebody could show me how to use XML::Parser with another underlying parser.

      Thank you very much,

      Nocturnus

Re: Seeking for advice: XML parsing with special requirements
by Jenda (Abbot) on Apr 23, 2012 at 08:31 UTC

    What about some preprocessing? If the undefined entities are the only problem with one of the parsers you've tried, you might try to first scan the XML for entities by a simple regexp, build a list of entities, "define" them and add a reference to the definition to the XML header. Not a perfect solution, but ...

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      Thank you very much for your answer.

      Unfortunately, preprocessing would be complicated. I would have to store the position of every entity, and, after having parsed the document, re-insert the entities because I need most of the document in original form.

      Regards,

      Nocturnus

        What if you define them like this:

        <!DOCTYPE videocollection [ <!ENTITY SF "&amp;SF;"> ]>
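
The two ideas above (scan for entity references with a simple regexp, then declare each one so that it expands to its own literal spelling) could be combined roughly as follows. This is only a sketch: the function names are hypothetical, the scan also picks up references inside comments and CDATA sections, it cannot tell whether an entity is already declared elsewhere, and merging the result into a document that already has a DOCTYPE is not handled.

```perl
use strict;
use warnings;

# Entities predefined by XML itself; they never need a declaration.
my %PREDEFINED = map { $_ => 1 } qw(lt gt amp apos quot);

# Collect the names of all general entity references (&name;) in $xml,
# in order of first appearance, skipping the predefined ones.
# Numeric character references such as &#38; do not match the pattern.
sub referenced_entity_names {
    my ($xml) = @_;
    my (%seen, @names);
    while ($xml =~ /&([A-Za-z_:][\w:.-]*);/g) {
        next if $PREDEFINED{$1} || $seen{$1}++;
        push @names, $1;
    }
    return @names;
}

# Build a DOCTYPE with an internal subset that maps every entity to its
# own spelling, e.g. <!ENTITY SF "&amp;SF;">, so that a parser hands
# back the text "&SF;" wherever &SF; is referenced.
sub self_mapping_doctype {
    my ($root, @names) = @_;
    my $decls = join '', map { qq{  <!ENTITY $_ "&amp;$_;">\n} } @names;
    return "<!DOCTYPE $root [\n$decls]>";
}

# Usage with the root element name from the example above.
my $body = '<videocollection><title>Alien &SF; &lt;1979&gt;</title></videocollection>';
print self_mapping_doctype('videocollection', referenced_entity_names($body)), "\n";
```

With such declarations in place, no positions have to be stored and nothing has to be re-inserted afterwards, because the expanded text is identical to the original reference.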

        Jenda