Nocturnus has asked for the wisdom of the Perl Monks concerning the following question:
Dear Monks,
I am seeking advice on how to parse XML / HTML files when the following special requirements are given:
So far, I have tried four different parsers (XML::SAX, XML::LibXML::SAX, XML::Parser, XML::LibXML::Reader) and read a lot about other parsers I could possibly use, but all of them failed or seem inappropriate in one respect or another.
I hope somebody can give me a hint on how to achieve the goals described above. To make the most important issues clear, here is a short example. Suppose there is the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<documentroottag>
<tagonly attrib1="foo" attrib2="bar" />
<tagwithtext> (note the space) This is normal character data with one defined (&lt;) and one undefined (&undefinedentity;).</tagwithtext>
<!-- I am a comment -->
</documentroottag>
Given that document:
First, the parser must throw an event telling me that there is non-character data and providing access to the respective unmodified, unparsed string. Alternatively, the parser could be more precise and tell me that it has found an XML declaration; but in this case as well, the parser must provide access to the respective unmodified, unparsed string.
The next event I am interested in is after the closing > of the XML declaration; the state changes from non-character data to character data here (the character data in this case is only a newline char, though). Once again, the parser must throw an event which tells me about that change of state, and must provide access to the unmodified, unparsed original character data.
The same applies, analogously, to every other line of the document.
Regarding the bytes enclosed by <tagwithtext> and </tagwithtext>, the parser does not need to resolve internal or external entities and must not throw errors about undefined entities; in every case, even if it has successfully resolved the entities, it must provide access to the original, unparsed characters, including all entities in unresolved (original) form.
The same applies, analogously, to all possible syntactic blocks of XML documents (attribute declarations, processing instructions, CDATA, ...): The parser must inform me about any change from character data to non-character data and vice versa; it does not need to, but is allowed to, inform me about exactly what sort of non-character data is being parsed at the moment. In every case, I need access to the original, unmodified data.
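To make the desired event stream concrete, here is a minimal sketch in plain Perl. It is not one of the parsers named above and not a conforming XML parser; it is only a regex tokenizer that reports each switch between markup and character data and hands the raw, untouched source text to a callback. The function name `tokenize_raw` and the event names (`xmldecl`, `doctype`, `tag`, `comment`, `cdata`, `pi`, `text`) are made up for this illustration, and the DOCTYPE pattern is a simplification that assumes no `]` or `>` surprises inside an internal subset.

```perl
#!/usr/bin/perl
use 5.010;    # named captures and %+ need Perl 5.10 or later
use strict;
use warnings;

# Hypothetical helper (not part of any CPAN module): split a document into
# alternating markup and character-data segments, keeping the raw text.
sub tokenize_raw {
    my ($doc, $callback) = @_;

    # One alternative per kind of markup. Order matters: comments, CDATA
    # sections and the DOCTYPE all begin with "<!".
    my $markup = qr{
          (?<comment> <!-- .*? --> )
        | (?<cdata>   <!\[CDATA\[ .*? \]\]> )
        | (?<xmldecl> <\?xml\s .*? \?> )
        | (?<pi>      <\? .*? \?> )
        | (?<doctype> <!DOCTYPE [^<>\[]* (?: \[ .*? \] \s* )? > )
        | (?<tag>     < /? [^\s<>/!?] (?: [^<>"'] | "[^"]*" | '[^']*' )* > )
    }xs;

    my $pos = 0;
    while ($doc =~ /$markup/g) {
        # Copy the match data before the callback can run another regex.
        my ($start, $end) = ($-[0], $+[0]);
        my ($kind) = keys %+;    # only the alternative that matched is set

        # Character data between the previous markup and this one.
        $callback->('text', substr($doc, $pos, $start - $pos))
            if $start > $pos;

        # The markup itself, exactly as it appears in the source.
        $callback->($kind, substr($doc, $start, $end - $start));
        $pos = $end;
    }

    # Trailing character data after the last piece of markup.
    $callback->('text', substr($doc, $pos)) if $pos < length $doc;
    return;
}

my $doc = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<documentroottag>
<tagonly attrib1="foo" attrib2="bar" />
<tagwithtext>one defined (&lt;) and one undefined (&undefinedentity;).</tagwithtext>
<!-- I am a comment -->
</documentroottag>
XML

# Collect every event as [kind, raw text].
my @events;
tokenize_raw($doc, sub { push @events, [@_] });

for my $event (@events) {
    my ($kind, $raw) = @$event;
    (my $shown = $raw) =~ s/\n/\\n/g;    # make newlines visible
    printf "%-8s %s\n", $kind, $shown;
}
```

Because every byte of the input ends up in exactly one event, concatenating the raw strings of all events reproduces the original document, and entities such as `&undefinedentity;` are passed through without being resolved or rejected. A sketch like this does not check well-formedness and does not report attributes as separate events; anything starting with `<` that matches none of the patterns is simply treated as character data.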
I hope I have described precisely what I would like to do, and I hope somebody can help me with this problem.
Thank you very much,
Nocturnus
Replies are listed 'Best First'.
Re: Seeking for advice: XML parsing with special requirements (regex)
by tye (Sage) on Apr 22, 2012 at 18:18 UTC
by Nocturnus (Scribe) on Apr 23, 2012 at 18:13 UTC

Re: Seeking for advice: XML parsing with special requirements
by Anonymous Monk on Apr 22, 2012 at 10:14 UTC
by Nocturnus (Scribe) on Apr 22, 2012 at 14:04 UTC

Re: Seeking for advice: XML parsing with special requirements
by Jenda (Abbot) on Apr 23, 2012 at 08:31 UTC
by Nocturnus (Scribe) on Apr 28, 2012 at 16:53 UTC
by Jenda (Abbot) on Apr 29, 2012 at 16:03 UTC
by Nocturnus (Scribe) on Aug 25, 2012 at 10:24 UTC