Dear Monks,
I am seeking advice on how to parse XML / HTML files when the following special requirements are given:
So far, I have tried four different parsers (XML::SAX, XML::LibXML::SAX, XML::Parser, XML::LibXML::Reader) and read a lot about other parsers I could possibly use, but all of them either failed or seem inappropriate in one respect or another.
I hope that somebody can give me a hint on how to achieve the goals described above. To make the most important issues clear, here is a short example. Suppose there is the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<documentroottag>
<tagonly attrib1="foo" attrib2="bar" />
<tagwithtext> (note the space) This is normal character data with one defined (&lt;) and one undefined (&undefinedentity;).</tagwithtext>
<!-- I am a comment -->
</documentroottag>
Given that document:
First, the parser must emit an event telling me that it has encountered non-character data, and it must provide access to the corresponding unmodified, unparsed string. Alternatively, the parser could be more precise and tell me that it has found an XML declaration; but in that case, too, it must provide access to the unmodified, unparsed string.
The next event I am interested in is after the closing > of the XML declaration; the state changes from non-character data to character data here (the character data in this case is only a newline char, though). Once again, the parser must throw an event which tells me about that change of state, and must provide access to the unmodified, unparsed original character data.
The same applies analogously to all other lines of the document.
Regarding the bytes enclosed by <tagwithtext> and </tagwithtext>, the parser does not need to resolve internal or external entities and must not throw errors about undefined entities. In every case, even if it has successfully resolved the entities, it must provide access to the original, unparsed characters, including all entities in unresolved (original) form.
The same applies analogously to all other syntactic constructs of XML documents (attribute declarations, processing instructions, CDATA sections, ...): the parser must inform me about every change from character data to non-character data and vice versa. It may, but does not have to, tell me exactly what sort of non-character data is currently being parsed. In every case, I need access to the original, unmodified data.
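To make the desired event stream more concrete, here is a rough sketch in plain Perl of the behaviour I have in mind. It is not a real XML parser and not a proposed solution: the function name `tokenize_raw` and the event labels `markup` and `text` are names I made up for illustration, the regular expression only approximates XML syntax, and a DOCTYPE with an internal subset is not handled. The point is only that every chunk is handed over exactly as it appears in the source, with entities left untouched.

```perl
use strict;
use warnings;

# Approximate pattern for one piece of non-character data (markup).
# Assumption: this is a simplification, not the full XML grammar.
my $markup = qr{
      <!--.*?-->                                   # comment
    | <!\[CDATA\[.*?\]\]>                          # CDATA section
    | <\?.*?\?>                                    # XML declaration or PI
    | <![A-Za-z](?:[^<>"']|"[^"]*"|'[^']*')*>      # DOCTYPE without subset
    | </?[A-Za-z_:](?:[^<>"']|"[^"]*"|'[^']*')*>   # start, end or empty tag
}xs;

# Hypothetical helper: returns a list of [ kind, raw_string ] pairs,
# where kind is 'markup' or 'text' and raw_string is the unmodified source.
sub tokenize_raw {
    my ($src) = @_;
    my @events;
    pos($src) = 0;
    while (pos($src) < length $src) {
        if ($src =~ /\G($markup)/gc) {
            push @events, [ markup => $1 ];
        }
        elsif ($src =~ /\G([^<]+)/gc) {
            push @events, [ text => $1 ];    # entities stay unresolved
        }
        else {
            # A stray '<' that starts no recognised markup: pass it on as text.
            $src =~ /\G(.)/gcs;
            push @events, [ text => $1 ];
        }
    }
    return @events;
}

my $doc = qq{<?xml version="1.0"?>\n<a b="c>d">x &lt; &undefinedentity;</a><!-- hi -->};
for my $event (tokenize_raw($doc)) {
    my ($kind, $raw) = @$event;
    printf "%-6s [%s]\n", $kind, $raw;
}
```

Concatenating the raw strings of all events gives back the input byte for byte, which is the property I need. What I am looking for is a proper parser that offers the same guarantee.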
I hope I have described precisely what I would like to do, and I hope somebody could help me with that problem.
Thank you very much,
Nocturnus