Dear Monks,

I am seeking advice on how to parse XML / HTML files when the following special requirements are given:

Parsing must be stream based since some of the documents to be parsed are very big.
The parser must at least notify me whenever the input changes from normal character data to non-character data or vice versa (the parser is allowed to signal additional events, though).
Regardless of which notification (event) the parser throws, I need access to the original (unparsed) string that caused the event.
No whitespace and no non-significant / non-relevant bytes (in the sense of the XML specification) from the input stream may be thrown away when providing access to the original string.
Validation must not take place.
Checks for well-formedness may be done, but there must be a means to suppress some of the resulting error messages; notably, there must be no error messages about undefined entities.

So far, I have tried four different parsers (XML::SAX, XML::LibXML::SAX, XML::Parser, XML::LibXML::Reader) and read a lot about other parsers I could possibly use, but all of them either failed or seem inappropriate in one respect or another.

I hope that somebody can give me a hint on how to achieve the goals described above. To make the most important issues clear, here is a short example. Suppose there is the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<documentroottag>
<tagonly attrib1="foo" attrib2="bar" />
<tagwithtext> (note the space) This is normal character data with one defined (&lt;) and one undefined (&undefinedentity;).</tagwithtext>
<!-- I am a comment -->
</documentroottag>

Given that document:

First, the parser must throw an event telling me that there is non-character data and giving me access to the corresponding unmodified, unparsed string. Alternatively, the parser could be more precise and tell me that it has found an XML declaration; but in that case, too, it must give me access to the corresponding unmodified, unparsed string.

The next event I am interested in comes after the closing > of the XML declaration; here the state changes from non-character data to character data (in this case the character data is only a newline character, though). Once again, the parser must throw an event that tells me about the change of state, and must provide access to the unmodified, unparsed original character data.

The same applies analogously to all lines of the document.

Regarding the bytes enclosed by <tagwithtext> and </tagwithtext>, the parser does not need to resolve internal or external entities and must not throw errors about undefined entities; in every case, even if it has successfully resolved the entities, it must provide access to the original, unparsed characters, including all entities in unresolved (original) form.
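For illustration only, the entity handling described above can be sketched as follows. This is a hypothetical helper written in Python, not part of any of the modules named in this post. The names `entity_refs`, `ENTITY_REF` and `PREDEFINED` are made up for the example, and the name pattern is a simplified, ASCII-oriented approximation of the XML Name rule. The helper only locates references in the raw character data and classifies them; it resolves nothing, reads no DTD, and never raises an error for an undefined entity.

```python
import re

# The five entities predefined by the XML specification.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

# Character references (&#38; / &#x26;) and named references (&foo;).
# Assumption: a simplified approximation of the XML Name production.
ENTITY_REF = re.compile(r"&(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_:][\w.:-]*);")


def entity_refs(raw):
    """Return (offset, raw_reference, status) for each reference in raw.

    status is 'char', 'predefined' or 'unknown'. 'unknown' only means
    "not one of the five predefined entities"; no DTD is consulted.
    The input string itself is never modified.
    """
    found = []
    for match in ENTITY_REF.finditer(raw):
        name = match.group(1)
        if name.startswith("#"):
            status = "char"
        elif name in PREDEFINED:
            status = "predefined"
        else:
            status = "unknown"
        found.append((match.start(), match.group(0), status))
    return found
```

Because the function returns offsets into the unmodified string, a caller can still hand the original text on unchanged and merely skip or log the references it does not recognise.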

The same applies analogously to all possible syntactic blocks of XML documents (attribute declarations, processing instructions, CDATA, ...): the parser must inform me about any change from character data to non-character data and vice versa; it does not need to, but is allowed to, inform me about exactly what sort of non-character data is being parsed at the moment. In every case, I need access to the original, unmodified data.
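To make the desired behaviour concrete, here is a minimal sketch of a lossless, stream-based tokenizer. It is written in Python purely for illustration and is not one of the parsers named in this post; `tokenize`, `DELIMS` and `chunk_size` are hypothetical names. It assumes the stream is opened in text mode, that the DOCTYPE has no internal subset, and that attribute values contain no raw `>`. It performs no validation and no entity resolution, and it may report two markup events in a row (for example a comment directly followed by an end tag).

```python
import io

# (opener, closer) pairs, most specific first.
# Assumptions: no DOCTYPE internal subset, no raw '>' in attribute values.
DELIMS = [("<!--", "-->"), ("<![CDATA[", "]]>"), ("<?", "?>"), ("<", ">")]
LONGEST_OPENER = max(len(opener) for opener, _ in DELIMS)


def tokenize(stream, chunk_size=4096):
    """Yield (kind, raw) pairs, where kind is 'text' or 'markup'.

    Concatenating every raw string reproduces the input exactly.
    """
    buf, eof = "", False
    while True:
        if buf.startswith("<"):
            end = -1
            # Only decide which construct this is once enough characters
            # are buffered to tell '<!--' and '<![CDATA[' apart from '<'.
            if eof or len(buf) >= LONGEST_OPENER:
                opener, closer = next(d for d in DELIMS if buf.startswith(d[0]))
                found = buf.find(closer, len(opener))
                if found != -1:
                    end = found + len(closer)
            if end != -1:
                yield ("markup", buf[:end])
                buf = buf[end:]
                continue
        elif buf:
            lt = buf.find("<")
            if lt != -1:
                yield ("text", buf[:lt])
                buf = buf[lt:]
                continue
        # Nothing complete is buffered: read more, or flush at end of input.
        if eof:
            if buf:
                yield ("markup" if buf.startswith("<") else "text", buf)
            return
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True


if __name__ == "__main__":
    sample = '<?xml version="1.0"?>\n<a>text &undefinedentity;</a>\n'
    for kind, raw in tokenize(io.StringIO(sample), chunk_size=8):
        print(kind, repr(raw))
```

Only the current token is held in memory, so large documents are read chunk by chunk; a single very long run of character data is still buffered whole so that it is reported as one event.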

I hope I have described precisely what I would like to do, and I hope somebody can help me with this problem.

Thank you very much,

Nocturnus

In reply to Seeking for advice: XML parsing with special requirements [Solved] by Nocturnus
