in reply to Re: Tokenizing XML
in thread Tokenizing XML

You say:

OTOH, <!------> (6 dashes) is fine,

I did what you told me and took a look at the W3C Recommendation, which says:
the string "--" (double-hyphen) MUST NOT occur within comments.
So I think this should do for a legal comment:
<!--(?:[^-]|-[^-])*-->
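
A quick way to check the pattern against a few cases is sketched below. It uses Python's `re` module only for convenience (the pattern itself is unchanged and uses the same syntax in Perl); the helper name `is_legal_comment` and the sample strings are made up for illustration.

```python
import re

# The pattern from the post, unchanged.
XML_COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")

def is_legal_comment(text):
    """Return True if the whole string is one comment the pattern accepts."""
    return XML_COMMENT.fullmatch(text) is not None

# Hypothetical samples: the first three should pass, the last three
# contain "--" inside the comment body and should be rejected.
samples = [
    "<!-- plain comment -->",
    "<!-- a - b -->",
    "<!---->",
    "<!-- a -- b -->",
    "<!------>",
    "<!-- a --->",
]

for s in samples:
    print(f"{s!r}: {is_legal_comment(s)}")
```

Note that under this pattern the six-dash case is rejected, which is what the quoted sentence from the Recommendation requires.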


s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

Re^3: Tokenizing XML
by Aristotle (Chancellor) on Dec 27, 2005 at 00:33 UTC

    With six dashes it doesn’t occur within comments: six dashes mean one complete (empty) comment (the first 4 dashes) and one open comment that stretches past the angle bracket (the next 2 dashes). 8 dashes in a row would be valid and do what you expect (because they indicate 2 empty comments, both closed). So sequences of 4 dashes do not cause confusion. Neither do sequences of 5 dashes, provided that the 5-dash sequence is followed by a non-dash character.
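
The SGML-style reading described above can be approximated with a second pattern. This is only a rough sketch and not a full SGML parser: it assumes a comment declaration is `<!`, then any number of `--...--` comments (each optionally followed by whitespace), then `>`. The names `SGML_COMMENT_DECL` and `is_sgml_comment_decl` are hypothetical, and Python is used only to run the regex.

```python
import re

# Assumed shape of an SGML-style comment declaration:
# "<!" + zero or more "--...--" comments (optional whitespace after each) + ">".
SGML_COMMENT_DECL = re.compile(r"<!(?:--(?:[^-]|-[^-])*--\s*)*>")

def is_sgml_comment_decl(text):
    """Return True if the whole string is one declaration under the assumed rule."""
    return SGML_COMMENT_DECL.fullmatch(text) is not None

samples = [
    "<!------>",                   # 6 dashes: second comment is left open
    "<!------> still inside -->",  # the open comment runs past the first ">"
    "<!-------->",                 # 8 dashes: two empty comments, both closed
    "<!-- a ---- b -->",           # 4 dashes in a row
    "<!-- a ----- b -->",          # 5 dashes followed by a non-dash
]

for s in samples:
    print(f"{s!r}: {is_sgml_comment_decl(s)}")
```

Under these assumptions, the bare six-dash string is not a complete declaration on its own, while the other four samples are accepted.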

    Your regex will reject valid comments.

    Ok, turns out that I’m applying SGML rules and that they have been simplified for XML. I guess I should have another read over the spec myself, sigh. That regex should work then.

    Makeshifts last the longest.