Skeeve has asked for the wisdom of the Perl Monks concerning the following question:
Merry Christmas fellow monks!
For a beanshell (yes! It's not Perl) macro I need a regular expression for tokenizing XML.
I've read several nodes here advising against parsing XML with regular expressions. But since I don't want to parse it, just tokenize all the parts of an XML file held in a String, I thought it might be a good idea to ask for your assistance.
The regular expression I have now (see below) is sufficient for the XML in question. But if it's not too much overhead, I'd love to be able to tokenize any valid piece of XML with it. Or, to be specific, just tags, comments, CDATA, and the prolog. I don't care about entities or any DTD.
The expression I have now is: (split up for readability)
    (?s)
    (?:
        (<\w+ (?:\s*\b\w+= (?: "[^<"]*" | '[^<']*' ) )* \s*/?> )
      | (<!--.*?-->)
      | (<!\[CDATA\[.*?\]\]>)
      | (<!\w+.*?>)
      | (<\?xml.*?\?>)
      | (</\w+>)
    )
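If this matches, one of these back references is not empty:
    Group 1: a start tag or empty-element tag
    Group 2: a comment
    Group 3: a CDATA section
    Group 4: a <!...> declaration (for example a DOCTYPE)
    Group 5: the <?xml ...?> prolog
    Group 6: an end tag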
Many thanks in advance!
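For anyone who wants to try the expression as posted, here is a minimal sketch in plain Java (BeanShell uses the same java.util.regex classes). It only illustrates how the six capture groups can be turned into labelled tokens. The class name XmlTokenizer, the method tokenize, the token labels, and the "KIND:text" output format are all made up for this example, and treating everything between matches as a TEXT token is an assumption, not part of the original expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlTokenizer {
    // Hypothetical labels for capture groups 1..6, in pattern order.
    private static final String[] KINDS = {
        "TAG", "COMMENT", "CDATA", "DECLARATION", "PROLOG", "END_TAG"
    };

    // The expression from the post with the readability whitespace removed.
    private static final Pattern TOKEN = Pattern.compile(
        "(?s)(?:"
        + "(<\\w+(?:\\s*\\b\\w+=(?:\"[^<\"]*\"|'[^<']*'))*\\s*/?>)"
        + "|(<!--.*?-->)"
        + "|(<!\\[CDATA\\[.*?\\]\\]>)"
        + "|(<!\\w+.*?>)"
        + "|(<\\?xml.*?\\?>)"
        + "|(</\\w+>)"
        + ")");

    // Returns tokens as "KIND:text"; anything between two matches is
    // reported as a TEXT token (an assumption made for this sketch).
    public static List<String> tokenize(String xml) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(xml);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                tokens.add("TEXT:" + xml.substring(last, m.start()));
            }
            // Only the alternative that matched has a non-null group.
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {
                    tokens.add(KINDS[g - 1] + ":" + m.group(g));
                    break;
                }
            }
            last = m.end();
        }
        if (last < xml.length()) {
            tokens.add("TEXT:" + xml.substring(last));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><a x=\"1\">hi<!-- c --><![CDATA[<b>]]></a>";
        for (String token : tokenize(xml)) {
            System.out.println(token);
        }
    }
}
```

Note that \w+ does not match names containing a colon or a hyphen, so namespaced tags such as <xs:element> fall through to TEXT with this pattern.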
Replies are listed 'Best First'.
Re: Tokenizing XML
by Aristotle (Chancellor) on Dec 26, 2005 at 15:05 UTC
by eric256 (Parson) on Dec 26, 2005 at 21:15 UTC
by Skeeve (Parson) on Dec 26, 2005 at 23:35 UTC
by Aristotle (Chancellor) on Dec 27, 2005 at 00:33 UTC
Re: Tokenizing XML
by merlyn (Sage) on Dec 26, 2005 at 17:52 UTC
by Skeeve (Parson) on Dec 26, 2005 at 19:51 UTC