Skeeve has asked for the wisdom of the Perl Monks concerning the following question:

Merry Christmas fellow monks!

For a beanshell (yes! It's not Perl) macro I need a regular expression for tokenizing XML.

I've read several nodes here advising against parsing XML with regular expressions. But since I don't want to parse it, just to tokenize all the parts of an XML file held in a String, I thought it might be a good idea to ask for your assistance.

The regular expression I have now (see below) is sufficient for the XML in question. But if it's not too much overhead, I'd love to be able to tokenize any valid XML part with it. Or, to be specific, just tags, comments, CDATA, and prolog. I don't care about entities or any DTD.

The expression I have now is: (split up for readability)

(?s)
(?:
    (<\w+ (?:\s*\b\w+= (?: "[^<"]*" | '[^<']*' ) )* \s*/?> )
  | (<!--.*?-->)
  | (<!\[CDATA\[.*?\]\]>)
  | (<!\w+.*?>)
  | (<\?xml.*?\?>)
  | (</\w+>)
)

If this matches, one of these back references is not empty:

  1. Element
  2. Comment
  3. CDATA
  4. DOCTYPE (etc.?)
  5. Prolog
  6. End tag

Maybe some XML expert monk sees a bug in my regex, or can suggest enhancements I could include?
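
For readers who want to try the expression, here is a minimal Perl sketch of how the six capture groups could be turned into labelled tokens. It assumes the regex is compiled with /x (so the layout whitespace is ignored) and /s (the equivalent of the leading (?s)); the names `tokenize` and `@names` are made up for illustration and do not come from the post.

```perl
use strict;
use warnings;

# The regex from the post, compiled with /x (ignore layout whitespace)
# and /s (let "." match newlines, like the leading (?s)).
my $token = qr{
    ( <\w+ (?: \s* \b\w+= (?: "[^<"]*" | '[^<']*' ) )* \s* /? > )  # 1 element
  | ( <!--.*?--> )                                                 # 2 comment
  | ( <!\[CDATA\[.*?\]\]> )                                        # 3 CDATA
  | ( <!\w+.*?> )                                                  # 4 DOCTYPE etc.
  | ( <\?xml.*?\?> )                                               # 5 prolog
  | ( </\w+> )                                                     # 6 end tag
}xs;

# Hypothetical labels, in the same order as the capture groups.
my @names = qw(element comment cdata doctype prolog end_tag);

# Return a list of [label, matched text] pairs, in document order.
sub tokenize {
    my ($xml) = @_;
    my @tokens;
    while ( $xml =~ /$token/g ) {
        # After a successful match exactly one group is defined.
        my @groups = ( $1, $2, $3, $4, $5, $6 );
        for my $i ( 0 .. $#groups ) {
            next unless defined $groups[$i];
            push @tokens, [ $names[$i], $groups[$i] ];
            last;
        }
    }
    return @tokens;
}

my $sample = '<?xml version="1.0"?><!-- hi --><a x="1"><![CDATA[<b>]]></a>';
printf "%-8s %s\n", @$_ for tokenize($sample);
```

Because the scan runs left to right, the `<b>` inside the CDATA section is consumed as part of the CDATA token and is not reported as an element.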

Many thanks in advance!


s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

Replies are listed 'Best First'.
Re: Tokenizing XML
by Aristotle (Chancellor) on Dec 26, 2005 at 15:05 UTC

    The comment matching subexpression is way too liberal. How it really works is that <! opens a markup declaration, and within one, a -- double dash starts a comment and another one terminates the comment. In other words, <!-----> (5 dashes) cannot be part of a well-formed document: <! starts the declaration, -- starts a comment, -- closes it, but then there’s an extraneous - dash which is outside the comment and is a markup declaration syntax error. OTOH, <!------> (6 dashes) is fine, except it will comment out much more than you think, because the first 4 dashes open and close an empty comment, and then the last two dashes open a comment that is not closed at this point. However, if you write something like <!--- --> (3 dashes, space, 2 dashes), that’s fine: the first 2 dashes open the comment, the next dash is lone so it’s part of the comment (along with the space that follows), and the last 2 dashes close the comment.

    Likewise, your parsing for the XML PI is too simpleminded; additionally, you have no provision for parsing other kinds of PIs at all.
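
To make the point about processing instructions concrete, here is a rough Perl sketch that accepts any PI and singles out the XML declaration by its target. It follows the shape of the XML 1.0 PI production (a target name, optional whitespace-separated data, and no "?>" inside the data), but the target pattern is an ASCII-only approximation of the spec's Name, and `parse_pi` is a hypothetical helper, not something from this thread.

```perl
use strict;
use warnings;

my $pi = qr{
    <\?
    ( [A-Za-z_:] [-A-Za-z0-9_:.]* )    # 1: target (ASCII-only approximation of Name)
    (?: \s+ ( (?: (?!\?>) . )* ) )?    # 2: optional data, stops before the closing delimiter
    \?>
}xs;

# Returns (kind, target, data) for a string that is exactly one PI,
# or an empty list otherwise. Data is undef when the PI has none.
sub parse_pi {
    my ($text) = @_;
    return unless $text =~ /\A$pi\z/;
    my ( $target, $data ) = ( $1, $2 );
    # Only the exact target "xml" is treated as the XML declaration here;
    # other capitalizations are reserved by the spec and not handled.
    my $kind = $target eq 'xml' ? 'xml_declaration' : 'processing_instruction';
    return ( $kind, $target, $data );
}

for my $s ( '<?xml version="1.0"?>', '<?xml-stylesheet href="a.xsl"?>', '<?foo?>' ) {
    my ( $kind, $target ) = parse_pi($s);
    print "$s => $kind ($target)\n";
}
```

Note that a pattern such as `<\?xml.*?\?>` would also label `<?xml-stylesheet ...?>` as the prolog, which is one way the original subexpression is too simple.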

    You may have more errors; I didn’t look any harder than that.

    You’d do well to just read the spec; it really isn’t prohibitively big or difficult to understand, and if you want to write a correct parser, there’s no way around it.

    Because despite claims to the contrary, what you’re doing is parsing. And regexen are certainly a fine tool to do that. The oft-repeated advice is because people who use regexen usually don’t actually parse, and in general, do not have a task that merits the effort required to write a complete, correct parser on their own either, so telling them to use an existing parser is exactly what they need.

    Makeshifts last the longest.

      Hmmm, I was playing with the comment subexpression and I think this might work (dunno, I played with lookaheads, but those aren't really my speciality. ;))

      my $test = qr/(<!--(?:[-][^-]|[^-])*-->)/;
      for ( "<!-- -->", "<!----->", "<!--- -->", "<!---->", "<!-- - -- -->" ) {
          print $_, " ", ( $_ =~ $test ? "ok" : "fail" ), "\n";
      }

      So it should match anything that looks like a comment where the only time two dashes appear in a row is at the beginning and at the end. Needs more vigorous testing though.


      ___________
      Eric Hodges

      You say:

      OTOH, <!------> (6 dashes) is fine,

      I did what you told me and took a look at the W3C recommendation, which says:
      the string "--" (double-hyphen) MUST NOT occur within comments.
      So I think this should do for a legal comment:
      <!--(?:[^-]|-[^-])*-->
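
A small Perl sketch for checking that expression against the cases discussed in this thread. It assumes the XML 1.0 Comment production, quoted in the code comment, and anchors the pattern so that each test string has to be exactly one comment; `is_xml_comment` is a hypothetical helper name.

```perl
use strict;
use warnings;

# XML 1.0: Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
my $comment = qr/<!--(?:[^-]|-[^-])*-->/;

# True if the whole string is exactly one comment under the rule above.
sub is_xml_comment {
    my ($s) = @_;
    return $s =~ /\A$comment\z/ ? 1 : 0;
}

for my $case ( '<!-- -->', '<!---->', '<!--- -->',
               '<!----->', '<!------>', '<!-- - -- -->' ) {
    printf "%-16s %s\n", $case, is_xml_comment($case) ? 'ok' : 'fail';
}
```

Under this rule the first three strings are accepted and the last three are rejected, because each of those has a double hyphen before the closing `-->`.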


      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

        With six dashes it doesn’t occur within comments: six dashes mean one complete (empty) comment (the first 4 dashes) and one open comment that stretches past the angle bracket (the next 2 dashes). 8 dashes in a row would be valid and do what you expect (because they indicate 2 empty comments, both closed). So sequences of 4 dashes do not cause confusion. Neither do sequences of 5 dashes, provided that a 5-dash-sequence is followed by a non-dash character.

        Your regex will reject valid comments.

        Ok, turns out that I’m applying SGML rules and that they have been simplified for XML. I guess I should have another read over the spec myself, sigh. That regex should work then.

        Makeshifts last the longest.

Re: Tokenizing XML
by merlyn (Sage) on Dec 26, 2005 at 17:52 UTC

      Not quite...

      Since I need it inside a macro and just want to locate elements etc., a pure-Perl parser is helpful for analysing its source and maybe borrowing parts of it, but at first glance I think it's way too much.

      As a matter of fact, my "parser" need not choke on invalid XML. The macro will be used when editing XML, so the XML might well be invalid. I rely on other plugins of the editor to report invalidity. Unfortunately I have no idea (yet) how to utilize these other plugins (which already parse the XML), so I came up with my regex in order to find whatever the macro searches for.
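
One way to get that kind of tolerance is a scanner that never fails: anything that does not look like markup is handed back as a text token, so half-typed XML in an editor buffer still tokenizes. The Perl sketch below is only an illustration under that assumption; the markup pattern is deliberately simplified (it is not the regex from the original post) and `scan` is a hypothetical name.

```perl
use strict;
use warnings;

# Simplified markup pattern (an assumption for this sketch):
# comment, CDATA section, processing instruction, or any other <...> run.
my $markup = qr{ <!--.*?--> | <!\[CDATA\[.*?\]\]> | <\?.*?\?> | <[^<>]*> }xs;

# Returns [kind, offset, text] triples covering the whole input.
sub scan {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;
    while ( pos($src) < length($src) ) {
        my $start = pos($src);
        if ( $src =~ /\G($markup)/gc ) {
            push @tokens, [ markup => $start, $1 ];
        }
        elsif ( $src =~ /\G([^<]+|<)/gc ) {
            # Plain text, or a stray "<" that opens nothing recognizable.
            push @tokens, [ text => $start, $1 ];
        }
    }
    return @tokens;
}

printf "%-6s %3d  %s\n", @$_ for scan('a <b> c < d <!-- x');
```

The /gc flags keep the match position when an alternative fails, and every loop pass consumes at least one character, so the scan always reaches the end of the input. Concatenating the token texts reproduces the source exactly, which makes the offsets usable for cursor positioning in a macro.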

      Nevertheless: thanks for pointing me to the module.


      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e