(Original poster here, de-anonymized.) One or two side-notes, for any XML geeks out there.

First, according to the spec, AttValues are allowed to contain pretty much anything except <, a bare & (one that doesn't start a reference), and their own closing quote character (I view this permissiveness as a design error). I have excluded > here as well, but there will be a pre-filter to prevent stupid/ugly XML like that from ever getting this far. (Preceding program logic also ensures that this regex only ever sees strings bounded by balanced tags, so that's another worry gone.)
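For anyone who wants to poke at that restriction, here is a minimal sketch of an AttValue-ish pattern that excludes <, > and the value's own quote character. The names `$ATT_VALUE` and `is_att_value` are made up for this illustration and are not from the regex under discussion; it also does not check that & starts a valid reference.

```perl
use strict;
use warnings;

# Hypothetical sketch: a quoted attribute value that may contain anything
# except <, >, and whichever quote character opened it.
my $ATT_VALUE = qr{
      " [^<>"]* "     # double-quoted: single quotes are allowed inside
    | ' [^<>']* '     # single-quoted: double quotes are allowed inside
}x;

# True if the whole string is one quoted value under the rules above.
sub is_att_value {
    my ($string) = @_;
    return $string =~ /\A$ATT_VALUE\z/ ? 1 : 0;
}
```

Calling `is_att_value(q{"it's fine"})` succeeds because the apostrophe is not the closing quote, while `is_att_value(q{"a>b"})` fails because of the extra exclusion of >.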

Second, the spec restricts the composition of a Name more tightly than my character class does. On well-formed input, though, the regex will still pick out a Name and only a Name, because the class excludes whitespace and =.
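A rough sketch of why a loose class is enough on well-formed input: the name simply runs until the first whitespace or =. Everything here (`$LOOSE_NAME`, `$ATT_VALUE`, `attribute_names`, and the extra characters I excluded from the class) is an assumption for illustration, not the class used in the actual regex.

```perl
use strict;
use warnings;

# Hypothetical loose "Name": far more permissive than the spec's Name
# production, but it cannot run past whitespace or an equals sign.
my $LOOSE_NAME = qr{[^\s=<>/"']+};

# Quoted value, excluding <, > and the value's own quote character.
my $ATT_VALUE = qr{"[^<>"]*"|'[^<>']*'};

# Return the attribute names found in one start tag, in document order.
sub attribute_names {
    my ($tag) = @_;
    my @names;
    while ($tag =~ /\s($LOOSE_NAME)\s*=\s*$ATT_VALUE/g) {
        push @names, $1;
    }
    return @names;
}
```

Because each match consumes the whole quoted value, an = inside a value (as in `type='a=b'`) is never mistaken for the start of another attribute.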

Third, I apologize for the really gross bit that captures the value of type, with the lookbehind and the conditional, but it solves a problem (namely, How do you capture just what's between the quotes, when you don't know what's allowed between the quotes until you've seen them?). It's easier to read if you imagine it says ($attValue) and just mentally supply the magic that breaks off the quote marks (and I coded such a version, using s///e, but it was slooooow compared to the lookbehind/conditional). If any XML geek reads this who happens to end up on a future spec committee, remember that giving everybody their favorite English quote character is nice, but it has a parsing cost.
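To make the lookbehind-plus-conditional trick concrete, here is a minimal sketch of the idea, not the regex from this thread. The names `$TYPE_VALUE` and `type_value` are hypothetical, and the possessive quantifiers (Perl 5.10 or later) are my own addition so that the closing quote can't end up mismatched through backtracking.

```perl
use strict;
use warnings;

# Hypothetical sketch: capture only what lies between the quotes of a
# type attribute, deciding what is allowed after seeing the opening quote.
my $TYPE_VALUE = qr{
    (?<=\s) type \s* = \s*
    ["']                     # opening quote, either kind
    (                        # capture group 1: the bare value
      (?(?<=")               # was the character just consumed a double quote?
          [^<>"]*+           #   yes: anything but <, > and a double quote
        | [^<>']*+           #   no:  anything but <, > and a single quote
      )
    )
    ["']                     # closing quote, forced to be the same kind
}x;

# Return the unquoted value of type, or undef if the tag has none.
sub type_value {
    my ($tag) = @_;
    return $tag =~ $TYPE_VALUE ? $1 : undef;
}
```

The closing `["']` looks permissive, but each branch has already swallowed every character except <, > and the matching quote, and the possessive `*+` refuses to give any of them back, so only the right quote can follow.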

Fourth, if you're saying that weird transformations of XML like this really shouldn't be necessary, I personally believe you're absolutely correct. The XML we're transforming uses the same elements to do different things, yuck, in a way that XSD can't (currently) validate, so we're pretty much transforming it into the form that should have been specified in the first place, which we can validate with XSD (and which is more readable anyway).


In reply to Re: Regex optimization: Can (?> ) and minimal match help here? by eritain
in thread Regex optimization: Can (?> ) and minimal match help here? by Anonymous Monk
