in reply to Regex optimization: Can (?> ) and minimal match help here?

(Original poster here, de-anonymized.) One or two side-notes, for any XML geeks out there.

First, according to the spec, AttValues are allowed to contain pretty much anything except for <, a bare &, and their own closing quote character (I view this as a design error). I have excluded > here as well, but there will be a pre-filter to keep stupid/ugly XML like that from ever getting this far. (Preceding program logic also ensures that this regex only ever sees strings bounded by balanced tags, so that's another worry gone.)
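To make the character-class point concrete, here is a minimal sketch of an attribute-value pattern along those lines. The name $att_value and the helper is_att_value are hypothetical and are not taken from the regex under discussion; the only assumption carried over is that > is excluded in addition to what the spec forbids.

```perl
use strict;
use warnings;

# Hypothetical pattern: the spec's two quoting forms, with > excluded as well.
# (The spec itself only rules out <, a bare &, and the delimiting quote;
# this sketch does not check & at all.)
my $att_value = qr/"[^<>"]*"|'[^<>']*'/;

# Returns 1 if the whole string is one quoted value, 0 otherwise.
sub is_att_value {
    my ($text) = @_;
    return $text =~ /\A$att_value\z/ ? 1 : 0;
}

for my $candidate (q{"it's"}, q{'say "hi"'}, q{"a<b"}, q{"mixed'}) {
    printf "%-12s %s\n", $candidate, is_att_value($candidate) ? 'ok' : 'rejected';
}
```

Each quote style needs its own alternative because the set of permitted characters depends on which quote opened the value.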

Second, according to the spec, a Name is a little more restricted in its composition than what I've got here. But the regex will still pick out a Name and only a Name because its character class excludes whitespace and =.
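A rough illustration of why the looser class is good enough on well-formed input. Both patterns below are hypothetical stand-ins, not the ones from the regex under discussion, and $strict_name covers only the ASCII part of the spec's Name production.

```perl
use strict;
use warnings;

# Hypothetical stand-ins, not the patterns from the original post.
my $loose_name  = qr/[^\s=]+/;                      # anything but whitespace and =
my $strict_name = qr/[A-Za-z_:][-A-Za-z0-9_:.]*/;   # ASCII-only subset of Name

# Collect whatever sits between leading whitespace and an = sign.
# Assumes no attribute value contains whitespace followed by "something=".
sub attr_names {
    my ($re, $attrs) = @_;
    return $attrs =~ /(?:^|\s)($re)\s*=/g;
}

my $attrs = q{type="a" xml:lang='en' data-id="7"};
print join(',', attr_names($loose_name,  $attrs)), "\n";   # type,xml:lang,data-id
print join(',', attr_names($strict_name, $attrs)), "\n";   # type,xml:lang,data-id
```

On input that is already known to be well formed, the two agree; the strict class only matters if the regex is also expected to reject bad names.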

Third, I apologize for the really gross bit that captures the value of type, with the lookbehind and the conditional, but it solves a problem: how do you capture just what's between the quotes when you don't know what's allowed between the quotes until you've seen them? It's easier to read if you imagine it says ($attValue) and mentally supply the magic that breaks off the quote marks. (I did code such a version, using s///e, but it was slooooow compared to the lookbehind/conditional.) If any XML geek reading this happens to end up on a future spec committee, remember that giving everybody their favorite English quote character is nice, but it has a parsing cost.
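For anyone who wants to see the shape of the problem, here is a minimal sketch of one other way to get the value without its quotes. It is not the lookbehind/conditional described above; it swaps in a backreference plus a negative lookahead, and its speed relative to the original has not been measured. The names $type_value and type_of are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical pattern, not the one from the original post.
# Remember which quote opened the value, then stop at the same quote.
my $type_value = qr{
    \s type \s* = \s*
    (["'])                    # group 1: the opening quote, either kind
    ( (?: (?!\1) [^<>] )* )   # group 2: the value, up to the matching quote
    \1                        # the same quote closes the value
}x;

# Returns the unquoted value of the type attribute, or undef if absent.
sub type_of {
    my ($tag) = @_;
    return $tag =~ $type_value ? $2 : undef;
}

print type_of(q{<item type="it's" id='x'>}), "\n";   # it's
print type_of(q{<item type='say "hi"'>}), "\n";      # say "hi"
```

The lookahead is tested before every character of the value, which is the "parsing cost" in miniature: the engine cannot know which characters are legal until it has seen the opening quote.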

Fourth, if you're saying that weird transformations of XML like this really shouldn't be necessary, I personally believe you're absolutely correct. The XML we're transforming uses the same elements to do different things (yuck) in a way that XSD can't (currently) validate, so we're pretty much transforming it into the form that should have been specified in the first place, which we can validate with XSD (and which is more readable anyway).

Re^2: Regex optimization: Can (?> ) and minimal match help here?
by ikegami (Patriarch) on Jun 17, 2008 at 23:43 UTC

    remember that giving everybody their favorite English quote character is nice, but it has a parsing cost.

    A real parser doesn't suffer from that problem.