Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've been working with a somewhat ugly regex that contains nested quantifiers. In the event that the match fails, those are asking for trouble. More generally, even when the match succeeds, I would like it to succeed quickly. But I remain uncertain as to where I can and cannot get benefits from atomic matching or minimal matching. I know what they do, it's just that the environment of use is hairy. I'll explain the task and give you the regex, and hope you've got some experience/instincts/profound understanding to contribute.

The task is pulling apart and reassembling some elements of an XML application. (Using someone else's existing, well-written XML parser is not an option because the reassembled version needs to preserve the line-by-line layout of the original.) The program feeds strings to the following regex to munge them; the strings themselves are guaranteed to begin with <someElementName and to end with the corresponding (balanced) </someElementName>, and this regex only needs to worry about munging those outermost tags.

The regex is defined thus:

$switchable = qr/foo|quux|glarf|boojum/; $name = qr/[^\s<>=]+/; # broader than the XML Spec, but that's OK $attValue = qr/(?:'[^'<>]*'|"[^"<>]*")/; $attribute = qr/(?:\s+(?>$name)\s*=\s*(?>$attValue))/; s/ ^ < ($switchable) ($attribute*?) # get the value of the type attribute (\s+) type (\s*=\s*) (['"]) # either quote allowable # but then the other quote char # or just about anything else # is allowed inside ... # minimal match for efficiency ((?(?<=')[^']*?|[^"]*?)) \5 # matching close quote ($attribute*?) (\s*) > (.*) # what's between tags, even newline <\/ \1 # matching end tag (\s*) > $ /<!--switch--><$6$2$3metaType$4$5$1$5$7$8>$9<\/$6$10>/sox

The expression works fine without any of the minimal-match or atomic-match modification you see here, so this is strictly an optimization problem, not a make-it-work problem.

The strings it will be applied to include both the elements it recognizes (foo, quux, glarf, boojum) and others (which it can ignore). Usually the ones it does recognize will have a type attribute, but sometimes not (in which case it can also ignore them). Usually type will be their only attribute, but very rarely not (in which case it needs to leave the other attributes alone). I know it matches fine, but I also need it to fail fine, i.e., fail without going exponential, even if the tag pair are separated by ten or twenty lines of intervening XML.

I have changed from greedy to minimal-match in several places where the matched element is expected to be short or absent and there is no danger that the next regex element could erroneously match too far to the left (this is the case with the matches to $attribute, which will usually match nothing in the sort of files I'm processing, and also the case when matching the value of the type attribute). I believe this will cut down on the amount of backtracking involved in a typical case.

I have also gotten some (?> ) modifiers in wherever two indefinite quantifiers enclose something that matches rather liberally (like $name or $attValue above). I believe I could use them around all such inner quantifiers, including the quantifiers on more limited matches like \s, but I'm not certain if it's useful.

In great measure, this is because I don't really know all the things that are or aren't already optimized in the Perl regex engine.

So, O monks, you have free rein to turn any of my greedy quantifiers to stingy or vice versa, and to put atomic matching around anything. You can even use 5.10 possessive quantifier notation if you want, and I'll translate it back to 5.8 style for my own use. What would you do to optimize this, and why would you do so? Thanks in advance.

Replies are listed 'Best First'.
Re: Regex optimization: Can (?> ) and minimal match help here?
by ikegami (Patriarch) on Jun 17, 2008 at 02:59 UTC

    I wouldn't expect much backtracking from your regexp, even in a failure case, so (?>...) will probably be of limited use. For example, either it finds an attribute or it doesn't, there's nothing else to try to match. You can try playing with (?>...) to find out for sure.

Re: Regex optimization: Can (?> ) and minimal match help here?
by moritz (Cardinal) on Jun 17, 2008 at 08:41 UTC
    I think $attValue could safely use non-backtracking groups. I think that ($attribute*?) really should be ($attribute*), because if you have multiple attributes, the RE engine first try to match zero, then one, then two etc., instead of matching them all in the first place.
    (.*) # what's between tags, even newline

    Are you sure you want that greedy, and with no further restrictions? Maybe it's appropriate in your case, but in the general case it's not good when parsing XML ;-)

      I think that ($attribute*?) really should be ($attribute*), because if you have multiple attributes, the RE engine first try to match zero, then one, then two etc., instead of matching them all in the first place.

      Yes, it would match them all, but that would be a bug. He doesn't want it to match the type= attribute since he's replacing it.

        You're right. One possible optimization is to change the regex for attributes not to match type (with negative lookahead), and then use the * star. I don't know if that's actually faster, but I'm sure the OP will benchmark it if speed is really important.
Re: Regex optimization: Can (?> ) and minimal match help here?
by eritain (Novice) on Jun 17, 2008 at 21:37 UTC
    (Original poster here, de-anonymized.) One or two side-notes, for any XML geeks out there.

    First, according to the spec, AttValues are allowed to contain pretty much anything except for < and their own closing quote character (I view this as a design error). I have here excluded > as well, but there will be a pre-filter to prevent stupid/ugly XML like that from ever getting this far. (Preceding program logic also ensures that this regex only ever sees strings bounded by balanced tags, so that's another worry gone.)

    Second, according to the spec, a Name is a little more restricted in its composition than what I've got here. But the regex will still pick out a Name and only a Name because its character class excludes whitespace and =.

    Third, I apologize for the really gross bit that captures the value of type, with the lookbehind and the conditional, but it solves a problem (namely, How do you capture just what's between the quotes, when you don't know what's allowed between the quotes until you've seen them?). It's easier to read if you imagine it says ($attValue) and just mentally supply the magic that breaks off the quote marks (and I coded such a version, using s///e, but it was slooooow compared to the lookbehind/conditional). If any XML geek reads this who happens to end up on a future spec committee, remember that giving everybody their favorite English quote character is nice, but it has a parsing cost.

    Fourth, if you're saying that weird transformations of XML like this really shouldn't be necessary, I personally believe you're absolutely correct. The XML we're transforming uses the same elements to do different things, yuck, in a way that XSD can't (currently) validate, so we're pretty much transforming it into the form that should have been specified in the first place, which we can validate with XSD (and which is more readable anyway).

      remember that giving everybody their favorite English quote character is nice, but it has a parsing cost.

      A real parser doesn't suffer from that problem.