Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
The task is pulling apart and reassembling some elements of an XML application. (Using someone else's existing, well-written XML parser is not an option because the reassembled version needs to preserve the line-by-line layout of the original.) The program feeds strings to the following regex to munge them; the strings themselves are guaranteed to begin with <someElementName and to end with the corresponding (balanced) </someElementName>, and this regex only needs to worry about munging those outermost tags.
The regex is defined thus:
$switchable = qr/foo|quux|glarf|boojum/; $name = qr/[^\s<>=]+/; # broader than the XML Spec, but that's OK $attValue = qr/(?:'[^'<>]*'|"[^"<>]*")/; $attribute = qr/(?:\s+(?>$name)\s*=\s*(?>$attValue))/; s/ ^ < ($switchable) ($attribute*?) # get the value of the type attribute (\s+) type (\s*=\s*) (['"]) # either quote allowable # but then the other quote char # or just about anything else # is allowed inside ... # minimal match for efficiency ((?(?<=')[^']*?|[^"]*?)) \5 # matching close quote ($attribute*?) (\s*) > (.*) # what's between tags, even newline <\/ \1 # matching end tag (\s*) > $ /<!--switch--><$6$2$3metaType$4$5$1$5$7$8>$9<\/$6$10>/sox
The expression works fine without any of the minimal-match or atomic-match modification you see here, so this is strictly an optimization problem, not a make-it-work problem.
The strings it will be applied to include both the elements it recognizes (foo, quux, glarf, boojum) and others (which it can ignore). Usually the ones it does recognize will have a type attribute, but sometimes not (in which case it can also ignore them). Usually type will be their only attribute, but very rarely not (in which case it needs to leave the other attributes alone). I know it matches fine, but I also need it to fail fine, i.e., fail without going exponential, even if the tag pair are separated by ten or twenty lines of intervening XML.
I have changed from greedy to minimal-match in several places where the matched element is expected to be short or absent and there is no danger that the next regex element could erroneously match too far to the left (this is the case with the matches to $attribute, which will usually match nothing in the sort of files I'm processing, and also the case when matching the value of the type attribute). I believe this will cut down on the amount of backtracking involved in a typical case.
I have also gotten some (?> ) modifiers in wherever two indefinite quantifiers enclose something that matches rather liberally (like $name or $attValue above). I believe I could use them around all such inner quantifiers, including the quantifiers on more limited matches like \s, but I'm not certain if it's useful.
In great measure, this is because I don't really know all the things that are or aren't already optimized in the Perl regex engine.
So, O monks, you have free rein to turn any of my greedy quantifiers to stingy or vice versa, and to put atomic matching around anything. You can even use 5.10 possessive quantifier notation if you want, and I'll translate it back to 5.8 style for my own use. What would you do to optimize this, and why would you do so? Thanks in advance.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Regex optimization: Can (?> ) and minimal match help here?
by ikegami (Patriarch) on Jun 17, 2008 at 02:59 UTC | |
|
Re: Regex optimization: Can (?> ) and minimal match help here?
by moritz (Cardinal) on Jun 17, 2008 at 08:41 UTC | |
by ikegami (Patriarch) on Jun 17, 2008 at 12:45 UTC | |
by moritz (Cardinal) on Jun 18, 2008 at 16:15 UTC | |
|
Re: Regex optimization: Can (?> ) and minimal match help here?
by eritain (Novice) on Jun 17, 2008 at 21:37 UTC | |
by ikegami (Patriarch) on Jun 17, 2008 at 23:43 UTC |