|We don't bite newbies here... much
RegExps, Prematch and Postmatch without efficiency penaltyby davido (Cardinal)
|on Sep 14, 2003 at 08:23 UTC
Just about anyone who has spent any time reading the Perldocs, the Camel book, or Mastering Regular Expressions knows that the use of $` (aka, prematch), $' (aka, postmatch), and $& (aka, match) anywhere in your program will carry with it a non-trivial speed efficiency hit.
The Camel book states, (in the section entitled 'Time Efficiency') "Avoid $&, $`, and $'. Any occurrence in your program causes all matches to save the searched string for possible future reference."
The perldoc, perlre states, "Warning: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program..... But if you never use $&, $`, or $', then patterns without capturing parenthesis will not be penalized.".
We know by reading further in the perldocs that "As of 5.005, $& is not so costly as the other two." But wouldn't it be nice if there was a way to gain the functionality of the Prematch, Postmatch, and Match 'special variables' without tainting the entire program with the "Evil Sluggish Regexp" demon? There is.
Along came some new special variables. @-, and @+. The mnemonic for @- is "Last Match Start", and $- is the offset of the start of the last successful match. @+'s mnemonic is "Last Match End." And $+ is the offset into the string of the end of the entire match.
Reading in 'perlvar' it tells us the following:
Next, in the Perl Cookbook, in the section that discusses "Accessing Substrings", it says, "Although unpack is not lvaluable, it is considerably faster than substr when you extract numerous values at once." Now I haven't benchmarked this assertion, but for now I'll risk injury and insult by taking the authors' word that unpack is faster. I could be misintrepreting what the book is stating. It might be suggesting that extracting a list of things, as in "x6 a2 x5 a5" is quicker than using several substr's. Once I run the benchmarks I'll make the final decision on whether to use substr or unpack.
"Where is he going with all this?" you may be wondering by now. Well, in a spare moment I set out to create three small, efficient functions to mimick the Prematch, Postmatch, and Match special variables, while encapsulating and hiding the ugly and easily bungled specifics. My functions fall short of exactly mimicking the special variables in that they require that you pass to them the original string, or a copy of the original string. That means that if you use the substitution operator, "s///", you'll have to first make a copy of the unaltered string to pass to the function. Passing the altered version is going to wrek all sorts of havoc, and the universe may begin its great collapse (or you might just get the wrong return value).
Here is the code:
And now here is a working example of using these three functions:
I have also tested the functions using the m//g modifier, and they still work as expected. In each iteration with /g, (assuming the $string has some repetiton in it that we're matching on) the functions return the proper portions of the string, which is what we want.
You may notice that I don't explicitly pass @- and @+ into the functions. I could easily change the functions to do that, but special variables are called, "Global Special Variables" for a reason; they're already global. I'm not stomping on namespace by leaving them that way, and they're always going to be there, as long as the match took place. If it didn't take place, you shouldn't be calling these functions anyway. If there is a compelling reason to pass these special variables into the functions I'd be interested in knowing, and I can update my functions accordingly. Another obvious shortcoming is that we are passing an actual string. I tried to minimize the overhead of doing so by not using 'shift' to pull $_ out of @_. Instead I passed $_ directly to unpack. I'm not sure that I gain anything by doing this. ...there's something else for me to benchmark. I might have set the functions up to take the string as a reference, but then it would have to be dereferenced before passing into unpack anyway, so I didn't see that as an efficiency gain.
What does all this accomplish? These functions provide an easy to use (and easy to remember) interface into gaining similar functionality to that provided by the special variables, $`, $&, and $', without the risk of beating your entire program with the sluggish stick. I hope I haven't duplicated functionality that already comes nicely packaged and modulized on CPAN. I didn't find it right away in searching, if it is already there. So I present these functions for further discussion, and maybe even practical use.
UPDATE: Thanks gmax for providing the benchmark comparisons between substr and unpack. I suspected from the outset that there might be some penalty involved in the fact that unpack requires that the template string be parsed or at least interpolated. Your benchmarks have born that out. I am not surprised to see that unpack becomes faster on strings greater than about 30k.
The point to using the most efficient method is simply to be as inobtrusive as possible, given that a couple subroutine calls are necessary to roughly emulate the functionality of simply refering to a scalar variable. It should be understood from the outset that it is not possible for an individual call to prematch() to be as fast as an individual use of $`. However, in the aggregate, using the prematch() function will not have a negative effect on every single other regexp in your entire program (including modules used in the program), as is the case with using $` and $'. You are trading off convenience and immediate speed efficiency for overall runtime efficiency throughout the remainder of the program.
So the answer is that in time critical applications where you are certain that you don't care about the cost of $` elsewhere in the program, use $` and $'. In applications where you do care about overall runtime performance, but are somewhat less critical of an individual use of a single regexp, use prematch() and postmatch(). And if you know your strings are going to be shorter than 30k, modify the functions to use substr instead of unpack.
I also wanted to point out that sometimes the whole issue can be rendered moot by simply looking at the problem through a different set of glasses. Why use m// with $`, $', and $&, if in many cases, split can do what you want. While behavior isn't identical, split offers the ability to capture the delimeter as though it were the match, and then treat what came before it and after it as the prematch and postmatch.
This method has its limitations, but it also has possibilities that aren't easily implemented with simple matching and special variables.
.... End UPDATE ....
"If I had my life to do over again, I'd be a plumber." -- Albert Einstein