
RegExps, Prematch and Postmatch without efficiency penalty

by davido (Cardinal)
on Sep 14, 2003 at 08:23 UTC ( id://291363 : perlmeditation )

Just about anyone who has spent any time reading the perldocs, the Camel book, or Mastering Regular Expressions knows that using $` (aka prematch), $' (aka postmatch), or $& (aka match) anywhere in your program carries a non-trivial performance penalty.

The Camel book states, (in the section entitled 'Time Efficiency') "Avoid $&, $`, and $'. Any occurrence in your program causes all matches to save the searched string for possible future reference."

The perldoc perlre states, "Warning: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program... But if you never use $&, $`, or $', then patterns without capturing parentheses will not be penalized."

We know from reading further in the perldocs that "As of 5.005, $& is not so costly as the other two." But wouldn't it be nice if there were a way to gain the functionality of the prematch, postmatch, and match special variables without tainting the entire program with the "Evil Sluggish Regexp" demon? There is.

Along came two new special variables: @- and @+. The mnemonic for @- is "Last Match Start", and $-[0] is the offset of the start of the last successful match. The mnemonic for @+ is "Last Match End", and $+[0] is the offset into the string of the end of the entire match.

Reading further, perlvar tells us the following:

  • $` is the same as substr( $var, 0, $-[0] )
  • $' is the same as substr( $var, $+[0] )
  • $& is the same as substr( $var, $-[0], $+[0] - $-[0] )
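
These equivalences are easy to check with a quick sketch (string and pattern here are my own, chosen for illustration):

```perl
use strict;
use warnings;

my $var = "The quick brown fox";
if ( $var =~ /quick/ ) {
    # Reconstruct the three special variables from @- and @+
    my $pre   = substr( $var, 0, $-[0] );              # same as $`
    my $post  = substr( $var, $+[0] );                 # same as $'
    my $whole = substr( $var, $-[0], $+[0] - $-[0] );  # same as $&
    print "$pre|$whole|$post\n";   # prints "The |quick| brown fox"
}
```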

Next, the Perl Cookbook, in the section on "Accessing Substrings", says, "Although unpack is not lvaluable, it is considerably faster than substr when you extract numerous values at once." Now I haven't benchmarked this assertion, but for now I'll risk injury and insult by taking the authors' word that unpack is faster. I could be misinterpreting what the book is stating; it might only be suggesting that extracting a list of fields in one template, as in "x6 a2 x5 a5", is quicker than using several separate substr calls. Once I run the benchmarks I'll make the final decision on whether to use substr or unpack.
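
For reference, the kind of multi-field extraction the Cookbook is talking about looks like this (the record layout here is made up for illustration):

```perl
use strict;
use warnings;

# A fixed-width record: skip 6 chars, take a 2-char code,
# skip 5 more, then take a 5-char name.
my $record = "hdr000IDskip_smith";
my ( $code, $name ) = unpack "x6 a2 x5 a5", $record;
print "$code $name\n";   # prints "ID smith"
```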

"Where is he going with all this?" you may be wondering by now. Well, in a spare moment I set out to create three small, efficient functions to mimick the Prematch, Postmatch, and Match special variables, while encapsulating and hiding the ugly and easily bungled specifics. My functions fall short of exactly mimicking the special variables in that they require that you pass to them the original string, or a copy of the original string. That means that if you use the substitution operator, "s///", you'll have to first make a copy of the unaltered string to pass to the function. Passing the altered version is going to wrek all sorts of havoc, and the universe may begin its great collapse (or you might just get the wrong return value).

Here is the code:

    sub prematch  { return unpack "a$-[0]", $_[0]; }
    sub postmatch { return unpack "x$+[0] a*", $_[0]; }
    sub match {
        my $len = $+[0] - $-[0];
        unpack "x$-[0] a$len", $_[0];
    }

And now here is a working example of using these three functions:

    use strict;
    use warnings;

    my $string = "This is some pretty darn boring text";
    if ( $string =~ /darn/ ) {
        print prematch($string),  "\n";
        print postmatch($string), "\n";
        print match($string),     "\n";
    }

I have also tested the functions with the m//g modifier, and they still work as expected. On each iteration with /g (assuming $string contains some repetition that we're matching on), the functions return the proper portions of the string, which is what we want.
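
A quick sketch of that /g behavior (the string, pattern, and offsets below are my own example, not from the original tests):

```perl
use strict;
use warnings;

sub prematch  { return unpack "a$-[0]", $_[0]; }
sub postmatch { return unpack "x$+[0] a*", $_[0]; }
sub match {
    my $len = $+[0] - $-[0];
    unpack "x$-[0] a$len", $_[0];
}

my $string = "one fish two fish";
while ( $string =~ /fish/g ) {
    # @- and @+ are refreshed on each successful match,
    # so the functions track the current iteration.
    print match($string), " at offset $-[0]\n";
}
# prints:
#   fish at offset 4
#   fish at offset 13
```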

You may notice that I don't explicitly pass @- and @+ into the functions. I could easily change the functions to do that, but special variables are called "Global Special Variables" for a reason: they're already global. I'm not stomping on any namespace by leaving them that way, and they're always going to be there, as long as a match took place. If it didn't take place, you shouldn't be calling these functions anyway. If there is a compelling reason to pass these special variables into the functions, I'd be interested in knowing it, and I can update my functions accordingly.

Another obvious shortcoming is that we are passing an actual string. I tried to minimize the overhead of doing so by not using shift to pull $_[0] out of @_; instead I passed $_[0] directly to unpack. I'm not sure that I gain anything by doing this... there's something else for me to benchmark. I might have set the functions up to take the string as a reference, but then it would have to be dereferenced before being passed to unpack anyway, so I didn't see that as an efficiency gain.

What does all this accomplish? These functions provide an easy-to-use (and easy-to-remember) interface to functionality similar to that of the special variables $`, $&, and $', without the risk of beating your entire program with the sluggish stick. I hope I haven't duplicated functionality that already comes nicely packaged and modularized on CPAN; I didn't find it in a quick search, if it is already there. So I present these functions for further discussion, and maybe even practical use.

UPDATE: Thanks gmax for providing the benchmark comparisons between substr and unpack. I suspected from the outset that there might be some penalty in the fact that unpack requires the template string to be parsed, or at least interpolated, and your benchmarks have borne that out. I am not surprised to see that unpack becomes faster on strings greater than about 30k.

The point of using the most efficient method is simply to be as unobtrusive as possible, given that a couple of subroutine calls are necessary to roughly emulate the functionality of simply referring to a scalar variable. It should be understood from the outset that an individual call to prematch() cannot be as fast as an individual use of $`. In the aggregate, however, using the prematch() function will not have a negative effect on every single other regexp in your entire program (including modules used in the program), as is the case with $` and $'. You are trading off convenience and immediate speed for overall runtime efficiency throughout the remainder of the program.

So the answer is that in time critical applications where you are certain that you don't care about the cost of $` elsewhere in the program, use $` and $'. In applications where you do care about overall runtime performance, but are somewhat less critical of an individual use of a single regexp, use prematch() and postmatch(). And if you know your strings are going to be shorter than 30k, modify the functions to use substr instead of unpack.

I also wanted to point out that sometimes the whole issue can be rendered moot by simply looking at the problem through a different set of glasses. Why use m// with $`, $', and $& when, in many cases, split can do what you want? While the behavior isn't identical, split offers the ability to capture the delimiter as though it were the match, and then treat what came before and after it as the prematch and postmatch.

my ($pre, $match, $post) = split /\b(word)\b/, $string, 2;

This method has its limitations, but it also has possibilities that aren't easily implemented with simple matching and special variables.
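
A quick demonstration of the split approach (the string here is my own example):

```perl
use strict;
use warnings;

my $string = "before word after";
# The capturing parens return the delimiter itself as a list element;
# the limit of 2 keeps the trailing portion in one piece.
my ( $pre, $match, $post ) = split /\b(word)\b/, $string, 2;
print "pre=<$pre> match=<$match> post=<$post>\n";
# prints "pre=<before > match=<word> post=< after>"
```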

.... End UPDATE ....


"If I had my life to do over again, I'd be a plumber." -- Albert Einstein

Replies are listed 'Best First'.
Re: RegExps, Prematch and Postmatch without efficiency penalty
by gmax (Abbot) on Sep 14, 2003 at 10:48 UTC
    Now I haven't benchmarked this assertion,

    Why such a hurry?

    but for now I'll risk injury and insult by taking the authors' word that unpack is faster.

    Hold on. Here it comes. :)

    It looks like substr is faster than unpack ...

    #!/usr/bin/perl -w
    use strict;
    use Benchmark qw(timethese);

    my $repeat = 10;
    my $text = ('abc' x $repeat) . 'gotcha' . ('xyz' x $repeat);
    my ($pre, $match, $post);

    print "OS: $^O - Perl: $]\n";

    timethese( 100000, {
        'unpack' => sub {
            if ($text =~ /gotcha/) {
                $pre   = prematch($text);
                $post  = postmatch($text);
                $match = match($text);
            }
        },
        'substr' => sub {
            if ($text =~ /gotcha/) {
                $pre   = substr_prematch($text);
                $post  = substr_postmatch($text);
                $match = substr_match($text);
            }
        },
    } );

    if ($text =~ /gotcha/) {
        print "unpack\n";
        print "prematch  :", prematch($text),  "\n";
        print "match     :", match($text),     "\n";
        print "postmatch :", postmatch($text), "\n";
        print "substring\n";
        print "prematch  :", substr_prematch($text),  "\n";
        print "match     :", substr_match($text),     "\n";
        print "postmatch :", substr_postmatch($text), "\n";
    }

    sub prematch  { return unpack "a$-[0]", $_[0]; }
    sub postmatch { return unpack "x$+[0] a*", $_[0]; }
    sub match {
        my $len = $+[0] - $-[0];
        unpack "x$-[0] a$len", $_[0];
    }
    sub substr_match     { substr( $_[0], $-[0], $+[0] - $-[0] ) }
    sub substr_prematch  { substr( $_[0], 0, $-[0] ) }
    sub substr_postmatch { substr( $_[0], $+[0] ) }

    __END__
    (output edited to fit the page better)

    OS: cygwin - Perl: 5.008
    Benchmark: timing 100000 iterations of substr, unpack...
        substr:  2 wallclock secs ( 1.92 usr +  0.00 sys =  1.92 CPU)
        unpack: 11 wallclock secs (11.89 usr +  0.00 sys = 11.89 CPU)
    unpack
    prematch  :abcabcabcabcabcabcabcabcabcabc
    match     :gotcha
    postmatch :xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz
    substring
    prematch  :abcabcabcabcabcabcabcabcabcabc
    match     :gotcha
    postmatch :xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz

    OS: MSWin32 - Perl: 5.006001
    Benchmark: timing 100000 iterations of substr, unpack...
        substr: 3 wallclock secs ( 1.93 usr + 0.00 sys = 1.93 CPU)
        unpack: 4 wallclock secs ( 3.87 usr + 0.00 sys = 3.87 CPU)

    OS: linux - Perl: 5.006001
    Benchmark: timing 100000 iterations of substr, unpack...
        substr: 2 wallclock secs ( 2.26 usr + 0.00 sys = 2.26 CPU)
        unpack: 4 wallclock secs ( 3.90 usr + 0.00 sys = 3.90 CPU)

    Update (1) And the reason is the overhead of interpolating $-[0] and $+[0] into the unpack template. If you run the benchmark with hard-coded templates such as unpack "a30" and unpack "a36", it will be faster than substr, but useless in general. So the Perl Cookbook was right after all. However, the devil is in the details ... ;)

    Update (2) The turning point is when the string is around 30,000 chars. With large strings, unpack becomes faster as advertised (Perl 5.6.1). Set $repeat to 5000 (i.e. 5,000 three-character groups on each side of the match, giving a string of about 30,000 chars), put an exit after timethese, and see the results.

    Update (3) The original subs can be made slightly faster by using sprintf instead of simple interpolation. With this version, the turning point, where the unpack implementation becomes faster than the substr implementation, is down to about 3,500 character groups (= 21,000 chars).

    sub prematch  { unpack sprintf( "a%d", $-[0] ), $_[0]; }
    sub postmatch { unpack sprintf( "x%d a*", $+[0] ), $_[0]; }
    sub match     { unpack sprintf( "x%d a%d", $-[0], $+[0] - $-[0] ), $_[0]; }

      Just out of interest, I thought I'd compare the gain from using constructs such as these compared with $`, $& and $'. I am no benchmark expert, but basically I added this, inside gmax's timethese() loop:

      'vanilla' => sub {
          if ($text =~ /gotcha/) {
              $pre   = $`;
              $post  = $';
              $match = $&;
          }
      },

      and this inside the if ($text =~ /gotcha/) {} loop:

      print "vanilla\n"; print "prematch :", $`, "\n"; print "match :", $&, "\n"; print "postmatch :", $', "\n";

      The (slightly edited) output, for 500,000 iterations, is clear:

      Benchmark: timing 500000 iterations of substr, unpack, vanilla...
          substr:  5 wallclock secs <snip> @ 106678.05/s
          unpack:  8 wallclock secs <snip> @  54241.70/s
         vanilla:  0 wallclock secs <snip> @ 492125.98/s

      So, on this basis alone, the much-decried $`, $& and $' appear to be almost five and nine times faster, respectively, than the proposed 'substr' and 'unpack' subs.

      Of course, the real problem, in a programme of any length, does not lie here, but in the fact that "Any occurrence (of $`, $& and $') in your program causes all matches to save the searched string for possible future reference". So I tried modifying each timethese() loop by adding a fairly large number of subsequent match attempts:

      my ($i, $j, $k);
      for ('a' .. 'z', 'A' .. 'Z') {
          $i++ if $text2 =~ /$_/;
      }
      $j++ if $text3 =~ /quux/;
      $k++ if $text4 =~ /H/;

      (having, of course, appropriately defined the strings $text2, $text3 and $text4) and I was surprised to see the following output:

          substr: 25 wallclock secs ...
          unpack: 26 wallclock secs ...
         vanilla: 26 wallclock secs ...


      Does this mean:

      1. The much-decried $`, $& and $' need to be rehabilitated?
      2. There's a benchmarking problem?
      3. I've missed something? (Most likely reply :-)



        You have a benchmark problem.

        You see, the most vile aspect of $`, $& and $' is the fact that they affect every single regexp in your script and in any module it uses. So if you benchmark them against other approaches that also use regular expressions, you'll slow those down, too. Conclusion: if you want to compare the approaches, you should benchmark them in separate scripts.
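
A minimal sketch of what "separate scripts" might look like: each candidate runs in its own perl interpreter via system, so mentioning $` in one cannot taint the matches in the other. (The string sizes and iteration counts here are my own, scaled down for illustration.)

```perl
use strict;
use warnings;

# Two standalone benchmark programs, run in separate interpreters.
my $clean = q{
    use Benchmark qw(timethese);
    my $t = ("x" x 100) . "gotcha" . ("y" x 100);
    timethese( 50_000, { clean => sub { $t =~ /gotcha/ } } );
};
my $tainted = q{
    use Benchmark qw(timethese);
    my $t = ("x" x 100) . "gotcha" . ("y" x 100);
    my $peek = sub { $` };   # merely mentioning $` taints every match
    timethese( 50_000, { tainted => sub { $t =~ /gotcha/ } } );
};
my $rc_clean   = system( $^X, '-e', $clean );
my $rc_tainted = system( $^X, '-e', $tainted );
```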

        You should also check how the speed of matches compares, especially on extremely long strings, with and without one of $`, $& or $' being mentioned anywhere in your script. Because that's where it matters: not where you actually make use of them, but everywhere else.