Re: Re: RegExps, Prematch and Postmatch without efficiency penalty

Just out of interest, I thought I'd measure what is gained by using constructs such as these instead of $`, $& and $'. I am no benchmark expert, but basically I added this inside gmax's timethese() call:

    'vanilla' => sub {
            if ($text =~ /gotcha/) {
                $pre = $`;
                $post = $';
                $match = $&;
            }
        },
    }

and this inside the if ($text =~ /gotcha/) {} block:

    print "vanilla\n";
    print "prematch  :", $`, "\n";
    print "match     :", $&, "\n";
    print "postmatch :", $', "\n";

The (slightly edited) output, for 500,000 iterations, is clear:

Benchmark: timing 500000 iterations of substr, unpack, vanilla...
substr:  5 wallclock secs <snip> @ 106678.05/s
unpack:  8 wallclock secs <snip> @ 54241.70/s
vanilla: 0 wallclock secs <snip> @ 492125.98/s

So, on this basis alone, the much decried $`, $& and $' appear to be roughly 4.6 and 9 times faster than the proposed 'substr' and 'unpack' subs, respectively.
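For anyone who has not met the 'substr' approach being timed here, this is a minimal sketch of the idea: after a successful match, $-[0] and $+[0] hold the start and end offsets of the whole match, so the three pieces can be cut out of the original string by hand. The test string is made up for illustration and is shorter than the one in the benchmark.

```perl
use strict;
use warnings;

# Illustrative string only; not the one used in the benchmark.
my $text = 'abcabc' . 'gotcha' . 'xyzxyz';

my ($pre, $match, $post);

if ($text =~ /gotcha/) {
    # $-[0] is the offset where the whole match starts,
    # $+[0] is the offset just past its end.
    my ($start, $end) = ($-[0], $+[0]);

    $pre   = substr($text, 0, $start);              # text before the match
    $match = substr($text, $start, $end - $start);  # the matched text
    $post  = substr($text, $end);                   # text after the match

    print "prematch  : $pre\n";
    print "match     : $match\n";
    print "postmatch : $post\n";
}
```

The offsets are only valid until the next successful match, so they have to be read straight away, which is what the subs in the benchmark do.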

Of course, the real problem, in a programme of any length, does not lie here, but in the fact that "Any occurrence (of $`, $& and $') in your program causes all matches to save the searched string for possible future reference". So I tried modifying each timethese() loop by adding a fairly large number of subsequent match attempts:

my ($i, $j, $k);
for ('a' .. 'z', 'A' .. 'Z') {
  $i++ if $text2 =~ /$_/;
}
$j++ if $text3 =~ /quux/;
$k++ if $text4 =~ /H/;

(having, of course, appropriately defined strings $text2, $text3 and $text4) and I was surprised to see the following output:

substr: 25 wallclock secs ...
unpack: 26 wallclock secs ...
vanilla: 26 wallclock secs ...

(Full code here:)

use strict;
use warnings;
use Benchmark qw(timethese);

open OUT, '>', 'temp.txt' or die "can't open $!";
select OUT;

my $repeat = 10;
my $text = ('abc' x $repeat) . 'gotcha' . ('xyz' x $repeat);
my $text2 = 'Fork over rice before serving';
my $text3 = 'foo bar baz quux';
my $text4 = 'Just Another Perl hacker';

my ($pre,$match,$post);

print "OS: $^O - Perl: $]\n";

timethese( 100000, {
   'unpack' => sub {
      if ($text =~ /gotcha/) {
         $pre = prematch($text);
         $post = postmatch($text);
         $match = match($text);
      }
      my ($i, $j, $k);
      for ('a' .. 'z', 'A' .. 'Z') {
         $i++ if $text2 =~ /$_/;
      }
      $j++ if $text3 =~ /quux/;
      $k++ if $text4 =~ /H/;
   },
   'substr' => sub {
      if ($text =~ /gotcha/) {
         $pre = substr_prematch($text);
         $post = substr_postmatch($text);
         $match = substr_match($text);
      }
      my ($i, $j, $k);
      for ('a' .. 'z', 'A' .. 'Z') {
         $i++ if $text2 =~ /$_/;
      }
      $j++ if $text3 =~ /quux/;
      $k++ if $text4 =~ /H/;
   },
   'vanilla' => sub {
      if ($text =~ /gotcha/) {
         $pre = $`;
         $post = $';
         $match = $&;
      }
      my ($i, $j, $k);
      for ('a' .. 'z', 'A' .. 'Z') {
         $i++ if $text2 =~ /$_/;
      }
      $j++ if $text3 =~ /quux/;
      $k++ if $text4 =~ /H/;
   },
 }
);

if ($text =~ /gotcha/) {
    print "unpack\n";
    print "prematch  :", prematch($text), "\n";
    print "match     :", match($text), "\n";
    print "postmatch :", postmatch($text), "\n";

    print "substring\n";
    print "prematch  :", substr_prematch($text), "\n";
    print "match     :", substr_match($text), "\n";
    print "postmatch :", substr_postmatch($text), "\n";

    print "vanilla\n";
    print "prematch  :", $`, "\n";
    print "match     :", $&, "\n";
    print "postmatch :", $', "\n";
}

sub prematch {
    return unpack "a$-[0]", $_[0];
}

sub postmatch {
    return unpack "x$+[0] a*", $_[0];
}

sub match {
    my $len = $+[0] - $-[0];
    unpack "x$-[0] a$len", $_[0];
}

sub substr_match {
    substr( $_[0], $-[0], $+[0] - $-[0] )
}

sub substr_prematch {
    substr( $_[0], 0, $-[0] )
}

sub substr_postmatch {
    substr( $_[0],  $+[0] )
}

Does this mean:

1 The much decried $`, $& and $' need to be rehabilitated?

2 There's a benchmarking problem?

3 I've missed something (Most likely reply :-)?

thanks

dave


Re: Re: Re: RegExps, Prematch and Postmatch without efficiency penalty by bart (Canon) on Sep 14, 2003 at 16:00 UTC
You have a benchmark problem. You see, the most vile aspect of $`, $& and $' is the fact that they affect every single regexp in your script or any module used. So if you benchmark them against other approaches that also use regular expressions, you'll slow those down, too. Conclusion: if you want to compare approaches, you should benchmark them in separate scripts. You should also check how speed compares between matches, especially on extremely long strings, with and without one of ($`, $&, $') being mentioned anywhere in your script. Because that is where it matters: not where you actually make use of them, but everywhere else.
Re: Re: Re: Re: RegExps, Prematch and Postmatch without efficiency penalty by liz (Monsignor) on Sep 14, 2003 at 16:27 UTC
"...you should benchmark them in separate scripts."

This is similar to the problem I had trying to benchmark the memory footprint of threaded applications. See Benchmark::Thread::Size for an example of how I embedded scripts within scripts to get a clean memory benchmark. I think a similar (probably simpler) approach could be taken to benchmark alternate approaches to $`, $& and $'.

Liz
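A rough sketch of what "separate scripts" could look like for this case, with the child scripts embedded as strings in one driver. It is not how Benchmark::Thread::Size does it; the workload, the string length, the iteration count and the names %variant and never_called are all invented for illustration. Each child runs in its own perl process, so mentioning the match variable in one of them cannot affect the other.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Workload shared by both children: repeated matches against a long
# string. Each child times only its own loop, not interpreter start-up.
my $workload = <<'END_WORKLOAD';
use strict;
use warnings;
use Time::HiRes qw(time);

my $long  = ('abc' x 100_000) . 'gotcha';
my $hits  = 0;
my $start = time();
for (1 .. 200) {
    $hits++ if $long =~ /gotcha/;
}
printf "%-10s %d hits in %.4f s\n", $ARGV[0], $hits, time() - $start;
END_WORKLOAD

# The second variant merely mentions the match variable in a sub that
# is never called; nothing else differs between the two scripts.
my %variant = (
    clean     => $workload,
    mentioned => $workload . "sub never_called { return \$& }\n",
);

for my $name (sort keys %variant) {
    # Write the child script to a temporary file and run it with the
    # same perl binary that is running this driver.
    my ($fh, $file) = tempfile(SUFFIX => '.pl', UNLINK => 1);
    print {$fh} $variant{$name};
    close $fh or die "can't close $file: $!";

    system($^X, $file, $name) == 0
        or die "child '$name' failed: $?";
}
```

How large the gap between the two lines of output is will depend on the perl version and the length of the string; on perls much newer than the ones discussed in this thread it may be small or absent.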
Re: Re: Re: Re: Re: RegExps, Prematch and Postmatch without efficiency penalty by demerphq (Chancellor) on Sep 16, 2003 at 09:41 UTC
"...for an example of how I embedded scripts within scripts to get a clean memory benchmark."

I started work on a generic tool to do this as part of my OO::Benchmark stuff. Never got around to finishing it though.

--- demerphq
<Elian> And I do take a kind of perverse pleasure in having an OO assembly language...