Re: RegExps, Prematch and Postmatch without efficiency penalty

Now I haven't benchmarked this assertion,

Why such a hurry?

but for now I'll risk injury and insult by taking the authors' word that unpack is faster.

Hold on. Here it comes. :)

It looks like substr is faster than unpack ...

#!/usr/bin/perl -w
use strict;
use Benchmark qw(timethese);

my $repeat = 10;
my $text = ('abc' x $repeat) . 'gotcha' . ('xyz' x $repeat);

my ($pre,$match,$post);

print "OS: $^O - Perl: $]\n";

timethese( 100000, {
    'unpack' => sub {
            if ($text =~ /gotcha/) {
                $pre = prematch($text);
                $post = postmatch($text);
                $match = match($text);
            }
        },
    'substr' => sub {
            if ($text =~ /gotcha/) {
                $pre = substr_prematch($text);
                $post = substr_postmatch($text);
                $match = substr_match($text);
            }
        },
    }
);

if ($text =~ /gotcha/) {
    print "unpack\n";
    print "prematch  :", prematch($text), "\n";
    print "match     :", match($text), "\n";
    print "postmatch :", postmatch($text), "\n";

    print "substring\n";
    print "prematch  :", substr_prematch($text), "\n";
    print "match     :", substr_match($text), "\n";
    print "postmatch :", substr_postmatch($text), "\n";
}

sub prematch {
    return unpack "a$-[0]", $_[0];
}

sub postmatch {
    return unpack "x$+[0] a*", $_[0];
}

sub match {
    my $len = $+[0] - $-[0];
    unpack "x$-[0] a$len", $_[0];
}

sub substr_match {
    substr( $_[0], $-[0], $+[0] - $-[0] )
}

sub substr_prematch {
    substr( $_[0], 0, $-[0] )
}

sub substr_postmatch {
    substr( $_[0],  $+[0] )
}
__END__
(output edited to fit the page better)

OS: cygwin - Perl: 5.008
Benchmark: timing 100000 iterations of substr, unpack...
    substr:  2 wallclock secs ( 1.92 usr +  0.00 sys =  1.92 CPU)
    unpack: 11 wallclock secs (11.89 usr +  0.00 sys = 11.89 CPU)
unpack
prematch  :abcabcabcabcabcabcabcabcabcabc
match     :gotcha
postmatch :xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz
substring
prematch  :abcabcabcabcabcabcabcabcabcabc
match     :gotcha
postmatch :xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz

OS: MSWin32 - Perl: 5.006001
Benchmark: timing 100000 iterations of substr, unpack...
    substr:  3 wallclock secs ( 1.93 usr +  0.00 sys =  1.93 CPU)
    unpack:  4 wallclock secs ( 3.87 usr +  0.00 sys =  3.87 CPU)
unpack

OS: linux - Perl: 5.006001
Benchmark: timing 100000 iterations of substr, unpack...
    substr:  2 wallclock secs ( 2.26 usr +  0.00 sys =  2.26 CPU)
    unpack:  4 wallclock secs ( 3.90 usr +  0.00 sys =  3.90 CPU)
unpack
[download]

Update (1) And the reason is the overhead of interpolating $-[0] and $+[0] in the unpack parameter. If you run the benchmark with unpack "a30" and unpack "a36" it will be faster than substr but useless in general. So The Perl Cookbook was right after all. However, the devil is in the detail ... ;)

Update (2) The turning point is when the string is at least ~~5000~~ 30,000 chars. With large strings, unpack becomes faster as advertised (Perl 5.6.1). Set $repeat to 5000, put an exit after timethese, and see the results. (Actually it is 5,000 groups of characters, thus creating a 30,000 chars string.)

Update (3) The original subs can be made slightly faster using sprintf instead of a simple interpolation. Using this version, the turning point, where the unpack version becomes faster than the substr implementation, is down to 3500 ~~chars~~ groups (= 21,000 chars).

sub prematch {
    unpack sprintf("a%d",$-[0]), $_[0];
}

sub postmatch {
    unpack sprintf( "x%d a*", $+[0]) , $_[0];
}

sub match {
    unpack sprintf ("x%d a%d", $-[0], $+[0] - $-[0] ), $_[0];
}
[download]

 _  _ _  _  
(_|| | |(_|><
 _|

Comment on Re: RegExps, Prematch and Postmatch without efficiency penalty Select or Download Code

Replies are listed 'Best First'.
Re: Re: RegExps, Prematch and Postmatch without efficiency penalty by Not_a_Number (Prior) on Sep 14, 2003 at 15:20 UTC
Just out of interest, I thought I'd compare the gain from using constructs such as these compared with $`, `$&` and `$'`. I am no benchmark expert, but basically I added this, inside gmax's `timethese()` loop: 'vanilla' => sub { if ($text =~ /gotcha/) { $pre = $`; $post = $'; $match = $&; } }, } [download] and this inside the `if ($text =~ /gotcha/) {}` loop: print "vanilla\n"; print "prematch :", $`, "\n"; print "match :", $&, "\n"; print "postmatch :", $', "\n"; [download] The (slightly edited) output, for 500,000 iterations, is clear: `Benchmark: timing 500000 iterations of substr, unpack, vanilla... substr: 5 wallclock secs <snip> @ 106678.05/s unpack: 8 wallclock secs <snip> @ 54241.70/s vanilla: 0 wallclock secs <snip> @ 492125.98/s` [download] So, on this basis alone, the much decried $`, `$&` and `$'` appear to be respectively almost five and nine times faster than the proposed 'substr' and 'unpack' subs. Of course, the real problem, in a programme of any length, does not lie here, but in the fact that "Any occurrence (of $`, `$&` and `$'`) in your program causes all matches to save the searched string for possible future reference". So I tried modifying each `timethese()` loop by adding a fairly large number of subsequent match attempts: `my ($i, $j, $k); for ('a' .. 'z', 'A' .. 'Z') { $i++ if $text2 =~ /$_/; } $j++ if $text3 =~ /quux/; $k++ if $text4 =~ /H/;` [download] (having, of course, approproately defined strings $text2, $text3 and $text4) and I was surprised to see the following output: `substr: 25 wallclock secs ... unpack: 26 wallclock secs ... vanilla: 26 wallclock secs ...` [download] (Full code here:) Read more... (3 kB) Does this mean: 1 The much decried $`, `$&` and `$'` need to be rehabilitated? 2 There's a benchmarking problem? 3 I've missed something (Most likely reply :-)? thanks dave	[reply] [d/l] [select]
Re: Re: Re: RegExps, Prematch and Postmatch without efficiency penalty by bart (Canon) on Sep 14, 2003 at 16:00 UTC
You have a benchmark problem. You see, the most vile aspect about $`, $& and $' is the fact that they affect every single regexp in your script or any module used. So if you benchmark it against other approaches that also use regular expressions, you'll slow them down, too. Conclusion: if you want to compare approaches, you should do benchmark them in separate scripts. You should also check out how speed compares between matches, especially on extremely long strings, with and without one of ($`, $&, $') being mentioned anywhere in your script. Because it's there that it matters, not where you actually make use of them, but everywhere else.	[reply]
Re: Re: Re: Re: RegExps, Prematch and Postmatch without efficiency penalty by liz (Monsignor) on Sep 14, 2003 at 16:27 UTC
...you should do benchmark them in separate scripts. This is similar to the problem I had trying to benchmark the memory footprint of threaded applications. See Benchmark::Thread::Size for an example of how I embedded scripts within scripts to get a clean memory benchmark. I think a similar (probably simpler) approach could be taken to benchmark alternate approaches to $`, $& and $'. Liz	[reply]
Re: Re: Re: Re: Re: RegExps, Prematch and Postmatch without efficiency penalty by demerphq (Chancellor) on Sep 16, 2003 at 09:41 UTC