PerlMonks  

Re: Re: RegExps, Prematch and Postmatch without efficiency penalty

by Not_a_Number (Prior)
on Sep 14, 2003 at 15:20 UTC ( [id://291402]=note )


in reply to Re: RegExps, Prematch and Postmatch without efficiency penalty
in thread RegExps, Prematch and Postmatch without efficiency penalty

Just out of interest, I thought I'd measure how much is gained by using constructs such as these instead of $`, $& and $'. I am no benchmark expert, but basically I added this to the hash of subs passed to gmax's timethese() call:

    'vanilla' => sub {
        if ($text =~ /gotcha/) {
            $pre   = $`;
            $post  = $';
            $match = $&;
        }
    },
    }

and this inside the if ($text =~ /gotcha/) {} block:

    print "vanilla\n";
    print "prematch :", $`, "\n";
    print "match :", $&, "\n";
    print "postmatch :", $', "\n";

The (slightly edited) output, for 500,000 iterations, is clear:

    Benchmark: timing 500000 iterations of substr, unpack, vanilla...
        substr:  5 wallclock secs <snip> @ 106678.05/s
        unpack:  8 wallclock secs <snip> @ 54241.70/s
       vanilla:  0 wallclock secs <snip> @ 492125.98/s

So, on this basis alone, the much decried $`, $& and $' appear to be roughly 4.6 and 9 times faster than the proposed 'substr' and 'unpack' subs respectively.

Of course, the real problem, in a programme of any length, does not lie here, but in the fact that "Any occurrence (of $`, $& and $') in your program causes all matches to save the searched string for possible future reference". So I tried modifying each sub passed to timethese() by adding a fairly large number of subsequent match attempts:

    my ($i, $j, $k);
    for ('a' .. 'z', 'A' .. 'Z') {
        $i++ if $text2 =~ /$_/;
    }
    $j++ if $text3 =~ /quux/;
    $k++ if $text4 =~ /H/;

(having, of course, appropriately defined strings $text2, $text3 and $text4) and I was surprised to see the following output:

    substr: 25 wallclock secs ...
    unpack: 26 wallclock secs ...
    vanilla: 26 wallclock secs ...

(Full code here:)

    use strict;
    use warnings;
    use Benchmark qw(timethese);

    open OUT, '>', 'temp.txt' or die "can't open $!";
    select OUT;

    my $repeat = 10;
    my $text  = ('abc' x $repeat) . 'gotcha' . ('xyz' x $repeat);
    my $text2 = 'Fork over rice before serving';
    my $text3 = 'foo bar baz quux';
    my $text4 = 'Just Another Perl hacker';
    my ($pre,$match,$post);

    print "OS: $^O - Perl: $]\n";

    timethese( 100000, {
        'unpack' => sub {
            if ($text =~ /gotcha/) {
                $pre   = prematch($text);
                $post  = postmatch($text);
                $match = match($text);
            }
            my ($i, $j, $k);
            for ('a' .. 'z', 'A' .. 'Z') {
                $i++ if $text2 =~ /$_/;
            }
            $j++ if $text3 =~ /quux/;
            $k++ if $text4 =~ /H/;
        },
        'substr' => sub {
            if ($text =~ /gotcha/) {
                $pre   = substr_prematch($text);
                $post  = substr_postmatch($text);
                $match = substr_match($text);
            }
            my ($i, $j, $k);
            for ('a' .. 'z', 'A' .. 'Z') {
                $i++ if $text2 =~ /$_/;
            }
            $j++ if $text3 =~ /quux/;
            $k++ if $text4 =~ /H/;
        },
        'vanilla' => sub {
            if ($text =~ /gotcha/) {
                $pre   = $`;
                $post  = $';
                $match = $&;
            }
            my ($i, $j, $k);
            for ('a' .. 'z', 'A' .. 'Z') {
                $i++ if $text2 =~ /$_/;
            }
            $j++ if $text3 =~ /quux/;
            $k++ if $text4 =~ /H/;
        },
    } );

    if ($text =~ /gotcha/) {
        print "unpack\n";
        print "prematch :", prematch($text), "\n";
        print "match :", match($text), "\n";
        print "postmatch :", postmatch($text), "\n";
        print "substring\n";
        print "prematch :", substr_prematch($text), "\n";
        print "match :", substr_match($text), "\n";
        print "postmatch :", substr_postmatch($text), "\n";
        print "vanilla\n";
        print "prematch :", $`, "\n";
        print "match :", $&, "\n";
        print "postmatch :", $', "\n";
    }

    sub prematch {
        return unpack "a$-[0]", $_[0];
    }
    sub postmatch {
        return unpack "x$+[0] a*", $_[0];
    }
    sub match {
        my $len = $+[0] - $-[0];
        unpack "x$-[0] a$len", $_[0];
    }
    sub substr_match {
        substr( $_[0], $-[0], $+[0] - $-[0] )
    }
    sub substr_prematch {
        substr( $_[0], 0, $-[0] )
    }
    sub substr_postmatch {
        substr( $_[0], $+[0] )
    }
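For readers less familiar with @- and @+, here is a minimal sketch of what the substr helpers in the listing compute. The sub name split_match is hypothetical and does not appear in the thread; $-[0] and $+[0] are the start and end offsets of the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the text before,
# inside, and after the first match of $re in $text, using the offsets
# in @- and @+ rather than $`, $& and $'.
sub split_match {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    my ($start, $end) = ($-[0], $+[0]);   # copy the offsets before any other match runs
    return (
        substr($text, 0, $start),              # prematch
        substr($text, $start, $end - $start),  # match
        substr($text, $end),                   # postmatch
    );
}

my ($pre, $match, $post) = split_match('abcgotchaxyz', qr/gotcha/);
print "prematch  :$pre\n";
print "match     :$match\n";
print "postmatch :$post\n";
```

Returning all three pieces from one call means the offsets are read immediately after the match, so a later regexp cannot overwrite them in between.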

Does this mean:

1 The much decried $`, $& and $' need to be rehabilitated?

2 There's a benchmarking problem?

3 I've missed something (Most likely reply :-)?

thanks

dave

Re: Re: Re: RegExps, Prematch and Postmatch without efficiency penalty
by bart (Canon) on Sep 14, 2003 at 16:00 UTC
    You have a benchmark problem.

    You see, the most vile aspect of $`, $& and $' is the fact that they affect every single regexp in your script or in any module used. So if you benchmark them against other approaches that also use regular expressions, you'll slow those down, too. Conclusion: if you want to compare approaches, you should benchmark them in separate scripts.

    You should also compare the speed of matches, especially on extremely long strings, with and without any of $`, $& and $' being mentioned anywhere in your script. That is where it matters: not where you actually make use of them, but everywhere else.
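A rough sketch of the separate-scripts idea follows. Everything in it is an assumption made for illustration, not code from the thread: the variant names, the string length and the iteration count are arbitrary. Each variant is written to its own temporary file and run in its own perl process, so a mention of $& in one cannot affect the other. Note that on Perl 5.20 and later, copy-on-write removes most of the penalty, so the two timings may come out nearly identical there.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Body shared by both child scripts: time repeated matches on a long string.
# String length and iteration count are arbitrary assumptions.
my $body = <<'PERL';
use Time::HiRes qw(time);
my $text = ('abc' x 10_000) . 'gotcha' . ('xyz' x 10_000);
my $hits = 0;
my $t0 = time;
for (1 .. 5_000) { $hits++ if $text =~ /gotcha/ }
printf "%.3f\n", time - $t0;
PERL

# The only difference between the variants is one line that mentions $&
# inside a sub that is never called.
my %script = (
    clean    => $body,
    uses_amp => 'my $unused = sub { $& };' . "\n" . $body,
);

# Write one variant to a temp file, run it with the current perl binary,
# and return the elapsed seconds it printed.
sub time_variant {
    my ($name) = @_;
    my ($fh, $file) = tempfile(SUFFIX => '.pl', UNLINK => 1);
    print {$fh} $script{$name};
    close $fh or die "can't close $file: $!";
    open my $ph, '-|', $^X, $file or die "can't run $file: $!";
    my $out = <$ph>;
    close $ph;
    $out = '' unless defined $out;
    chomp $out;
    return $out;
}

printf "%-8s %s s\n", $_, time_variant($_) for sort keys %script;
```

Single runs like this are noisy, so it is worth repeating each variant several times before drawing conclusions.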

      ...you should benchmark them in separate scripts.

      This is similar to the problem I had trying to benchmark the memory footprint of threaded applications. See Benchmark::Thread::Size for an example of how I embedded scripts within scripts to get a clean memory benchmark. I think a similar (probably simpler) approach could be taken to benchmark alternate approaches to $`, $& and $'.

      Liz

        for an example of how I embedded scripts within scripts to get a clean memory benchmark.

        I started work on a generic tool to do this as part of my OO::Benchmark stuff. Never got around to finishing it though.


        ---
        demerphq

        <Elian> And I do take a kind of perverse pleasure in having an OO assembly language...
