Re^2: Regex for Differentiating Underscore and Whitespace

In another response, grinder scrutinized your solution to the problem and offered a more efficient one based on the index() function, without any regular expressions.

I personally believe that the claim about efficiency is not correct, since that kind of regex should get optimized to an index-like search anyway, and regexen often have a more immediately readable syntax. For a Perl programmer, that is...
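One way to look at what perl does with a literal pattern like this is the core re pragma's debug mode, which dumps the compiled regex program and the optimizer's notes to STDERR. The following is only a minimal sketch: the sample string and variable names are made up for illustration, and the debug output is not reproduced here because its exact wording differs between perl versions.

```perl
use strict;
use warnings;

my $str = 'foo_bar baz';    # made-up sample string
my $hit;
{
    # Lexically scoped: only regexes compiled inside this block are traced.
    use re 'debug';
    $hit = $str =~ /_/ ? 1 : 0;
}
print "matched: $hit\n";
```

Running it and reading the trace on STDERR should show whether the pattern is handled as a fixed-substring search on the perl at hand.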

I hope that the following minimal benchmark can shed some light:

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw/cmpthese :hireswallclock/;

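# Build 1000 strings of 1000 random word characters each; in about half
# of them every underscore is replaced by a space.
my @a = do {
    my @chr = grep /\w/, map chr, 1..255;
    map {
        local $_ = join '', map $chr[rand @chr], 1..1000;
        tr/_/ / if .5 < rand;
        $_;
    } 1..1000;
};

# Both candidates select the strings that contain no underscore.
cmpthese 5000 => { Regex => sub () { grep !/_/, @a },
                   Index => sub () { grep index($_, '_') < 0, @a } };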

__END__

I get e.g.

C:\temp>perl index.pl
       Rate Index Regex
Index 891/s    --   -0%
Regex 891/s    0%    --

and

blazar@perlmonk ~ $ perl index.pl
       Rate Index Regex
Index 261/s    --   -0%
Regex 262/s    0%    --

on two different systems.

Now, is this test flawed? I easily tend to get these kinda things wrong, I must admit...
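One part of that question can be checked cheaply: whether the two candidates select the same strings at all. The sketch below reuses the data construction from the benchmark above at a smaller size (200 strings of 200 characters, an arbitrary choice) and compares the two result lists. It says nothing about timing; it only rules out the case where the two subs do different work.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same construction as in the benchmark, only smaller: random word
# characters, with underscores blanked out in about half of the strings.
my @chr = grep /\w/, map chr, 1..255;
my @a = map {
    my $s = join '', map $chr[rand @chr], 1..200;
    $s =~ tr/_/ / if .5 < rand;
    $s;
} 1..200;

my @by_regex = grep !/_/, @a;
my @by_index = grep index($_, '_') < 0, @a;

# Both filters should keep exactly the same strings, in the same order.
my $same = @by_regex == @by_index
    && !grep { $by_regex[$_] ne $by_index[$_] } 0..$#by_regex;

print $same ? "filters agree\n" : "filters disagree\n";
printf "%d of %d strings have no underscore\n", scalar @by_regex, scalar @a;
```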


Replies are listed 'Best First'.

Re^3: Regex for Differentiating Underscore and Whitespace
by mwah (Hermit) on Nov 04, 2007 at 14:09 UTC

blazar: Now, is this test flawed?

You are basically correct here. I was too zealous in advertising the advantages of index() and tr//. They have their place elsewhere, but not in this special case. Thanks for pointing this out.

I abused your benchmark code (of course) to find out how good the index() optimization in Perl5 really is ;-)

...
use Benchmark qw/cmpthese :hireswallclock/;

my @a = map { my $s='PM is cool, ' x 10_000;
              substr($s, rand(length $s), 1, '_');
              $s
            } 1..1000;

cmpthese -3 => {
                C_Idx => sub () { grep C_Idx($_, '_') < 0, @a },
                Index => sub () { grep index($_, '_') < 0, @a },
                Regex => sub () { grep ! /_/,    @a },
                Tr    => sub () { grep ! tr/_//, @a }
               };

use Inline C => qq[
 int C_Idx(SV* src, SV* chr)
{
 STRLEN srclen, chrlen;
 char *ssrc = SvPV(src, srclen), *schr = SvPV(chr, chrlen);
 char *p = ssrc;
 if( chrlen != 1 ) croak("single characters only for now!");
 return (p=memchr(p, *schr, srclen)) != NULL ? p-ssrc : -1;
}
];
...

On my system, for string lengths above roughly 60-70K, index() falls behind the C library function for finding a character (memchr). For the above strings:

        Rate    Tr Regex Index C_Idx
Tr    3.17/s    --  -74%  -74%  -87%
Regex 12.2/s  284%    --   -0%  -52%
Index 12.2/s  285%    0%    --  -52%
C_Idx 25.2/s  696%  107%  107%    --
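When reading the table, it helps to keep in mind that the three pure-Perl candidates do not answer quite the same question: index() reports the offset of the first match, the regex only reports whether there is a match, and tr/_// counts every underscore and therefore has to look at the whole string. That may contribute to Tr trailing here, though this is a guess and not something the numbers above prove. A small sketch on a short stand-in string (the offsets 7 and 20 are arbitrary):

```perl
use strict;
use warnings;

my $s = 'PM is cool, ' x 3;         # short stand-in for the long test string
substr($s, 7, 1, '_');              # plant one underscore at offset 7
substr($s, 20, 1, '_');             # and a second one at offset 20

my $pos   = index($s, '_');         # offset of the first '_' (or -1)
my $found = $s =~ /_/ ? 1 : 0;      # true/false only
my $count = ($s =~ tr/_//);         # number of '_' in the whole string

print "index: $pos, regex: $found, tr: $count\n";
# prints "index: 7, regex: 1, tr: 2"
```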

I personally believe it'd be much better if I read my own posts and thought about their assumptions much more thoroughly next time ;-)

Regards

mwa
