Re^8: Fast Replacement (0.01 seconds)

Just for fun... I created an nth_index in C++ that sorta mimics the behavior of index. ..."sorta", in that it will return -1 if the nth string isn't found, and will assume an offset greater than haystack length is equal to the haystack size-1, or less than zero is zero. But where it will most obviously fall short is that it's only capable of dealing in standard "byte" sized chars; no Unicode.

use Benchmark qw( cmpthese );

use Inline CPP => 'DATA';

$main::string = '1234!' x 55e3;

sub cpp {
  my $copy = $main::string;
  my $ix = nth_index( $copy, '!', 50000 );
  substr( $copy, 0, $ix ) =~ tr/!/\n/;
  return \$copy;
}

sub regex {
  my $copy = $main::string;
  $myregexp::count = 0;
  $copy =~ s/!(??{ ( $myregexp::count++ < 50000 ) ? '' : '(?!)' })/\n/
+g;
  return \$copy;
}

sub buk {
  my( $p, $c ) = ( 0, 50e3 );
  my $copy = $main::string;
  1 while --$c and $p = index $copy, '!', $p;
  substr( $copy, 0, $p ) =~ tr[!][\n];
  return \$copy;
}
  

cmpthese ( -5, {
  cpp   => \&cpp,
  regex => \&regex,
  buk   => \&buk,
} );

__DATA__
__CPP__

#include <string>

long
  nth_index (
    const char* c_str_haystack,
    const char* c_str_needle,
    long n,
    long offset = 0
  ) {
    std::string haystack( c_str_haystack );
    std::string needle( c_str_needle );
    
    if( ! needle.size() || ! haystack.size() ) return -1;
    if( offset < 0 ) offset = 0;
    if( offset >= haystack.size() ) offset = haystack.size() - 1;
    size_t pos = haystack.find( needle, offset );
    while( --n ) {
      pos = haystack.find( needle, pos + 1 );
      if( pos == std::string::npos ) return -1;
    }
    return pos;
  }
[download]

The results:

$ ./mytest10.pl
        Rate regex   buk   cpp
regex 10.0/s    --  -97%  -99%
buk    291/s 2803%    --  -63%
cpp    777/s 7657%  167%    --
[download]

This goes back to the original benchmark where I just made copies. I understood the xor-flag method, but I already had this written and didn't bother to rewrite.

There's a gross inefficiency in the C++ version, that if fixed, would greatly improve the speed of that solution. The biggest inefficiency is that I'm copying the c-strings that Perl passes into the subroutine into C++ "string" objects. This is a linear operation, and really totally unnecessary if I were motivated enough to forgo the "string::find" method and roll my own instead. If I were to do that I could just deal directly with C strings and avoid making another copy. But I wanted to play with C++ strings just to see how it would look.

Nevertheless, the results are predictably good. Don't try this on Unicode strings though. And if the needle string has a length greater than one, we would have to be aware that my method will detect needles that overlap each other. If we wanted to avoid that the C++ function should add the length of the needle to "pos".

The exercise was beneficial, as I discovered a bug in Inline::CPP's documentation on how to declare user typemaps, that I'll be able to fix in an upcoming release.

Dave

Comment on Re^8: Fast Replacement (0.01 seconds) Select or Download Code

Replies are listed 'Best First'.
Re^9: Fast Replacement (0.01 seconds) by BrowserUk (Patriarch) on Jun 16, 2013 at 18:39 UTC
I came up with a similar I::C implementation. I'd run it against yours but yours doesn't compile on my machine at the moment: `int indexNth( SV haystack, SVneedle, int nth ) { STRLEN lh, ln, i, j; char h = SvPV( haystack, lh ); char n = SvPV( needle, ln ); for( i = 0; i < lh; ++i ) { for( j = 0; j < ln; ++j ) { if( h[ i + j ] != n[ j ] ) goto nomatch; } if( --nth == 0 ) return i; nomatch:; } return -1; }` [download] Got the C++ to compile -- missing newline at the end of the file, These are the results: `C:\test>1039228.pl Rate regex buk cpp c regex 4.06/s -- -97% -99% -100% buk 137/s 3277% -- -75% -85% cpp 546/s 13345% 298% -- -40% c 915/s 22446% 568% 68% --` [download] And the additional test: `sub c { my $copy = $main::string; my $ix = indexNth( $copy, '!', 50000 ); substr( $copy, 0, $ix ) =~ tr/!/\n/; return \$copy; }` [download] With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re^9: Fast Replacement (0.01 seconds)
by BrowserUk (Patriarch) on Jun 16, 2013 at 18:39 UTC

I came up with a similar I::C implementation. I'd run it against yours but yours doesn't compile on my machine at the moment:

int indexNth( SV *haystack, SV*needle, int nth ) {
    STRLEN lh, ln, i, j;
    char *h = SvPV( haystack, lh );
    char *n = SvPV( needle, ln );

    for( i = 0; i < lh; ++i ) {
        for( j = 0; j < ln; ++j ) {
            if( h[ i + j ] != n[ j ] ) goto nomatch;
        }
        if( --nth == 0 ) return i;
        nomatch:;
    }
    return -1;
}
[download]

Got the C++ to compile -- missing newline at the end of the file, These are the results:

C:\test>1039228.pl
        Rate  regex    buk    cpp      c
regex 4.06/s     --   -97%   -99%  -100%
buk    137/s  3277%     --   -75%   -85%
cpp    546/s 13345%   298%     --   -40%
c      915/s 22446%   568%    68%     --
[download]

And the additional test:

sub c {
  my $copy = $main::string;
  my $ix = indexNth( $copy, '!', 50000 );
  substr( $copy, 0, $ix ) =~ tr/!/\n/;
  return \$copy;
}
[download]

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

[reply]
[d/l]
[select]