in reply to Re^5: Fast Replacement (0.01 seconds)
in thread Fast Replacement

Wait, please tell me I'm reading the results wrong.... In my benchmarks yours was faster. But in your benchmarks, "a", which is your algorithm, is taking 5.xx seconds per iteration, whereas "b", which is mine, is taking 0.17-0.9 seconds per iteration. Your benchmark seems to be showing the regexp approach winning by a landslide.


Dave

Replies are listed 'Best First'.
Re^7: Fast Replacement (0.01 seconds)
by BrowserUk (Patriarch) on Jun 14, 2013 at 22:13 UTC

    Using the eval subroutines was a step too far. Whilst much better than eval for every line, the additional subroutine call still has a substantial impact.

    Going back to hardcoded trs, and you get the picture we were both expecting:

    C:\test\ACA>..\junk71 -N=4 (warning: too few iterations for a reliable count) Rate b a b 1.45/s -- -98% a 85.1/s 5751% -- C:\test\ACA>..\junk71 -N=10 (warning: too few iterations for a reliable count) Rate b a b 2.94/s -- -97% a 91.7/s 3025% -- C:\test\ACA>..\junk71 -N=20 (warning: too few iterations for a reliable count) Rate b a b 4.55/s -- -96% a 116/s 2453% -- C:\test\ACA>..\junk71 -N=50 (warning: too few iterations for a reliable count) Rate b a b 6.67/s -- -95% a 133/s 1900% -- C:\test\ACA>..\junk71 -N=100 Rate b a b 7.93/s -- -95% a 149/s 1776% --

    Updated benchmark code:

    #! perl -slw use strict; use Benchmark qw[ cmpthese ]; sub makeTR{ eval "sub{ \$_[ 0 ] =~ tr[$_[0]][$_[1]] }"; } our $N //= 10; die "$N must be even and positive" if $N &1 or $N < 2; our $tr1 = makeTR( '!', "\n" ); our $tr2 = makeTR( "\n", '!' ); our $flag = 0; our $s = '1234!' x 55e3; cmpthese $N, { a => q[ if( $flag ) { my( $p, $c ) = ( 0, 50e3 ); 1 while --$c and $p = index $s, "\n", $p; $s =~ tr[\n][!]; #$tr2->( substr $s, 0, $p ); $flag ^= 1; } else { my( $p, $c ) = ( 0, 50e3 ); 1 while --$c and $p = index $s, "!", $p; $s =~ tr[!][\n]; #$tr1->( substr $s, 0, $p ); $flag ^= 1; } ], b => q[ if( $flag ) { $s =~ s/\n(??{ ( $myregexp::count++ < 50000 ) ? '' : '(?!) +' })/!/g; $flag ^= 1; } else { $s =~ s/!(??{ ( $myregexp::count++ < 50000 ) ? '' : '(?!)' + })/\n/g; $flag ^= 1; } ], };

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
    ,

      When I have a moment I want to check to see if this is a faster way of finding the 50k-th "!":

      if( $string =~ m/(?:[^!]*!){49999}[^!]*!/; ) { print $+[0] + 1, "\n" # Updated. }

      It *could* be, because it keeps all the work inside of Perl's internals. If Perl had a native function that said, "find_nth($string, '!')", it would beat any regexp solution, but it doesn't (at least not as a built-in).

      Update: "Quantifier in {,} bigger than 32766 in regex; marked by <-- HERE in m/(?:[^!]*!){ <-- HERE 49999}[^!]*!/ at mytest10.pl line 27." (I forgot about that.)

      Update 2: Assuming we're only dealing with ASCII, this is trivial using Inline::C. Walking the string and making the change up to 50k times will be trivial, and fast. If I get around to it I'll post an example.


      Dave

      Just for fun... I created an nth_index in C++ that sorta mimics the behavior of index. ..."sorta", in that it will return -1 if the nth string isn't found, and will assume an offset greater than haystack length is equal to the haystack size-1, or less than zero is zero. But where it will most obviously fall short is that it's only capable of dealing in standard "byte" sized chars; no Unicode.

      use Benchmark qw( cmpthese ); use Inline CPP => 'DATA'; $main::string = '1234!' x 55e3; sub cpp { my $copy = $main::string; my $ix = nth_index( $copy, '!', 50000 ); substr( $copy, 0, $ix ) =~ tr/!/\n/; return \$copy; } sub regex { my $copy = $main::string; $myregexp::count = 0; $copy =~ s/!(??{ ( $myregexp::count++ < 50000 ) ? '' : '(?!)' })/\n/ +g; return \$copy; } sub buk { my( $p, $c ) = ( 0, 50e3 ); my $copy = $main::string; 1 while --$c and $p = index $copy, '!', $p; substr( $copy, 0, $p ) =~ tr[!][\n]; return \$copy; } cmpthese ( -5, { cpp => \&cpp, regex => \&regex, buk => \&buk, } ); __DATA__ __CPP__ #include <string> long nth_index ( const char* c_str_haystack, const char* c_str_needle, long n, long offset = 0 ) { std::string haystack( c_str_haystack ); std::string needle( c_str_needle ); if( ! needle.size() || ! haystack.size() ) return -1; if( offset < 0 ) offset = 0; if( offset >= haystack.size() ) offset = haystack.size() - 1; size_t pos = haystack.find( needle, offset ); while( --n ) { pos = haystack.find( needle, pos + 1 ); if( pos == std::string::npos ) return -1; } return pos; }

      The results:

      $ ./mytest10.pl Rate regex buk cpp regex 10.0/s -- -97% -99% buk 291/s 2803% -- -63% cpp 777/s 7657% 167% --

      This goes back to the original benchmark where I just made copies. I understood the xor-flag method, but I already had this written and didn't bother to rewrite.

      There's a gross inefficiency in the C++ version, that if fixed, would greatly improve the speed of that solution. The biggest inefficiency is that I'm copying the c-strings that Perl passes into the subroutine into C++ "string" objects. This is a linear operation, and really totally unnecessary if I were motivated enough to forgo the "string::find" method and roll my own instead. If I were to do that I could just deal directly with C strings and avoid making another copy. But I wanted to play with C++ strings just to see how it would look.

      Nevertheless, the results are predictably good. Don't try this on Unicode strings though. And if the needle string has a length greater than one, we would have to be aware that my method will detect needles that overlap each other. If we wanted to avoid that the C++ function should add the length of the needle to "pos".

      The exercise was beneficial, as I discovered a bug in Inline::CPP's documentation on how to declare user typemaps, that I'll be able to fix in an upcoming release.


      Dave

        I came up with a similar I::C implementation. I'd run it against yours but yours doesn't compile on my machine at the moment:

        int indexNth( SV *haystack, SV*needle, int nth ) { STRLEN lh, ln, i, j; char *h = SvPV( haystack, lh ); char *n = SvPV( needle, ln ); for( i = 0; i < lh; ++i ) { for( j = 0; j < ln; ++j ) { if( h[ i + j ] != n[ j ] ) goto nomatch; } if( --nth == 0 ) return i; nomatch:; } return -1; }

        Got the C++ to compile -- missing newline at the end of the file, These are the results:

        C:\test>1039228.pl Rate regex buk cpp c regex 4.06/s -- -97% -99% -100% buk 137/s 3277% -- -75% -85% cpp 546/s 13345% 298% -- -40% c 915/s 22446% 568% 68% --

        And the additional test:

        sub c { my $copy = $main::string; my $ix = indexNth( $copy, '!', 50000 ); substr( $copy, 0, $ix ) =~ tr/!/\n/; return \$copy; }

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re^7: Fast Replacement (0.01 seconds)
by BrowserUk (Patriarch) on Jun 14, 2013 at 21:52 UTC

    Holy crap! You're right! (I saw what I was expecting to see :( )

    Unless there is some bug I haven't spotted, ...


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.