in reply to Difference Of Two Strings

Another implementation for you. This one seems pretty speedy (update: optimized and benchmarks updated with everyone else's updates):
sub fastolfe { my $source = shift; my $chop = shift; local($_); my %found; $found{$_}++ while ($_ = chop($source)) ne ''; while (($_ = chop($chop)) ne '') { return if --$found{$_} < 0; } my $result; foreach (sort keys %found) { $result .= $_ while $found{$_}--; } $result; }
Hybridizing the above with merlyn's version, we can squeeze out a bit more speed:
sub fast_merl { my ($source, $chop) = @_; local($_); my %found; $found{$_}++ while ($_ = chop($source)) ne ''; while (($_ = chop($chop)) ne '') { return if --$found{$_} < 0; } return join "", map { $_ x $found{$_} } keys %found; }
Benchmarking most of the versions I see thus far (minus a couple of the less interesting ones, because these benchmarks are getting big), using test input from above (2 success, 2 fail) as well as demerphq's tests below:
# demerphq's test set Rate blakem merlyn demq_scan demerphq fastolfe fast_merl +fast_c scan_c blakem 479/s -- -24% -39% -44% -46% -47% + -97% -98% merlyn 629/s 31% -- -20% -27% -29% -30% + -96% -97% demq_scan 790/s 65% 26% -- -8% -11% -12% + -95% -97% demerphq 861/s 80% 37% 9% -- -3% -4% + -95% -96% fastolfe 887/s 85% 41% 12% 3% -- -1% + -94% -96% fast_merl 901/s 88% 43% 14% 5% 2% -- + -94% -96% fast_c 15708/s 3179% 2396% 1889% 1723% 1671% 1644% + -- -31% scan_c 22648/s 4627% 3499% 2767% 2529% 2453% 2415% + 44% -- # simple success case Rate blakem merlyn demerphq fastolfe fast_merl demq_scan +fast_c scan_c blakem 6244/s -- -14% -26% -32% -35% -43% + -90% -93% merlyn 7221/s 16% -- -14% -21% -24% -34% + -89% -92% demerphq 8429/s 35% 17% -- -8% -12% -23% + -87% -91% fastolfe 9181/s 47% 27% 9% -- -4% -16% + -86% -90% fast_merl 9563/s 53% 32% 13% 4% -- -12% + -85% -90% demq_scan 10908/s 75% 51% 29% 19% 14% -- + -83% -88% fast_c 65634/s 951% 809% 679% 615% 586% 502% + -- -28% scan_c 91428/s 1364% 1166% 985% 896% 856% 738% + 39% -- # simple failure case Rate blakem merlyn demerphq demq_scan fastolfe fast_merl + scan_c fast_c blakem 7759/s -- -39% -51% -60% -63% -63% + -94% -94% merlyn 12666/s 63% -- -20% -35% -39% -40% + -89% -91% demerphq 15783/s 103% 25% -- -19% -24% -26% + -87% -89% demq_scan 19581/s 152% 55% 24% -- -5% -8% + -84% -86% fastolfe 20720/s 167% 64% 31% 6% -- -2% + -83% -85% fast_merl 21209/s 173% 67% 34% 8% 2% -- + -82% -85% scan_c 119642/s 1442% 845% 658% 511% 477% 464% + -- -14% fast_c 139171/s 1694% 999% 782% 611% 572% 556% + 16% --
Source: http://fastolfe.net/transient/2001/11/02/pm.string.difference.bench

Replies are listed 'Best First'.
Re: Re: Difference Of Two Strings (Benchmarks 2)
by demerphq (Chancellor) on Nov 03, 2001 at 08:01 UTC
    UPDATED to reflect Fastolfes reply.

    Heh. While I was setting up my benchmark there everyone else was as well. I came up with a different set of results. First jungleboy, runrig and yuckfoo failed outright on some of my test cases. Second both merlyn and fast_merl failed because the resulting set of letters are out of order, but when I put a sort clause in as was recommended they passed fine. Yours was the fastest. Only yours, mine and blakem succeded outright.

    # Updated: Removed unecessary debug code, minor tidy. use strict; use warnings; use Benchmark 'cmpthese'; #----------------------------------------------------------- our %subs = ( yuckfoo => sub { my ( $full, $part ) = @_; my $left = ""; my ($regx) = join ( '+.*', ( split ( '', $part ) ) ); if ( $full =~ m{$regx} ) { my (@fulls) = split ( '', $full ); my (@parts) = split ( '', $part ); while ( my $ch = shift (@fulls) ) { ( @parts && $ch eq $parts[0] ) ? shift (@parts) : ( $left .= $ch ); } } else { $left = "__undef__"; } return $left; }, jungleboy => sub { my ( $full, $part ) = @_; my ($ch); my (@parts) = split ( '', $part ); my ($regx) = join ( '+.*', @parts ); if ( $full =~ m{$regx} ) { foreach $ch (@parts) { $full =~ s/[$ch]{1}//; } } else { $full = "__undef__"; } return $full; }, demerphq => sub { my ( $from, $to ) = @_; my %ltrs; $ltrs{ substr( $from, $_, 1 ) }++ foreach 0 .. length($from) - + 1; --$ltrs{ substr( $to, $_, 1 ) } < 0 && return "__undef__" foreach 0 .. length($to) - 1; return join ( "", sort map { $_ x $ltrs{$_} } keys %ltrs ); }, demq_scan => sub { my ( $from, $to ) = @_; my $ret = ""; my ( $f, $t ) = ( 0, 0 ); while (1) { my ( $fc, $tc ) = ( substr( $from, $f, 1 ), substr( $to, $ +t, 1 ) ); if ( $fc eq $tc ) { $t++; $f++; if ( substr( $to, $t, 1 ) ne $tc ) { $f++, $ret .= $fc while substr( $from, $f, 1 ) eq +$fc; } last if $t == length($to); } elsif ( $fc lt $tc ) { $ret .= $fc; $f++; return "__undef__" if $f >= length $from; } else { return "__undef__"; } } return $ret . substr( $from, $f ); }, blakem => sub { my $string = shift; my $letters = shift; $string =~ s/$_// || return "__undef__" for split // => $lette +rs; return $string; }, fastolfe => sub { my $source = shift; my $chop = shift; local ($_); my %found; $found{$_}++ while ( $_ = chop($source) ) ne ''; while ( ( $_ = chop($chop) ) ne '' ) { return "__undef__" if --$found{$_} < 0; } my $result = ""; #fixed demerphq foreach ( sort keys %found ) { $result .= $_ while $found{$_}--; } $result; }, merlyn => sub { my ( $full, $part ) = @_; my %count; $count{$_}++ for split //, $full; for ( split //, $part ) { return "__undef__" if --$count{$_} < 0; } return join "", map { $_ x $count{$_} } sort keys %count; }, fast_merl => sub { my ( $source, $chop ) = @_; local ($_); my %found; $found{$_}++ while ( $_ = chop($source) ) ne ''; while ( ( $_ = chop($chop) ) ne '' ) { return "__undef__" if --$found{$_} < 0; } return join "", map { $_ x $found{$_} } sort keys %found; } # Not sure why it fails.. # runrig => sub { # my $str = shift; # my $letters = shift; # my ( %hash1, %hash2 ); # my @arr1 = split //, $str; # @hash1{@arr1} = @arr1; # my @arr2 = split //, $letters; # my @deleted = grep $_, delete @hash1{@arr2}; # @hash2{@arr2} = @arr2; # delete @hash2{@deleted}; # return "__undef__" if %hash2; # join '', keys %hash1; # }, ); sub test { my ($sub) = @_; foreach my $t ( # tests [0] - [1] = [2] [qw"ab a b"], [qw"acciimmnnnnoootu acimnotu cimnnnoo"], [qw"acciimmnnnnoootu acimnotuu __undef__"], [qw"ab ab", ""], [qw"aaaaaaabbbbbbbcccccccddddddddde e aaaaaaabbbbbbbcccccccddd +dddddd"], [qw"aaaaaaabbbbbbbcccccccddddddddde aaaaaaabbbbbbbcccccccddddd +dddd e"], [qw"aaaaaaabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxyz aaaaa +aabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxy z"], [qw"aaaaaaabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxyz aaaaa +bbbbbbcccccddddddddghijklmnopqrstuvwxy aabccdefz"], [qw"abbcccdde bccd abcde"], [qw"abbcccdde bccdf __undef__"], [qw"bbcccdde abccdf __undef__"], #uncomment me to kill the regex versions #[qw"aaaaaaabbbbbbbcccccccdddddddddefghijklmnopqrstuvwxz aaaaa +bbbbbbcccccddddddddghijklmnopqrstuvwxy __undef__"], ) { #warn "$sub @$t\n"; my $r = $subs{$sub}->( $t->[0], $t->[1] ); $r = ( !defined($r) ? "undef" : $r ); #warn "'$r'\n"; die "in $sub expected $t->[2] got $r from $t->[0] - $t->[1]" if $t->[2] ne $r; } } my %tests = map { $_ => "test($_)" } keys %subs; cmpthese( -10, \%tests ); __END__
    Benchmark: running blakem, demerphq, demerphq_scan, fast_merl, fastolfe, jungleboy, merlyn, yuckfoo, 
    each for at least 10 CPU seconds...
        blakem: 11 wallclock secs (10.52 usr +  0.00 sys = 10.52 CPU) @ 497.48/s (n=5231)
      demerphq: 11 wallclock secs (10.72 usr +  0.00 sys = 10.72 CPU) @ 755.88/s (n=8100)
     demq_scan: 11 wallclock secs (10.57 usr +  0.00 sys = 10.57 CPU) @ 731.63/s (n=7737)
     fast_merl: 10 wallclock secs (10.61 usr +  0.00 sys = 10.61 CPU) @ 768.34/s (n=8149)
      fastolfe: 10 wallclock secs (10.55 usr +  0.00 sys = 10.55 CPU) @ 779.23/s (n=8217)
     jungleboy: 10 wallclock secs (10.18 usr +  0.00 sys = 10.18 CPU) @ 20.03/s (n=204)
        merlyn: 10 wallclock secs (10.56 usr +  0.00 sys = 10.56 CPU) @ 595.40/s (n=6285)
       yuckfoo: 10 wallclock secs (10.50 usr +  0.00 sys = 10.50 CPU) @ 20.10/s (n=211)
                    Rate jungleboy yuckfoo blakem merlyn demerphq_scan demerphq fast_merl fastolfe
    jungleboy     20.0/s        --     -0%   -96%   -97%          -97%     -97%      -97%     -97%
    yuckfoo       20.1/s        0%      --   -96%   -97%          -97%     -97%      -97%     -97%
    blakem         497/s     2383%   2374%     --   -16%          -32%     -34%      -35%     -36%
    merlyn         595/s     2872%   2861%    20%     --          -19%     -21%      -23%     -24%
    demq_scan      732/s     3552%   3539%    47%    23%            --      -3%       -5%      -6%
    demerphq       756/s     3673%   3660%    52%    27%            3%       --       -2%      -3%
    fast_merl      768/s     3736%   3722%    54%    29%            5%       2%        --      -1%
    fastolfe       779/s     3790%   3776%    57%    31%            7%       3%        1%       --
    

    Yves / DeMerphq
    --
    Have you registered your Name Space?

      Yah I took out the sort in fast_merl because the requirements didn't state the returned string had to be sorted, just that the input was. The sort has a very slight performance penalty. It's pretty easy to put back.

      Good set of exhaustive tests.

Re: Re: Difference Of Two Strings (complete benchmarks)
by YuckFoo (Abbot) on Nov 07, 2001 at 04:27 UTC
    Thanks for the chop modification. Nice touch. It seems to be at least 10-20% faster than split, I'll take it! I also appreciate the extensive bench compliation. Very interesting, and very informative.