Re: merging strings (xor at char level?)

I really should generate my own huge sample strings to test with, but given your sample input, here are how four implementations benchmark (two of them are mine, one is from moritz, and one is yours).

Implementations check OK.
             Rate   csebe  moritz resubst bitwise
csebe     81457/s      --    -78%    -88%    -92%
moritz   377674/s    364%      --    -45%    -63%
resubst  686400/s    743%     82%      --    -33%
bitwise 1030128/s   1165%    173%     50%      --
[download]

And here's the benchmark code:

use Benchmark 'cmpthese';

our $str1 = "12345 ABC 987 MNO";
our $str2 = "      CDE";



sub merge_csebe {
  my( $s1, $s2 ) = map { [ split //, $_ ] } @_;
  for ( my $i = 0; $i < @$s1; $i++ ) {
    if( $s2->[$i]) {
      if( ( $s1->[$i] ne q{ } ) && ( $s2->[$i] eq q{ } ) ) {
        $s2->[$i] = $s1->[$i];
      }
    }
    else {
      $s2->[$i] = $s1->[$i];
    }
  }
  return join q{}, @$s2;
}



sub merge_bitwise {
  my $s2 = $_[1] =~ tr/ /\x{0}/r;
  return $s2 | $_[0] & ( "\x{ff}" x length $_[0] ^ $s2 =~ tr/\x{00}/\x
+{FF}/cr );
}



sub merge_resubstr {
  my $s1 = shift;
  while( $_[0] =~ m/(\S+)/g ) {
    substr( $s1, pos($_[0])-length($1), length($1) ) = $1;
  }
  return $s1;
}



sub merge_moritz {
  my $pos = 0;
  my $res = '';

  for my $chunk (split /([^ ]+)/, $_[1]) {
      if (substr($chunk, 0, 1) eq ' ') {
          $res .= substr $_[0], $pos, length $chunk;
      }
      else {
          $res .= $chunk;
      }
  }
  continue {
      $pos += length $chunk;
  }

  if ($pos < length $_[0]) {
      $res .= substr $_[0], $pos;
  }
  return $res;
}

print "Implementations check OK.\n"
  if(  4 == grep { $_ eq '12345 CDE 987 MNO' }
              merge_bitwise($str1,$str2),
              merge_resubstr($str1,$str2),
              merge_moritz($str1,$str2),
              merge_csebe($str1,$str2),
  );

cmpthese ( -5, {
  bitwise => sub { my $rv = \&merge_bitwise ($main::str1,$main::str2) 
+},
  resubst => sub { my $rv = \&merge_resubstr($main::str1,$main::str2) 
+},
  moritz  => sub { my $rv = \&merge_moritz  ($main::str1,$main::str2) 
+},
  csebe   => sub { my $rv = \&merge_csebe   ($main::str1,$main::str2) 
+},
} );
[download]

The bitwise version is a pretty clear winner for the short test input. I think a 10x increase is probably a worthwhile optimization, if it holds true for long strings too.

Update: I found a little time to test larger strings. For strings that are about 3.5MB, the results are really compelling in favor of the bitwise method:

# String length is 3515KB.
ok 1 - Bitwise
ok 2 - Regex/Substr
ok 3 - moritz
ok 4 - csebe
1..4
           Rate   csebe  moritz resubst bitwise
csebe   0.376/s      --    -91%    -95%    -99%
moritz   4.11/s    995%      --    -46%    -94%
resubst  7.63/s   1931%     86%      --    -89%
bitwise  69.1/s  18302%   1581%    806%      --
[download]

And here is the updated benchmark code:

use Benchmark 'cmpthese';
use Test::More;

use constant DURATION => 20; # seconds.
use constant LENGTH   => 200000; # repititions of the sample strings.

our $str1 = join q{ }, ("12345 ABC 987 MNO") x LENGTH;
our $str2 = join q{ }, ("      CDE        ") x LENGTH;
our $want = join q{ }, ("12345 CDE 987 MNO") x LENGTH;

diag "String length is ", sprintf( "%d",length($str1)/1024), "KB.";

sub merge_csebe {
  my( $s1, $s2 ) = map { [ split //, $_ ] } @_;
  for ( my $i = 0; $i < @$s1; $i++ ) {
    if( $s2->[$i]) {
      if( ( $s1->[$i] ne q{ } ) && ( $s2->[$i] eq q{ } ) ) {
        $s2->[$i] = $s1->[$i];
      }
    }
    else {
      $s2->[$i] = $s1->[$i];
    }
  }
  return join q{}, @$s2;
}

sub merge_bitwise {
  my $s2 = $_[1] =~ tr/ /\x{0}/r;
  return $s2 | $_[0] & ( "\x{ff}" x length $_[0] ^ $s2 =~ tr/\x{00}/\x
+{FF}/cr );
}

sub merge_resubstr {
  my $s1 = shift;
  while( $_[0] =~ m/(\S+)/g ) {
    substr( $s1, pos($_[0])-length($1), length($1) ) = $1;
  }
  return $s1;
}

sub merge_moritz {
  my $pos = 0;
  my $res = '';

  for my $chunk (split /([^ ]+)/, $_[1]) {
      if (substr($chunk, 0, 1) eq ' ') {
          $res .= substr $_[0], $pos, length $chunk;
      }
      else {
          $res .= $chunk;
      }
  }
  continue {
      $pos += length $chunk;
  }

  if ($pos < length $_[0]) {
      $res .= substr $_[0], $pos;
  }
  return $res;
}


is( merge_bitwise ($str1,$str2), $want, "Bitwise"      );
is( merge_resubstr($str1,$str2), $want, "Regex/Substr" );
is( merge_moritz  ($str1,$str2), $want, "moritz"       );
is( merge_csebe   ($str1,$str2), $want, "csebe"        );

done_testing();

cmpthese ( -DURATION(), {
  bitwise => sub { my $rv = \&merge_bitwise ($main::str1,$main::str2) 
+},
  resubst => sub { my $rv = \&merge_resubstr($main::str1,$main::str2) 
+},
  moritz  => sub { my $rv = \&merge_moritz  ($main::str1,$main::str2) 
+},
  csebe   => sub { my $rv = \&merge_csebe   ($main::str1,$main::str2) 
+},
} );
[download]

An 180x improvement over the original method, at the cost of being almost illegible. ;) Putting it in perspective, your original solution would take about 2.7 seconds to process a 3.5mb string. In that time the bitwise solution would process 186 strings of that size, or 651MB of data in all. In such case, the application would probably become IO bound.

Dave

Comment on Re: merging strings (xor at char level?) Select or Download Code

Replies are listed 'Best First'.
Re^2: merging strings (xor at char level?) by csebe (Initiate) on Feb 27, 2015 at 11:06 UTC
Jeez, I feel really bad for not visiting this page again since original posting. The site failed somehow to announce me that I got replies to my topic (or I might have had problems with my email?), but still, it looks terrible... Excellent idea and testing work Dave. Also thanks Moritz. Again, sorry for being an assh0le. Cheers, Lian	[reply]
Re^3: merging strings (xor at char level?) by choroba (Cardinal) on Feb 28, 2015 at 12:31 UTC
There are no email announcments. You get in-site personal messages, though. لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply]