I really should generate my own huge sample strings to test with, but given your sample input, here are how four implementations benchmark (two of them are mine, one is from moritz, and one is yours).

Implementations check OK. Rate csebe moritz resubst bitwise csebe 81457/s -- -78% -88% -92% moritz 377674/s 364% -- -45% -63% resubst 686400/s 743% 82% -- -33% bitwise 1030128/s 1165% 173% 50% --

And here's the benchmark code:

use Benchmark 'cmpthese'; our $str1 = "12345 ABC 987 MNO"; our $str2 = " CDE"; sub merge_csebe { my( $s1, $s2 ) = map { [ split //, $_ ] } @_; for ( my $i = 0; $i < @$s1; $i++ ) { if( $s2->[$i]) { if( ( $s1->[$i] ne q{ } ) && ( $s2->[$i] eq q{ } ) ) { $s2->[$i] = $s1->[$i]; } } else { $s2->[$i] = $s1->[$i]; } } return join q{}, @$s2; } sub merge_bitwise { my $s2 = $_[1] =~ tr/ /\x{0}/r; return $s2 | $_[0] & ( "\x{ff}" x length $_[0] ^ $s2 =~ tr/\x{00}/\x +{FF}/cr ); } sub merge_resubstr { my $s1 = shift; while( $_[0] =~ m/(\S+)/g ) { substr( $s1, pos($_[0])-length($1), length($1) ) = $1; } return $s1; } sub merge_moritz { my $pos = 0; my $res = ''; for my $chunk (split /([^ ]+)/, $_[1]) { if (substr($chunk, 0, 1) eq ' ') { $res .= substr $_[0], $pos, length $chunk; } else { $res .= $chunk; } } continue { $pos += length $chunk; } if ($pos < length $_[0]) { $res .= substr $_[0], $pos; } return $res; } print "Implementations check OK.\n" if( 4 == grep { $_ eq '12345 CDE 987 MNO' } merge_bitwise($str1,$str2), merge_resubstr($str1,$str2), merge_moritz($str1,$str2), merge_csebe($str1,$str2), ); cmpthese ( -5, { bitwise => sub { my $rv = \&merge_bitwise ($main::str1,$main::str2) +}, resubst => sub { my $rv = \&merge_resubstr($main::str1,$main::str2) +}, moritz => sub { my $rv = \&merge_moritz ($main::str1,$main::str2) +}, csebe => sub { my $rv = \&merge_csebe ($main::str1,$main::str2) +}, } );

The bitwise version is a pretty clear winner for the short test input. I think a 10x increase is probably a worthwhile optimization, if it holds true for long strings too.

Update: I found a little time to test larger strings. For strings that are about 3.5MB, the results are really compelling in favor of the bitwise method:

# String length is 3515KB. ok 1 - Bitwise ok 2 - Regex/Substr ok 3 - moritz ok 4 - csebe 1..4 Rate csebe moritz resubst bitwise csebe 0.376/s -- -91% -95% -99% moritz 4.11/s 995% -- -46% -94% resubst 7.63/s 1931% 86% -- -89% bitwise 69.1/s 18302% 1581% 806% --

And here is the updated benchmark code:

use Benchmark 'cmpthese'; use Test::More; use constant DURATION => 20; # seconds. use constant LENGTH => 200000; # repititions of the sample strings. our $str1 = join q{ }, ("12345 ABC 987 MNO") x LENGTH; our $str2 = join q{ }, (" CDE ") x LENGTH; our $want = join q{ }, ("12345 CDE 987 MNO") x LENGTH; diag "String length is ", sprintf( "%d",length($str1)/1024), "KB."; sub merge_csebe { my( $s1, $s2 ) = map { [ split //, $_ ] } @_; for ( my $i = 0; $i < @$s1; $i++ ) { if( $s2->[$i]) { if( ( $s1->[$i] ne q{ } ) && ( $s2->[$i] eq q{ } ) ) { $s2->[$i] = $s1->[$i]; } } else { $s2->[$i] = $s1->[$i]; } } return join q{}, @$s2; } sub merge_bitwise { my $s2 = $_[1] =~ tr/ /\x{0}/r; return $s2 | $_[0] & ( "\x{ff}" x length $_[0] ^ $s2 =~ tr/\x{00}/\x +{FF}/cr ); } sub merge_resubstr { my $s1 = shift; while( $_[0] =~ m/(\S+)/g ) { substr( $s1, pos($_[0])-length($1), length($1) ) = $1; } return $s1; } sub merge_moritz { my $pos = 0; my $res = ''; for my $chunk (split /([^ ]+)/, $_[1]) { if (substr($chunk, 0, 1) eq ' ') { $res .= substr $_[0], $pos, length $chunk; } else { $res .= $chunk; } } continue { $pos += length $chunk; } if ($pos < length $_[0]) { $res .= substr $_[0], $pos; } return $res; } is( merge_bitwise ($str1,$str2), $want, "Bitwise" ); is( merge_resubstr($str1,$str2), $want, "Regex/Substr" ); is( merge_moritz ($str1,$str2), $want, "moritz" ); is( merge_csebe ($str1,$str2), $want, "csebe" ); done_testing(); cmpthese ( -DURATION(), { bitwise => sub { my $rv = \&merge_bitwise ($main::str1,$main::str2) +}, resubst => sub { my $rv = \&merge_resubstr($main::str1,$main::str2) +}, moritz => sub { my $rv = \&merge_moritz ($main::str1,$main::str2) +}, csebe => sub { my $rv = \&merge_csebe ($main::str1,$main::str2) +}, } );

An 180x improvement over the original method, at the cost of being almost illegible. ;) Putting it in perspective, your original solution would take about 2.7 seconds to process a 3.5mb string. In that time the bitwise solution would process 186 strings of that size, or 651MB of data in all. In such case, the application would probably become IO bound.


Dave


In reply to Re: merging strings (xor at char level?) by davido
in thread merging strings (xor at char level?) by csebe

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.