Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

merging strings (xor at char level?)

by csebe (Initiate)
on Jul 03, 2014 at 07:47 UTC ( [id://1092117]=perlquestion: print w/replies, xml ) Need Help??

csebe has asked for the wisdom of the Perl Monks concerning the following question:

Hello wisdom masters, I have the following two strings:
$str1 = "12345 ABC 987 MNO"; $str2 = " CDE";
And I need to merge the strings as to get:
$str = "12345 CDE 987 MNO";

So, kind of a XOR at a character level if you want.

Being looong strings and maaaany in the files I am processing, I look for your wise input to make it as time efficient as possible, much more than the "obvious" solution:
my @arr1 = split //, $str1; my @arr2 = split //, $str2; foreach (my $i=0;$i<@arr1;$i++) { if ($arr2[$i]) { # we do have something in the second string if (($arr1[$i] ne " ") and ($arr2[$i] eq " ")) { $arr2[$i] = $arr1[$i]; } } else { #$str2 is empty now, so copy all $arr2[$i] = $arr1[$i]; } } print join "",@arr2,"\n";
Thanks everyone,
Lian

Replies are listed 'Best First'.
Re: merging strings (xor at char level?)
by moritz (Cardinal) on Jul 03, 2014 at 08:09 UTC

    Instead of going through the string(s) one character at a time, you could split the second string into whitespace and non-whitespace parts, and then apply them blockwise. That way you have far fewer iterations and operations to do.

    use 5.014; use strict; use warnings; my $str1 = "12345 ABC 987 MNO"; my $str2 = " CDE"; my $pos = 0; my $res = ''; for my $chunk (split /([^ ]+)/, $str2) { if (substr($chunk, 0, 1) eq ' ') { $res .= substr $str1, $pos, length $chunk; } else { $res .= $chunk; } } continue { $pos += length $chunk; } if ($pos < length $str1) { $res .= substr $str1, $pos; } say $res;

    If you really want it to be fast and correct, provide us some test cases (Test::Simple or Test::More to the rescue!) and some Benchmarks into which we can plug our solutions.

Re: merging strings (xor at char level?)
by davido (Cardinal) on Jul 03, 2014 at 07:55 UTC

    It looks like "PRM" in your sample output is a typo -- or I simply don't understand the conversion. (I would like to help, but think before we optimize we should fully understand what we're optimizing.) I'm going to assume it's just a typo for now.

    Also, I realize this is just example code, but have you profiled real code with your real files to see where you're spending all your time? If you're spending all your time in IO, this algorithm may not need improving.

    With those caveats in mind, I wanted to give it a shot using a bit-twiddling approach. This solution pushes as much work as I can manage into internals (without resorting to XS), and performs bitwise operations on the strings to derive the result. I think you will find the performance to be decent.:

    my $str1 = "12345 ABC 987 MNO"; my $str2 = " CDE"; sub merge { my $s2 = $_[1] =~ tr/ /\x{0}/r; return $s2 | $_[0] & ( "\x{ff}" x length $_[0] ^ $s2 =~ tr/\x{00}/\x +{FF}/cr ); } print merge( $str1, $str2 ), "\n";

    ...output...

    12345 CDE 987 MNO

    I really wish I could get rid of the tr///'s, but my bit-fu is maxed out for now. ;)

    This bitwise method works by masking off the strings and then or-ing them together. It's a little hard to read, but that might be the price one pays for a fairly efficient solution.

    A last resort, if this doesn't do it for you, would be to re-implement it with Inline::C, though there's so much already pushed into internals that it probably wouldn't gain as much as one would hope. You might also spawn workers that each handle one input file, and limit number of workers to some sane amount. If you're currently CPU bound, but only using one core, this makes sense.

    Beware: Nothing will mojibake your Unicode like treating it as raw bits and bytes. This is for ASCII only.

    Update: Benchmark results are in a followup later in this thread: Re: merging strings (xor at char level?). This bit-twiddling approach is a clear winner for both short and multi-megabyte strings.


    Dave

Re: merging strings (xor at char level?)
by davido (Cardinal) on Jul 03, 2014 at 16:01 UTC

    I really should generate my own huge sample strings to test with, but given your sample input, here are how four implementations benchmark (two of them are mine, one is from moritz, and one is yours).

    Implementations check OK. Rate csebe moritz resubst bitwise csebe 81457/s -- -78% -88% -92% moritz 377674/s 364% -- -45% -63% resubst 686400/s 743% 82% -- -33% bitwise 1030128/s 1165% 173% 50% --

    And here's the benchmark code:

    The bitwise version is a pretty clear winner for the short test input. I think a 10x increase is probably a worthwhile optimization, if it holds true for long strings too.

    Update: I found a little time to test larger strings. For strings that are about 3.5MB, the results are really compelling in favor of the bitwise method:

    # String length is 3515KB. ok 1 - Bitwise ok 2 - Regex/Substr ok 3 - moritz ok 4 - csebe 1..4 Rate csebe moritz resubst bitwise csebe 0.376/s -- -91% -95% -99% moritz 4.11/s 995% -- -46% -94% resubst 7.63/s 1931% 86% -- -89% bitwise 69.1/s 18302% 1581% 806% --

    And here is the updated benchmark code:

    An 180x improvement over the original method, at the cost of being almost illegible. ;) Putting it in perspective, your original solution would take about 2.7 seconds to process a 3.5mb string. In that time the bitwise solution would process 186 strings of that size, or 651MB of data in all. In such case, the application would probably become IO bound.


    Dave

      Jeez, I feel really bad for not visiting this page again since original posting. The site failed somehow to announce me that I got replies to my topic (or I might have had problems with my email?), but still, it looks terrible...

      Excellent idea and testing work Dave. Also thanks Moritz.

      Again, sorry for being an assh0le.

      Cheers, Lian
        There are no email announcments. You get in-site personal messages, though.
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: merging strings (xor at char level?)
by perlfan (Vicar) on Jul 03, 2014 at 11:39 UTC
    Is your data in fixed width fields?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1092117]
Approved by davido
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (2)
As of 2024-04-20 05:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found