garyboyd has asked for the wisdom of the Perl Monks concerning the following question:

What would be the most efficient way of merging two almost identical strings?

I have two dna sequences as strings, for example:

$string1:

AYGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTTA

$string2:

ACGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTKA

$string1 has an ambiguous base on the second character (Y) and $string2 has an ambiguous base on the 52nd character (K). Ambiguous bases can be denoted within the string by the following characters RYSWKMBDHV.

I want to be able to merge multiple DNA strings so that the nucleotide sequences are conserved, except where an ambiguous base occurs, so in the above example I would get a new string that looked like:

$new_string:

AYGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTKA

So the new string contains both the ambiguous base at position 2 and position 52.

Replies are listed 'Best First'.
Re: merging dna sequences
by moritz (Cardinal) on Nov 10, 2011 at 10:30 UTC
    use strict; use warnings; my $a = 'AYGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTTA'; my $b = 'ACGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTKA'; my $merged = $a; while ($b =~ /([RYSWKMBDHV])/g) { my $pos = $-[0]; substr $merged, $pos, 1, $1; } # $merge contains what you want

    The only "unusual" thing in there is the usage of @- to get the match position.

Re: merging dna sequences
by Eliya (Vicar) on Nov 10, 2011 at 11:02 UTC

    In the spirit of TIMTOWTDI, here is another approach:

    my $s1 = "AYGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTTA"; my $s2 = "ACGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTKA"; my $s1_ = $s1; $s1_ =~ tr/ACGT/\0/c; my $s2_ = $s2; $s2_ =~ tr/ACGT/\0/c; my $merged = $s1 ^ $s1_ ^ $s2 ^ $s2_ | $s1_ & $s2_; say $merged; # AYGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTK +A
      I too thought "bitwise string" but ended up with a different mix of operators:
      my $s = 'AYGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTTA'; my $t = 'ACGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTKA'; tr/ACGTA-Z/\0\0\0\0_/ for my ($s_, $t_) = ($s, $t); my $merged = ($s | $t_) & ($t | $s_); say $merged;
        This one is inspired by moritz's solution, which only replaced the last "T" with a "K" in the first string. It has one less bit-op than my original, so should be a bit faster. (LOL... that one snuck up on me while typing)
        my $s = 'AYGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTTA'; my $t = 'ACGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTKA'; (my $del = $t) =~ tr/ACGTA-Z/\0\0\0\0_/; (my $add = $t) =~ tr/ACGT/_/; my $merged = ($s | $del) & $add; say $merged;

      Thank you both for your solutions - would this still work for more than 2 strings?

        Is f'(f'(A,B),C)) the same as F(A,B,C)? If so, then just run a loop on your array of strings, replacing the first two with the results if f'(A,B). By the time you are down to a single string left, you will have the same results as F(A,B,C,....).

        List::Util's reduce function can be helpful in implementing this.

        --MidLifeXis

Re: merging dna sequences
by Sinistral (Monsignor) on Nov 10, 2011 at 14:05 UTC

    And, in the spirit of providing general information based on your request for assistance, make sure to check out the amazing BioPerl site/distribution/documentation!

      ...check out the amazing BioPerl site/distribution/documentation

      While I generally agree with not reinventing the wheel etc. ... IMHO, when performance matters (and it usually does in this domain), it's well worth checking out alternatives to what BioPerl provides, as BioPerl's solutions often tend to be rather slow.