merging dna sequences

garyboyd has asked for the wisdom of the Perl Monks concerning the following question:

What would be the most efficient way of merging two almost identical strings?

I have two dna sequences as strings, for example:

$string1:

AYGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTTA

$string2:

ACGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTKA

$string1 has an ambiguous base on the second character (Y) and $string2 has an ambiguous base on the 52nd character (K). Ambiguous bases can be denoted within the string by the following characters RYSWKMBDHV.

I want to be able to merge multiple DNA strings so that the nucleotide sequences are conserved, except where an ambiguous base occurs, so in the above example I would get a new string that looked like:

$new_string:

AYGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTKA

So the new string contains both the ambiguous base at position 2 and position 52.

Comment on merging dna sequences

Replies are listed 'Best First'.
Re: merging dna sequences by moritz (Cardinal) on Nov 10, 2011 at 10:30 UTC
`use strict; use warnings; my $a = 'AYGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTTA'; my $b = 'ACGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTKA'; my $merged = $a; while ($b =~ /([RYSWKMBDHV])/g) { my $pos = $-[0]; substr $merged, $pos, 1, $1; } # $merge contains what you want` [download] The only "unusual" thing in there is the usage of @- to get the match position. Perl 6 - second systems done right	[reply] [d/l]
Re: merging dna sequences by Eliya (Vicar) on Nov 10, 2011 at 11:02 UTC
In the spirit of TIMTOWTDI, here is another approach: `my $s1 = "AYGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTTA"; my $s2 = "ACGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTKA"; my $s1_ = $s1; $s1_ =~ tr/ACGT/\0/c; my $s2_ = $s2; $s2_ =~ tr/ACGT/\0/c; my $merged = $s1 ^ $s1_ ^ $s2 ^ $s2_ \| $s1_ & $s2_; say $merged; # AYGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTK +A` [download]	[reply] [d/l]
Re^2: merging dna sequences by Anonymous Monk on Nov 10, 2011 at 22:11 UTC
I too thought "bitwise string" but ended up with a different mix of operators: `my $s = 'AYGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTTA'; my $t = 'ACGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTKA'; tr/ACGTA-Z/\0\0\0\0_/ for my ($s_, $t_) = ($s, $t); my $merged = ($s \| $t_) & ($t \| $s_); say $merged;` [download]	[reply] [d/l]
Re^3: merging dna sequences by Anonymous Monk on Nov 10, 2011 at 23:35 UTC
This one is inspired by moritz's solution, which only replaced the last "T" with a "K" in the first string. It has one less bit-op than my original, so should be a bit faster. (LOL... that one snuck up on me while typing) `my $s = 'AYGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTTA'; my $t = 'ACGTACTAGACTACAGACTACAGACATCTACAGACTCATCAGCAGCATATTKA'; (my $del = $t) =~ tr/ACGTA-Z/\0\0\0\0_/; (my $add = $t) =~ tr/ACGT/_/; my $merged = ($s \| $del) & $add; say $merged;` [download]	[reply] [d/l]
Re^2: merging dna sequences by garyboyd (Acolyte) on Nov 10, 2011 at 12:29 UTC
Thank you both for your solutions - would this still work for more than 2 strings?	[reply]
Re^3: merging dna sequences by MidLifeXis (Monsignor) on Nov 10, 2011 at 14:09 UTC
Is `f'(f'(A,B),C))` the same as `F(A,B,C)`? If so, then just run a loop on your array of strings, replacing the first two with the results if `f'(A,B)`. By the time you are down to a single string left, you will have the same results as `F(A,B,C,....)`. List::Util's reduce function can be helpful in implementing this. --MidLifeXis	[reply] [d/l] [select]
Re^4: merging dna sequences by Eliya (Vicar) on Nov 10, 2011 at 14:14 UTC
Re^5: merging dna sequences by garyboyd (Acolyte) on Nov 10, 2011 at 16:07 UTC
Re: merging dna sequences by Sinistral (Monsignor) on Nov 10, 2011 at 14:05 UTC
And, in the spirit of providing general information based on your request for assistance, make sure to check out the amazing BioPerl site/distribution/documentation!	[reply]
Re^2: merging dna sequences by Eliya (Vicar) on Nov 10, 2011 at 14:30 UTC
...check out the amazing BioPerl site/distribution/documentation While I generally agree with not reinventing the wheel etc. ... IMHO, when performance matters (and it usually does in this domain), it's well worth checking out alternatives to what BioPerl provides, as BioPerl's solutions often tend to be rather slow.	[reply]