neversaint has asked for the wisdom of the Perl Monks concerning the following question:
Dear Masters,
I have the following data and problem:
my $k =25;
my %readrepo =(
"readA" => "GCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCA",
"readB" => "TACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAAC",
"readC" => "GCTGAGGCAGGAGAATTGCTTGAACTTAGGGGATG",
"readD" => "TACTCGGGAGGCTGAGGCAGGAGAATTGCTTGAAC",
);
# This array is already ordered
# It says the first(_1) 25 bases of readA overlap with second(_2)part
+of readB, etc.
my @readstoconcate = (
"readA_1",
"readB_2",
"readC_1",
"readD_2");
What I want to do is to assemble the reads in %readrepo based on the order
given by @readstoconcatenate
readA GCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCA
readB TACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAAC
readC GCTGAGGCAGGAGAATTGCTTGAACTTAGGGGATG
readD TACTCGGGAGGCTGAGGCAGGAGAATTGCTTGAAC
Yielding the final long stretch of sequence
TACTC(A,G)GGAGGAGAATTGCTTGAACCTGGGAGGCA(T,C)T(A,G)GG(A,G)G(A,G)(T,C)(A
+,G)
How should one go about it?
Update: Corrected the answer bug above. Thanks to BrowserUK suggestion.
---
neversaint and everlastingly indebted.......
Re: Mustering Reads
by Corion (Patriarch) on Nov 01, 2010 at 10:14 UTC
|
Likely, substr is all you need, if you already know what parts overlap and what parts do not overlap. Once you have the non-overlapping parts, you can go through them character by character and if they match, you output that character, and if they don't match, you output the pair. I think there can only be two pairs anyway, so a fancy trick would be to use binary XOR on the "nonmatched" parts:
use strict;
my @nonmatched = qw(TACTCAGGAG TACTCGGGAG);
my $difference = $nonmatched[0] ^ $nonmatched[ 1 ];
# Replace all zeroes (= matches) by the original string:
$difference =~ s/\0/substr($nonmatched[0],pos($difference),1)/ge;
# Replace the "weird" parts with the proper pairs
my %pairs = (
'C' ^ 'T' => '(C,T)',
'A' ^ 'G' => '(A,G)',
);
$difference =~ s/(.)/$pairs{$1} || $1/ge;
print $difference;
| [reply] [d/l] |
Re: Mustering Reads
by JavaFan (Canon) on Nov 01, 2010 at 12:16 UTC
|
The following does seem to do the trick. It should work for any number of "repo"s. It does assume all are of the same length, but it wouldn't be complicated to adjust for varying lengths:
use 5.010;
use strict;
use warnings;
my $k = 25;
my %readrepo =(
readA => "GCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCA",
readB => "TACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAAC",
readC => "GCTGAGGCAGGAGAATTGCTTGAACTTAGGGGATG",
readD => "TACTCGGGAGGCTGAGGCAGGAGAATTGCTTGAAC",
);
my @readstoconcate = (
"readA_1",
"readB_2",
"readC_1",
"readD_2",
);
my @prefixes;
my @postfixes;
my $fixed;
foreach my $read (@readstoconcate) {
my ($entry, $tag) = split '_', $read;
if ($tag == 1) {
push @postfixes, substr $readrepo{$entry}, $k;
$fixed ||= substr $readrepo{$entry}, 0, $k;
}
else {
my $k1 = length($readrepo{$entry}) - $k;
push @prefixes, substr $readrepo{$entry}, 0, $k1;
$fixed ||= substr $readrepo{$entry}, $k1;
}
}
#
# Assume all entries in %readrepo are the same length.
#
foreach my $set (\@prefixes, \@postfixes) {
next unless @$set;
for (my $i = 0; $i < length $$set[0]; $i++) {
my $l = "\x0";
my @c = grep {my $r = ($_ ne $l); $l = $_; $r}
map {substr $_, $i, 1} @$set;
local $" = ",";
print @c == 1 ? $c[0] : "(@c)";
}
}
continue {
state $flag = 0;
print $fixed unless $flag++;
}
print "\n";
__END__
TACTC(A,G)GGAGGCTGAGGCAGGAGAATTGCTTGAAC(C,T)T(G,A)GG(A,G)G(G,A)(C,T)(A
+,G)
| [reply] [d/l] |
Re: Mustering Reads
by BrowserUk (Patriarch) on Nov 01, 2010 at 11:12 UTC
|
This produces a slightly different output for your sample than you provided--but I think mine is right :)
(You'll have to explain to me why it isn't, if it isn't.)
This could probably be made quite a bit more efficient with further thought, but I wanted to check if it is correct first.
#! perl -slw
use strict;
my $k = 25;
my %repo = (
"readA" => "GCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCA",
"readB" => "TACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAAC",
"readC" => "GCTGAGGCAGGAGAATTGCTTGAACTTAGGGGATG",
"readD" => "TACTCGGGAGGCTGAGGCAGGAGAATTGCTTGAAC",
);
my @order = ( "readA_1", "readB_2", "readC_1", "readD_2");
my( @heads, @tails, $common );
while( @order ) {
my( $s1, $p1, $s2, $p2 ) = map split( '_', shift @order ), 1 .. 2;
( $s1, $s2 ) = ( $s2, $s1 ) if $p1 > $p2;
push @heads, substr $repo{ $s2 }, 0, length( $repo{ $s2 } ) - $k;
push @tails, substr $repo{ $s1 }, $k;
$common = substr $repo{ $s1 }, -$k unless $common;
}
my $head = '';
for my $p ( 0 .. length( $heads[0] )-1 ) {
my %uniq;
++$uniq{ substr $heads[ $_ ], $p, 1 } for 0 .. $#heads;
if( keys %uniq > 1 ) {
$head .= '(' . join( ',', keys %uniq ) . ')';
}
else {
$head .= each %uniq;
}
}
my $tail = '';
for my $p ( 0 .. length( $tails[0] )-1 ) {
my %uniq;
++$uniq{ substr $tails[ $_ ], $p, 1 } for 0 .. $#tails;
if( keys %uniq > 1 ) {
$tail .= '(' . join( ',', keys %uniq ) . ')';
}
else {
$tail .= each %uniq;
}
}
print $head, $common, $tail;
__END__
c:\test>868716
TACTC(A,G)GGAGGAGAATTGCTTGAACCTGGGAGGCA(T,C)T(A,G)GG(A,G)G(A,G)(T,C)(A
+,G)
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
|
Dear BrowserUK,
I was trying the following input to your code above:
my $k = 15;
my %repo = (
"readA" => "AAACATGAAAAGAAATGATGAAACAGA",
"readB" => "AAATCATGAAACAGAGCCTCATCTCCC ",
);
my @order = ( "readB_1","readA_2");
Why it gives this output
AAACATGAAAAGGAGCCTCATCTCCC GCCTCATCTCCC
Instead of:
AAACATGAAAAGAAAT(G,C)ATGAAACAGAGCCTCATCTCCC
Following this alignment:
readA AAACATGAAAAGAAATGATGAAACAGA
readB AAATCATGAAACAGAGCCTCATCTCCC
How can I fix it?
---
neversaint and everlastingly indebted.......
| [reply] [d/l] [select] |
|
How can I fix it?
By giving better examples :) I had assumed on the basis of the sample supplied, that the overlapping sections were identical as in both cases supplied. I think javafans make that same assumption.
You'd need to accumulate the overlaps in a third array in the while, and then add another for loop to process them is the same way as @heads & @tails. (There obviously scope there for making the for loop a subroutine.
I'm done for today, but I'll be back tonight.
| [reply] |
|
#! perl -slw
use strict;
use Data::Dump qw[ pp ];
my $k = 15;
my %repo = (
"readA" => "AAACATGAAAAGAAATGATGAAACAGA",
"readB" => "AAATCATGAAACAGAGCCTCATCTCCC ",
);
my @order = ( "readB_1","readA_2");
#my $k = 25;
#my %repo = (
# "readA" => "GCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCA",
# "readB" => "TACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAAC",
# "readC" => "GCTGAGGCAGGAGAATTGCTTGAACTTAGGGGATG",
# "readD" => "TACTCGGGAGGCTGAGGCAGGAGAATTGCTTGAAC",
#);
#my @order = ( "readA_1", "readB_2", "readC_1", "readD_2");
sub muster {
local $^W;
my( $bits ) = @_;
my $muster = '';
for my $p ( 0 .. length( $bits->[0] )-1 ) {
my %uniq;
++$uniq{ substr $bits->[ $_ ], $p, 1 } for 0 .. $#$bits;
if( keys %uniq > 1 ) {
$muster .= '(' . join( ',', keys %uniq ) . ')';
}
else {
$muster .= each %uniq;
}
}
return $muster;
}
my( @heads, @bodys, @tails );
while( @order ) {
my( $s1, $p1, $s2, $p2 ) = map split( '_', shift @order ), 1 .. 2;
( $s1, $s2 ) = ( $s2, $s1 ) if $p1 > $p2;
push @heads, substr $repo{ $s2 }, 0, length( $repo{ $s2 } ) - $k;
push @bodys, substr $repo{ $s2 }, -$k;
push @tails, substr $repo{ $s1 }, $k;
push @bodys, substr $repo{ $s1 }, 0, $k;
}
my $head = muster( \@heads );
my $body = muster( \@bodys );
my $tail = muster( \@tails );
print $head, $body, $tail;
__END__
TACTC(A,G)GGAGGAGAATTGCTTGAACCTGGGAGGCA(T,C)T(A,G)GG(A,G)G(A,G)(T,C)(A
+,G)
TACTC(A,G)GGAGGCTGAGGCAGGAGAATTGCTTGAAC(T,C)T(A,G)GG(A,G)G(A,G)(T,C)(A
+,G)
AAACATGAAAAGAAAT(C,G)ATGAAACAGAGCCTCATCTCCC
AAACATGAAAAGAAAT(G,C)ATGAAACAGAGCCTCATCTCCC
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
Re: Mustering Reads (definition?)
by LanX (Saint) on Nov 01, 2010 at 11:11 UTC
|
Sorry, could you please explain the meaning of "mustering" or provide a link with a definition?
I tried to google it, but without much success and I'm not proficient in the arts of bioinformatics.
| [reply] |
|
|