If your two sequences are always the exact same length and you simply want to compare equivalent positions (I think I'm understanding that correctly), you may want to simply store each sequence each as their own simple scalar string and make use of the substr function for comparison of positions. That way you could query the position alone and have it return which protein is in that position in both strings.
Admittedly, I could be oversimplifying the problem if I've misunderstood.
I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious
| [reply] [d/l] |
Here is what I've come up with: use strict;
use warnings;
my $str1 = "LTIEAVPSNAAEGKEVLLLVHNLPQDPRGYNWYKGETVDANRRIIGYVISNQQITPGP
+AYSNRETIYPNASLLMRNVTRNDTGSYTLQVIKLNLMSEEVTGQ-FSVHPETPKPSISSNNSNPVEDKD
+AVAFTCEPETQNTTYLWWVNGQSLPVSP";
my $str2 = "PTISPSYTYYRPGVNLSLSCHAASNPPAQYSWLIDGNIQQHTQE--------------
+-------------LFISNITEKNSGLYTCQANNSASGHSRTTVKTITVSAELPKPSISSNNSKPVEDKD
+AVAFTCEPEAQNTTYLWWVNGQSLPVSP";
my %hash;
my @a = split('',$str1);
my @b = split('',$str2);
my $a_index = 36; #Start counting at residue number
my $b_index = 206;
for my $i (0 .. $#b) { #Loop through arrays
if ($b[$i] eq "-") { #Check for a gap in str2, if so increment a_i
+ndex and go to the next residue
$a_index++;
next;
}
if ($a[$i] eq "-") { #Check for a gap in str1, if so store the gap
+ in the hash without a residue #, increment b_index, and go to the ne
+xt residue
$hash{$b[$i].$b_index}=$a[$i];
$b_index++;
next;
}
$hash{$b[$i].$b_index}=$a[$i].$a_index; #Else store residue pair i
+n hash
$b_index++;
$a_index++;
}
print "value of N254 is $hash{'N254'}\n";
print "value of I255 is $hash{'I255'}\n";
print "value of K280 is $hash{'K280'}\n";
print "value of T281 is $hash{'T281'}\n";
print "value of I282 is $hash{'I282'}\n";
Which prints:
value of N254 is N111
value of I255 is V112
value of K280 is Q137
value of T281 is -
value of I282 is F138
It seems to be working OK, but now I need to run this on thousands of alignments for an arbitrary number of residues per alignment. If you see any room for improvement or optimization, or errors, please let me know. Thanks again for the help! | [reply] [d/l] |
#!/usr/bin/perl
use warnings;
use strict;
use Syntax::Construct qw{ // };
sub position {
my ($seq, $query) = @_;
my $pos = my $sum = $query->[1] - $seq->{from};
my $start = 0;
my $changed = 0;
while (my $count = substr($seq->{string}, $start, $pos + 1) =~ tr/
+-//) {
++$changed;
$start = $sum;
$pos = $count - 1;
$sum += $count;
}
--$sum if $changed > 1;
my $expected = substr $seq->{string}, $sum, 1;
return $sum, $expected
}
sub find {
my ($seq, $idx) = @_;
my $char;
my $pos;
if('-' ne ( $char = substr $seq->{string}, $idx, 1 )) {
$pos = $seq->{from} + $idx;
$pos -= substr($seq->{string}, 0, $idx) =~ tr/-//;
}
return $char, $pos
}
my %seq_a = ( from => 36,
to => 190,
string => 'LTIEAVPSNAAEGKEVLLLVHNLPQDPRGYNWYKGETVDANRRIJ
+GYVISNQQITPGPAYSNRETIYPNASLXMRNVTRNDTGSYTLQVIKLNLMSEEVTGQ-FSVHPETPKPS
+ISSNNSNPVEDKDAVAFTCEPETQNTTYLWWVNGQSLPVSP' );
my %seq_b = ( from => 206,
to => 334,
string => 'PTISPSYTYYRPGVNLSLSCHAASNPPAQYSWLIDGNIQQHTQE-
+--------------------------LFISNITEKNSGLYTCQANNSASGHSRTTVKTIYVSAELPKPS
+ISSNNSKPVEDKDAVAFTCEPEAQNTTYLWWVNGQSLPVSP' );
use Test::More;
my %tab;
for my $pos ($seq_b{from} .. $seq_b{to}) {
my ($idx, $char) = position(\%seq_b, [ q() => $pos ]);
$tab{"$char$pos"} = join q(), map $_ // 'undef', find(\%seq_a, $id
+x);
}
sub assert {
is $tab{"$_[0]$_[1]"}, "$_[2]$_[3]", "$_[0]$_[1]";
}
assert(P => 206, L => 36);
assert(E => 249, I => 79);
assert(L => 250, L => 107);
assert(F => 251, X => 108);
assert(I => 252, M => 109);
assert(S => 253, R => 110);
assert(N => 254, N => 111);
assert(I => 255, V => 112);
assert(E => 257, R => 114);
assert(A => 271, N => 128);
assert(S => 272, L => 129);
assert(G => 273, M => 130);
assert(H => 274, S => 131);
assert(S => 275, E => 132);
assert(R => 276, E => 133);
assert(T => 277, V => 134);
assert(T => 278, T => 135);
assert(V => 279, G => 136);
assert(K => 280, Q => 137);
assert(T => 281, '-' => 'undef');
assert(I => 282, F => 138);
done_testing();
I've changed the sequnces a bit to detect off-by-one errors more easily.
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
| [reply] [d/l] [select] |
Thank you for the comprehensive reply.
Some of this code is beyond my current Perl comprehension, but I'll work through it and let you know if I have any questions.
I am currently writing the code to read in the alignments and residues to align from a large text file. I'll run Benchmark on a subset of the data once that is working.
| [reply] |
What's the expected output for T281 and the surrounding?
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
| [reply] [d/l] |
I was thinking of using substr initially, but was having difficulty with the presence of gaps in the alignment throwing off the numbering. Thus I was going through the string counting non-gap characters, and figured it would be faster and easier to read in the strings and generate a lookup hash.
The expected output for T281 would be a gap without a position number, or just return false. Based on how the alignments are done, it would be highly unlikely for a desired residue in the query strand to align with a gap in the subject strand. Likewise, K280 would return Q137, and I282 would return F138.
| [reply] |
It looks like choroba has provided you with a working solution to your problem.
Just a tip for this type of forum for asking questions, it's usually helpful if you reply directly to each node where a follow up question is asked with the answer to that question. In the case here of your node I'm replying directly to, your first paragraph would fit more logically as a response to my question and your second would fit more logically as a response to choroba's question. That way questions and answers "stick" together and make it easier for other monks to read and understand what's going on, which in turn makes it easier for them to help you :-).
Not using direct replies this way means your node here may not necessarily stay near to the follow up questions that were asked, which can be confusing. It's not a big deal here because there were not very many nodes in the thread anyway, but if the thread gets larger the sequence of the conversation can quickly be lost.
I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious
| [reply] |
Thank you for the tip. I didn't realize the nodes floated around by replies instead of being in a traditional threaded format.
| [reply] |