Re: Per residue sequence alignment - per character string comparison?

Here is what I've come up with:

use strict;
use warnings;

my $str1 = "LTIEAVPSNAAEGKEVLLLVHNLPQDPRGYNWYKGETVDANRRIIGYVISNQQITPGP
+AYSNRETIYPNASLLMRNVTRNDTGSYTLQVIKLNLMSEEVTGQ-FSVHPETPKPSISSNNSNPVEDKD
+AVAFTCEPETQNTTYLWWVNGQSLPVSP";
my $str2 = "PTISPSYTYYRPGVNLSLSCHAASNPPAQYSWLIDGNIQQHTQE--------------
+-------------LFISNITEKNSGLYTCQANNSASGHSRTTVKTITVSAELPKPSISSNNSKPVEDKD
+AVAFTCEPEAQNTTYLWWVNGQSLPVSP";
my %hash;
my @a = split('',$str1);
my @b = split('',$str2);

my $a_index = 36; #Start counting at residue number
my $b_index = 206;
for my $i (0 .. $#b) { #Loop through arrays
    
    if ($b[$i] eq "-") { #Check for a gap in str2, if so increment a_i
+ndex and go to the next residue
        $a_index++;
        next;
    }
    if ($a[$i] eq "-") { #Check for a gap in str1, if so store the gap
+ in the hash without a residue #, increment b_index, and go to the ne
+xt residue
        $hash{$b[$i].$b_index}=$a[$i];
        $b_index++;
        next;
    }
    $hash{$b[$i].$b_index}=$a[$i].$a_index; #Else store residue pair i
+n hash
    $b_index++;
    $a_index++;
}

print "value of N254 is $hash{'N254'}\n";
print "value of I255 is $hash{'I255'}\n";
print "value of K280 is $hash{'K280'}\n";
print "value of T281 is $hash{'T281'}\n";
print "value of I282 is $hash{'I282'}\n";
[download]

Which prints:

value of N254 is N111
value of I255 is V112
value of K280 is Q137
value of T281 is -
value of I282 is F138

It seems to be working OK, but now I need to run this on thousands of alignments for an arbitrary number of residues per alignment. If you see any room for improvement or optimization, or errors, please let me know. Thanks again for the help!

Comment on Re: Per residue sequence alignment - per character string comparison? Download Code

Replies are listed 'Best First'.
Re^2: Per residue sequence alignment - per character string comparison? by choroba (Cardinal) on Aug 10, 2016 at 08:33 UTC
For repeated searches, building a lookup table can speed the program up, but you should really Benchmark instead of guessing. Here's what I had tried before you published your code, it counts the dashes, so it adjusts the positions in larger chunks instead of by one: #!/usr/bin/perl use warnings; use strict; use Syntax::Construct qw{ // }; sub position { my ($seq, $query) = @_; my $pos = my $sum = $query->[1] - $seq->{from}; my $start = 0; my $changed = 0; while (my $count = substr($seq->{string}, $start, $pos + 1) =~ tr/ +-//) { ++$changed; $start = $sum; $pos = $count - 1; $sum += $count; } --$sum if $changed > 1; my $expected = substr $seq->{string}, $sum, 1; return $sum, $expected } sub find { my ($seq, $idx) = @_; my $char; my $pos; if('-' ne ( $char = substr $seq->{string}, $idx, 1 )) { $pos = $seq->{from} + $idx; $pos -= substr($seq->{string}, 0, $idx) =~ tr/-//; } return $char, $pos } my %seq_a = ( from => 36, to => 190, string => 'LTIEAVPSNAAEGKEVLLLVHNLPQDPRGYNWYKGETVDANRRIJ +GYVISNQQITPGPAYSNRETIYPNASLXMRNVTRNDTGSYTLQVIKLNLMSEEVTGQ-FSVHPETPKPS +ISSNNSNPVEDKDAVAFTCEPETQNTTYLWWVNGQSLPVSP' ); my %seq_b = ( from => 206, to => 334, string => 'PTISPSYTYYRPGVNLSLSCHAASNPPAQYSWLIDGNIQQHTQE- +--------------------------LFISNITEKNSGLYTCQANNSASGHSRTTVKTIYVSAELPKPS +ISSNNSKPVEDKDAVAFTCEPEAQNTTYLWWVNGQSLPVSP' ); use Test::More; my %tab; for my $pos ($seq_b{from} .. $seq_b{to}) { my ($idx, $char) = position(\%seq_b, [ q() => $pos ]); $tab{"$char$pos"} = join q(), map $_ // 'undef', find(\%seq_a, $id +x); } sub assert { is $tab{"$_[0]$_[1]"}, "$_[2]$_[3]", "$_[0]$_[1]"; } assert(P => 206, L => 36); assert(E => 249, I => 79); assert(L => 250, L => 107); assert(F => 251, X => 108); assert(I => 252, M => 109); assert(S => 253, R => 110); assert(N => 254, N => 111); assert(I => 255, V => 112); assert(E => 257, R => 114); assert(A => 271, N => 128); assert(S => 272, L => 129); assert(G => 273, M => 130); assert(H => 274, S => 131); assert(S => 275, E => 132); assert(R => 276, E => 133); assert(T => 277, V => 134); assert(T => 278, T => 135); assert(V => 279, G => 136); assert(K => 280, Q => 137); assert(T => 281, '-' => 'undef'); assert(I => 282, F => 138); done_testing(); [download] I've changed the sequnces a bit to detect off-by-one errors more easily. ($q=q:Sq=~/;[c](.)(.)/;chr(-\|\|-\|5+lengthSq)`"S\|oS2"`map{chr \|+ord }map{substrSq`S_+\|`\|}3E\|-\|`7**2-3:)=~y+S\|`+$1,++print+eval$q,q,a, [download]	[reply] [d/l] [select]
Re^3: Per residue sequence alignment - per character string comparison? by proteins (Novice) on Aug 10, 2016 at 13:27 UTC
Thank you for the comprehensive reply. Some of this code is beyond my current Perl comprehension, but I'll work through it and let you know if I have any questions. I am currently writing the code to read in the alignments and residues to align from a large text file. I'll run Benchmark on a subset of the data once that is working.	[reply]