proteins has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am trying to write a script that will accept an alignment of two protein sequences as an input, and given a residue in one sequence return the analogous (aligned) residue in the other sequence.

For example, consider the following alignment output:

SeqA: 36-190	SeqB: 206-334	Length: 155	%identity: 55.137

SeqA: LTIEAVPSNAAEGKEVLLLVHNLPQDPRGYNWYKGETVDANRRIIGYVISNQQITPGPAYSNRETIYPNASLLMRNVTRNDTGSYTLQVIKLNLMSEEVTGQ-FSVHPETPKPSISSNNSNPVEDKDAVAFTCEPETQNTTYLWWVNGQSLPVSP
SeqB: PTISPSYTYYRPGVNLSLSCHAASNPPAQYSWLIDGNIQQHTQE---------------------------LFISNITEKNSGLYTCQANNSASGHSRTTVKTITVSAELPKPSISSNNSKPVEDKDAVAFTCEPEAQNTTYLWWVNGQSLPVSP
                                                                                  ^
Here Sequence A consists of residues numbered 36-190 (represented by single-letter amino acid code, i.e. the sequence begins with leucine (L) 36 and ends with proline (P) 190), with one gap (represented by "-"), and Sequence B consists of residues numbered 206-334 with a larger gap in the middle.

I would like to enter a residue in Sequence B and have the code return the corresponding residue in Sequence A. For instance, if the query was I255 (the residue indicated by the carat), the returned value should be V112 (the corresponding residue in Sequence A).

My code now splits each sequence string into an array, and iterates through the array for SeqB, assigning corresponding residues (e.g. V) and positions (e.g. 112) in Sequence A as values to a hash with the keys being the position in Sequence B, skipping over gaps ("-"), and taking the starting residue number into account.

I was surprised that I couldn't find any other examples of this problem, and was wondering if there was a better way of doing this. Please let me know if any of the above isn't clear.

PearlMonks has proven to be an invaluable resource, and I am very grateful. I look forward to receiving your wisdom as I work through this problem.

  • Comment on Per residue sequence alignment - per character string comparison?

Replies are listed 'Best First'.
Re: Per residue sequence alignment - per character string comparison?
by perldigious (Priest) on Aug 10, 2016 at 01:41 UTC

    If your two sequences are always the exact same length and you simply want to compare equivalent positions (I think I'm understanding that correctly), you may want to simply store each sequence each as their own simple scalar string and make use of the substr function for comparison of positions. That way you could query the position alone and have it return which protein is in that position in both strings.

    Admittedly, I could be oversimplifying the problem if I've misunderstood.

    I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
    I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious
Re: Per residue sequence alignment - per character string comparison?
by proteins (Novice) on Aug 10, 2016 at 03:44 UTC
    Here is what I've come up with:
    use strict; use warnings; my $str1 = "LTIEAVPSNAAEGKEVLLLVHNLPQDPRGYNWYKGETVDANRRIIGYVISNQQITPGP +AYSNRETIYPNASLLMRNVTRNDTGSYTLQVIKLNLMSEEVTGQ-FSVHPETPKPSISSNNSNPVEDKD +AVAFTCEPETQNTTYLWWVNGQSLPVSP"; my $str2 = "PTISPSYTYYRPGVNLSLSCHAASNPPAQYSWLIDGNIQQHTQE-------------- +-------------LFISNITEKNSGLYTCQANNSASGHSRTTVKTITVSAELPKPSISSNNSKPVEDKD +AVAFTCEPEAQNTTYLWWVNGQSLPVSP"; my %hash; my @a = split('',$str1); my @b = split('',$str2); my $a_index = 36; #Start counting at residue number my $b_index = 206; for my $i (0 .. $#b) { #Loop through arrays if ($b[$i] eq "-") { #Check for a gap in str2, if so increment a_i +ndex and go to the next residue $a_index++; next; } if ($a[$i] eq "-") { #Check for a gap in str1, if so store the gap + in the hash without a residue #, increment b_index, and go to the ne +xt residue $hash{$b[$i].$b_index}=$a[$i]; $b_index++; next; } $hash{$b[$i].$b_index}=$a[$i].$a_index; #Else store residue pair i +n hash $b_index++; $a_index++; } print "value of N254 is $hash{'N254'}\n"; print "value of I255 is $hash{'I255'}\n"; print "value of K280 is $hash{'K280'}\n"; print "value of T281 is $hash{'T281'}\n"; print "value of I282 is $hash{'I282'}\n";
    Which prints:
    value of N254 is N111
    value of I255 is V112
    value of K280 is Q137
    value of T281 is -
    value of I282 is F138
    
    It seems to be working OK, but now I need to run this on thousands of alignments for an arbitrary number of residues per alignment. If you see any room for improvement or optimization, or errors, please let me know. Thanks again for the help!
      For repeated searches, building a lookup table can speed the program up, but you should really Benchmark instead of guessing.

      Here's what I had tried before you published your code, it counts the dashes, so it adjusts the positions in larger chunks instead of by one:

      #!/usr/bin/perl use warnings; use strict; use Syntax::Construct qw{ // }; sub position { my ($seq, $query) = @_; my $pos = my $sum = $query->[1] - $seq->{from}; my $start = 0; my $changed = 0; while (my $count = substr($seq->{string}, $start, $pos + 1) =~ tr/ +-//) { ++$changed; $start = $sum; $pos = $count - 1; $sum += $count; } --$sum if $changed > 1; my $expected = substr $seq->{string}, $sum, 1; return $sum, $expected } sub find { my ($seq, $idx) = @_; my $char; my $pos; if('-' ne ( $char = substr $seq->{string}, $idx, 1 )) { $pos = $seq->{from} + $idx; $pos -= substr($seq->{string}, 0, $idx) =~ tr/-//; } return $char, $pos } my %seq_a = ( from => 36, to => 190, string => 'LTIEAVPSNAAEGKEVLLLVHNLPQDPRGYNWYKGETVDANRRIJ +GYVISNQQITPGPAYSNRETIYPNASLXMRNVTRNDTGSYTLQVIKLNLMSEEVTGQ-FSVHPETPKPS +ISSNNSNPVEDKDAVAFTCEPETQNTTYLWWVNGQSLPVSP' ); my %seq_b = ( from => 206, to => 334, string => 'PTISPSYTYYRPGVNLSLSCHAASNPPAQYSWLIDGNIQQHTQE- +--------------------------LFISNITEKNSGLYTCQANNSASGHSRTTVKTIYVSAELPKPS +ISSNNSKPVEDKDAVAFTCEPEAQNTTYLWWVNGQSLPVSP' ); use Test::More; my %tab; for my $pos ($seq_b{from} .. $seq_b{to}) { my ($idx, $char) = position(\%seq_b, [ q() => $pos ]); $tab{"$char$pos"} = join q(), map $_ // 'undef', find(\%seq_a, $id +x); } sub assert { is $tab{"$_[0]$_[1]"}, "$_[2]$_[3]", "$_[0]$_[1]"; } assert(P => 206, L => 36); assert(E => 249, I => 79); assert(L => 250, L => 107); assert(F => 251, X => 108); assert(I => 252, M => 109); assert(S => 253, R => 110); assert(N => 254, N => 111); assert(I => 255, V => 112); assert(E => 257, R => 114); assert(A => 271, N => 128); assert(S => 272, L => 129); assert(G => 273, M => 130); assert(H => 274, S => 131); assert(S => 275, E => 132); assert(R => 276, E => 133); assert(T => 277, V => 134); assert(T => 278, T => 135); assert(V => 279, G => 136); assert(K => 280, Q => 137); assert(T => 281, '-' => 'undef'); assert(I => 282, F => 138); done_testing();

      I've changed the sequnces a bit to detect off-by-one errors more easily.

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

        Thank you for the comprehensive reply. Some of this code is beyond my current Perl comprehension, but I'll work through it and let you know if I have any questions.

        I am currently writing the code to read in the alignments and residues to align from a large text file. I'll run Benchmark on a subset of the data once that is working.

Re: Per residue sequence alignment - per character string comparison?
by choroba (Cardinal) on Aug 10, 2016 at 02:09 UTC
    What's the expected output for T281 and the surrounding?

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Per residue sequence alignment - per character string comparison?
by proteins (Novice) on Aug 10, 2016 at 02:28 UTC

    I was thinking of using substr initially, but was having difficulty with the presence of gaps in the alignment throwing off the numbering. Thus I was going through the string counting non-gap characters, and figured it would be faster and easier to read in the strings and generate a lookup hash.

    The expected output for T281 would be a gap without a position number, or just return false. Based on how the alignments are done, it would be highly unlikely for a desired residue in the query strand to align with a gap in the subject strand. Likewise, K280 would return Q137, and I282 would return F138.

      It looks like choroba has provided you with a working solution to your problem.

      Just a tip for this type of forum for asking questions, it's usually helpful if you reply directly to each node where a follow up question is asked with the answer to that question. In the case here of your node I'm replying directly to, your first paragraph would fit more logically as a response to my question and your second would fit more logically as a response to choroba's question. That way questions and answers "stick" together and make it easier for other monks to read and understand what's going on, which in turn makes it easier for them to help you :-).

      Not using direct replies this way means your node here may not necessarily stay near to the follow up questions that were asked, which can be confusing. It's not a big deal here because there were not very many nodes in the thread anyway, but if the thread gets larger the sequence of the conversation can quickly be lost.

      I love it when things get difficult; after all, difficult pays the mortgage. - Dr. Keith Whites
      I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious
        Thank you for the tip. I didn't realize the nodes floated around by replies instead of being in a traditional threaded format.