Translation Substring Error

by FIJI42 (Acolyte)
on Nov 09, 2017 at 15:26 UTC

FIJI42 has asked for the wisdom of the Perl Monks concerning the following question:

I have a subroutine for a basic one-frame translation that is giving me the warnings "Use of uninitialized value $codon in hash element" and "substr outside of string". I think the problem is that I need to modify the subroutine's for loop to account for nucleotide sequences whose length is not a multiple of 3.

Does anyone have suggestions for how to modify the code properly?

Here is the subroutine I'm using in a simple example:

    use strict;
    use warnings;

    my $amino_acid='';
    my $s1 = 'ATGCCCGTAC';   ## Sequence 1
    my $s2 = 'GCTTCCCAGCGC'; ## Sequence 2

    print "Sequence 1 Translation:";
    OneFrameTranslation ($s1); ## Calls subroutine
    print "$amino_acid\n";

    print "Sequence 2 Translation:";
    OneFrameTranslation ($s2); ## Calls subroutine
    print "$amino_acid\n";

    ### Subroutine ###
    sub OneFrameTranslation
    {
        my ($seq) = shift;
        my $amino_acid='';
        my $seqarray='';
        my %genetic_code = (
            'TTT' => 'F', 'TTC' => 'F', 'TTA' => 'L', 'TTG' => 'L',
            'CTT' => 'L', 'CTC' => 'L', 'CTA' => 'L', 'CTG' => 'L',
            'ATT' => 'I', 'ATC' => 'I', 'ATA' => 'I', 'ATG' => 'M',
            'GTT' => 'V', 'GTC' => 'V', 'GTA' => 'V', 'GTG' => 'V',
            'TCT' => 'S', 'TCC' => 'S', 'TCA' => 'S', 'TCG' => 'S',
            'CCT' => 'P', 'CCC' => 'P', 'CCA' => 'P', 'CCG' => 'P',
            'ACT' => 'T', 'ACC' => 'T', 'ACA' => 'T', 'ACG' => 'T',
            'GCT' => 'A', 'GCC' => 'A', 'GCA' => 'A', 'GCG' => 'A',
            'TAT' => 'Y', 'TAC' => 'Y', 'TAA' => '*', 'TAG' => '*',
            'CAT' => 'H', 'CAC' => 'H', 'CAA' => 'Q', 'CAG' => 'Q',
            'AAT' => 'N', 'AAC' => 'N', 'AAA' => 'K', 'AAG' => 'K',
            'GAT' => 'D', 'GAC' => 'D', 'GAA' => 'E', 'GAG' => 'E',
            'TGT' => 'C', 'TGC' => 'C', 'TGA' => '*', 'TGG' => 'W',
            'CGT' => 'R', 'CGC' => 'R', 'CGA' => 'R', 'CGG' => 'R',
            'AGT' => 'S', 'AGC' => 'S', 'AGA' => 'R', 'AGG' => 'R',
            'GGT' => 'G', 'GGC' => 'G', 'GGA' => 'G', 'GGG' => 'G'
        );
        ## '---' = 3 character codon in hash above
        ## '-' = one letter amino acid abbreviation in hash above

        my @seqarray = split(//,$seq); ## Explodes the string

        for (my $i=0; $i<=$#seqarray-2; $i=$i+3)
        {
            my $codon = substr($seqarray,$i,3);
            $amino_acid = $genetic_code{$codon};
        }
        return ($amino_acid);
    }

Replies are listed 'Best First'.
Re: Translation Substring Error (updated)
by haukex (Archbishop) on Nov 09, 2017 at 15:47 UTC

    @seqarray and $seqarray are two different variables, and you never assign anything to $seqarray, so using substr on it does not make much sense. I suspect you just want to look directly at $seq instead of splitting it (BTW, to get multiple elements out of an array, use Slices or splice). Also, note that you overwrite $amino_acid on every loop iteration. The following minimal changes make your code work for me:

        my $seq = shift;
        my $amino_acid;
        for (my $i=0; $i<=length($seq)-3; $i=$i+3) {
            my $codon = substr($seq,$i,3);
            $amino_acid .= $genetic_code{$codon};
        }
        return $amino_acid;

    <update2> Fixed an off-by-one error in the above code; I initially incorrectly translated your $#seqarray-2 into length($seq)-2 ($#seqarray returns the last index of the array, not its length like scalar(@seqarray) does, or length does for strings). That's a good argument against the classic for(;;) and for the two solutions below instead :-) </update2>
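A minimal sketch of the off-by-one point above, not part of the original reply. The 11-character sequence and the variable names `@correct` and `@too_far` are made up for illustration; the sketch compares the start offsets each loop bound would visit.

```perl
use strict;
use warnings;

my $seq      = 'ATGCCCGTACG';      # 11 characters: three codons plus 'CG'
my @seqarray = split //, $seq;

# $#seqarray is the last index (10), one less than the element count (11).
printf "last index: %d, count: %d, length: %d\n",
    $#seqarray, scalar(@seqarray), length($seq);

# Start offsets visited with the corrected bound, length($seq)-3.
my @correct = grep { $_ % 3 == 0 } 0 .. length($seq) - 3;

# Start offsets visited with the mistaken bound, length($seq)-2.
my @too_far = grep { $_ % 3 == 0 } 0 .. length($seq) - 2;

print "correct: @correct\n";    # 0 3 6
print "too far: @too_far\n";    # 0 3 6 9

# The extra offset yields an incomplete codon, which is not a key in the table.
print "partial codon: ", substr($seq, $too_far[-1], 3), "\n";   # CG
```

The partial codon is what would trigger an "uninitialized value" warning when its missing table entry is appended.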

    If you output the return value from OneFrameTranslation (your current code is ignoring the return value), this gives you:

        print OneFrameTranslation('ATGCCCGTAC'),"\n";
        print OneFrameTranslation('GCTTCCCAGCGC'),"\n";
        __END__
        MPV
        ASQR

    By the way, you can probably move your %genetic_code to the top of your code (outside of the sub), so that it only gets initialized once instead of on every call to the sub, and making its name uppercase is the usual convention to indicate it is a constant that should not be changed.

    Another way to break up a string is with regular expressions. The following also works: it matches three characters, then matches again at the position where the previous match finished, and so on:

        my $amino_acid;
        while ($seq=~/\G(...)/sg) {
            $amino_acid .= $genetic_code{$1};
        }
        return $amino_acid;
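To illustrate what the \G anchor does, here is a small sketch that is not from the original reply. It assumes a stricter pattern, `[ACGT]{3}`, instead of `...`, so that a triplet can fail to match; the input string and the names `@anchored` and `@unanchored` are hypothetical.

```perl
use strict;
use warnings;

my $seq = 'ATGANCGTAC';   # the second triplet contains an unknown base 'N'

# Anchored with \G: each match must start exactly where the previous one
# ended, so matching stops at the first triplet that is not pure A/C/G/T.
my @anchored = $seq =~ /\G([ACGT]{3})/g;

# Not anchored: the engine skips ahead to the next position that matches,
# which silently shifts the reading frame.
my @unanchored = $seq =~ /([ACGT]{3})/g;

print "@anchored\n";     # ATG
print "@unanchored\n";   # ATG CGT
```

With the original `...` pattern every triplet matches, so the anchor only matters once the pattern becomes more selective.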

    Or, possibly going a little overboard, here's a technique I describe in Building Regex Alternations Dynamically to make the replacements using a single regex. I have left out the quotemeta and sort steps only because I know for certain that all keys are three-character strings without any special characters. If you have any doubts about the input data, put those steps back in!

        # build the regex, this only needs to be done once
        my ($genetic_regex) = map qr/$_/, join '|', keys %genetic_code;
        # apply the regex
        (my $amino_acid = $seq) =~ s/($genetic_regex)/$genetic_code{$1}/g;
        return $amino_acid;

    However, note that this produces slightly different output for the first input: "MPVC" (the leftover C remains unchanged). Whether you want this behavior is up to you; it can also be accomplished in the first two solutions (although slightly less elegantly than with a regex). Update: Also, in the first two solutions you haven't defined what should happen if a codon is not in the table; the third (regex) solution would simply leave it unchanged. Also minor edits for clarification.
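One possible way to keep the leftover bases in the substr-based loop is sketched below; this is not the original poster's code. The helper name `translate_keep_tail` and the reduced table are assumptions for illustration, and the `//` operator needs Perl 5.10 or later. Unlike the alternation regex, this version never leaves the reading frame.

```perl
use strict;
use warnings;

# Subset of the codon table, just enough for the thread's two examples.
my %GENETIC_CODE = (
    ATG => 'M', CCC => 'P', GTA => 'V',
    GCT => 'A', TCC => 'S', CAG => 'Q', CGC => 'R',
);

# Hypothetical helper: translate whole codons, then append leftover bases as-is.
sub translate_keep_tail {
    my ($seq) = @_;
    my $protein = '';
    my $full = length($seq) - length($seq) % 3;   # length covered by whole codons
    for (my $i = 0; $i < $full; $i += 3) {
        my $codon = substr $seq, $i, 3;
        # Codons missing from the table pass through unchanged.
        $protein .= $GENETIC_CODE{$codon} // $codon;
    }
    return $protein . substr($seq, $full);        # the tail may be empty
}

print translate_keep_tail('ATGCCCGTAC'), "\n";    # MPVC
print translate_keep_tail('GCTTCCCAGCGC'), "\n";  # ASQR
```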

      Good point. If a nucleotide triplet with an unknown nucleotide appears (ex. ANC instead of ATC), I'd want to either skip those, or mark them with a letter like 'X'.

      I do like the regex solution though, it's quite elegant.

        If a nucleotide triplet with an unknown nucleotide appears (ex. ANC instead of ATC), I'd want to either skip those, or mark them with a letter like 'X'.

        In the first two solutions, you can use exists, e.g.:

            if ( exists $genetic_code{$codon} ) {
                $amino_acid .= $genetic_code{$codon};
            }
            else {
                $amino_acid .= $codon;
                # - OR -
                $amino_acid .= 'X'; # or something else...
            }

        Update: Or, written more tersely, either $amino_acid .= exists $genetic_code{$codon} ? $genetic_code{$codon} : 'X'; or $amino_acid .= $genetic_code{$codon} // 'X'; (the former uses the Conditional Operator, and the latter uses Logical Defined Or instead of exists, assuming you don't have any undef values in your hash).
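Put together, the two options mentioned in the question (mark unknown codons with 'X', or skip them) might look like the sketch below. The sub name `translate_marking`, its second parameter, and the reduced table are hypothetical, and `//` requires Perl 5.10 or later.

```perl
use strict;
use warnings;

my %GENETIC_CODE = ( ATG => 'M', ATC => 'I', GTA => 'V' );   # subset only

# Hypothetical helper: $unknown is appended for codons not in the table.
# Pass '' to skip such codons; the default is 'X'.
sub translate_marking {
    my ($seq, $unknown) = @_;
    $unknown //= 'X';
    my $protein = '';
    for (my $i = 0; $i <= length($seq) - 3; $i += 3) {
        my $codon = substr $seq, $i, 3;
        $protein .= $GENETIC_CODE{$codon} // $unknown;
    }
    return $protein;
}

print translate_marking('ATGANCGTA'), "\n";       # MXV
print translate_marking('ATGANCGTA', ''), "\n";   # MV
```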

        I do like the regex solution though, it's quite elegant.

        You can combine my second and third suggestions (for nonexistent codes, this uses the defined-or solution I showed here, the exists solution would work as well):

            (my $amino_acid = $seq) =~ s{(...)} { $genetic_code{$1} // 'X' }esg;
            return $amino_acid;
Re: Translation Substring Error
by toolic (Bishop) on Nov 09, 2017 at 15:49 UTC
    The reason for the "substr outside of string" warning is that you assign the empty string to the $seqarray variable and never assign it any other value. You are likely getting confused because you use the same name for two variables (an array and a scalar): $seqarray is a different variable from @seqarray. If you can specify what you want for output, you will get more specific help.


      Basically, I was just trying to get a string for the translated amino acids:

      Example: MLVG

      If I have sequence like this: ATGGCGA, then I'd just like the translation: MA. The "A" from the end of "ATGGCGA" can be ignored/not output.
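The behaviour described here (translate whole codons, drop a trailing partial one) can also be sketched with unpack. This is an alternative technique, not one proposed in the thread; the sub name `one_frame` and the reduced table are assumptions, and unknown codons are marked 'X' purely as an example.

```perl
use strict;
use warnings;

my %GENETIC_CODE = ( ATG => 'M', GCG => 'A', CCC => 'P', GTA => 'V' );   # subset

# Hypothetical helper built on unpack rather than a C-style loop.
sub one_frame {
    my ($seq) = @_;
    # '(A3)*' cuts the string into 3-character chunks; the last chunk may be
    # shorter, so incomplete codons are filtered out.
    my @codons = grep { length($_) == 3 } unpack '(A3)*', $seq;
    return join '', map { $GENETIC_CODE{$_} // 'X' } @codons;
}

print one_frame('ATGGCGA'), "\n";      # MA
print one_frame('ATGCCCGTAC'), "\n";   # MPV
```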

        This is what I'm getting with your program modified as in my earlier post below:
            $ perl dna.pl
            Sequence 1 Translation:MPV
            Sequence 2 Translation:ASQR
Re: Translation Substring Error
by Laurent_R (Canon) on Nov 09, 2017 at 16:01 UTC
    Hi,

    Try this:

        use strict;
        use warnings;

        my $s1 = 'ATGCCCGTAC';   ## Sequence 1
        my $s2 = 'GCTTCCCAGCGC'; ## Sequence 2

        print "Sequence 1 Translation:";
        my $amino_acid = OneFrameTranslation ($s1); ## Calls subroutine
        print "$amino_acid\n";

        print "Sequence 2 Translation:";
        $amino_acid = OneFrameTranslation ($s2); ## Calls subroutine
        print "$amino_acid\n";

        ### Subroutine ###
        sub OneFrameTranslation
        {
            my ($seq) = shift;
            my $amino_acid='';
            my $seqarray='';
            my %genetic_code = (
                'TTT' => 'F', 'TTC' => 'F', 'TTA' => 'L', 'TTG' => 'L',
                'CTT' => 'L', 'CTC' => 'L', 'CTA' => 'L', 'CTG' => 'L',
                'ATT' => 'I', 'ATC' => 'I', 'ATA' => 'I', 'ATG' => 'M',
                'GTT' => 'V', 'GTC' => 'V', 'GTA' => 'V', 'GTG' => 'V',
                'TCT' => 'S', 'TCC' => 'S', 'TCA' => 'S', 'TCG' => 'S',
                'CCT' => 'P', 'CCC' => 'P', 'CCA' => 'P', 'CCG' => 'P',
                'ACT' => 'T', 'ACC' => 'T', 'ACA' => 'T', 'ACG' => 'T',
                'GCT' => 'A', 'GCC' => 'A', 'GCA' => 'A', 'GCG' => 'A',
                'TAT' => 'Y', 'TAC' => 'Y', 'TAA' => '*', 'TAG' => '*',
                'CAT' => 'H', 'CAC' => 'H', 'CAA' => 'Q', 'CAG' => 'Q',
                'AAT' => 'N', 'AAC' => 'N', 'AAA' => 'K', 'AAG' => 'K',
                'GAT' => 'D', 'GAC' => 'D', 'GAA' => 'E', 'GAG' => 'E',
                'TGT' => 'C', 'TGC' => 'C', 'TGA' => '*', 'TGG' => 'W',
                'CGT' => 'R', 'CGC' => 'R', 'CGA' => 'R', 'CGG' => 'R',
                'AGT' => 'S', 'AGC' => 'S', 'AGA' => 'R', 'AGG' => 'R',
                'GGT' => 'G', 'GGC' => 'G', 'GGA' => 'G', 'GGG' => 'G'
            );
            ## '---' = 3 character codon in hash above
            ## '-' = one letter amino acid abbreviation in hash above

            my @seqarray = split(//,$seq); ## Explodes the string

            for (my $i=0; $i<=$#seqarray-2; $i=$i+3)
            {
                my $codon = substr($seq,$i,3);
                $amino_acid .= $genetic_code{$codon};
            }
            return ($amino_acid);
        }

    The main errors in your code are that $seqarray is never assigned anything other than the empty string (note that it is a different variable from @seqarray) and that you don't use the return value from your subroutine.

    Update: haukex and toolic were faster than me. Also note I only made the minimal changes, you don't really need to create @seqarray, since you're not really using it (except in the $i<=$#seqarray-2 for loop termination clause where you could simply use the length of the sequence).

      This works great, thank you.

      Yeah, I see the error with $seqarray - I'll try to use more dynamic variable names to minimize confusion next time.

Re: Translation Substring Error
by kcott (Archbishop) on Nov 11, 2017 at 07:52 UTC

    G'day FIJI42,

    I wrote in "Re: Identifying Overlapping Matches in Nucleotide Sequence":

    "Biological data are typically huge. For reasons of efficiency, when dealing with this type of data, you should choose a fast solution over a slower one. Perl's string handling functions ... are measurably faster than regexes ..."

    Here's a solution that uses the string handling functions length and substr (no regexes are used at all):

        #!/usr/bin/env perl -l

        use strict;
        use warnings;

        my @dna_seqs = qw{ATGCCCGTAC GCTTCCCAGCGC};

        print "$_ => ", dna_prot_map($_) for @dna_seqs;

        {
            my %code;
            BEGIN { %code = qw{ATG M CCC P GTA V GCT A TCC S CAG Q CGC R} }

            sub dna_prot_map {
                join '', map $code{substr $_[0], $_*3, 3}, 0..length($_[0])/3-1
            }
        }

    Output:

        ATGCCCGTAC => MPV
        GCTTCCCAGCGC => ASQR

    Notes:

    My %code is just a subset of your %genetic_code: it only has the data required for your example sequences. You will still need all the data; you can save yourself some typing by omitting the 128 single quotes around all the keys.

    You can use state within your subroutine (if you're using Perl version 5.10 or higher); although, be aware that limits the scope. I often find that when I write code like:

        sub f {
            state $static_var = ...
            ... do something with $static_var here ...
        }

    instead of like:

        {
            my $static_var;
            BEGIN { $static_var = ... }
            sub f {
                ... do something with $static_var here ...
            }
        }

    I subsequently find I need to share $static_var with another routine. This requires a major rewrite which ends up looking very much like the version with BEGIN:

        {
            my $static_var;
            BEGIN { $static_var = ... }
            sub f {
                ... do something with $static_var here ...
            }
            sub g {
                ... do something with $static_var here ...
            }
        }

    Just having to add 'sub g { ... }' to existing code is a lot less work and a lot less error-prone.

    How you choose to do it is up to you: I'm only providing advice of possible pitfalls based on my experience.
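A runnable sketch of the shared-lexical pattern described above, applied to the codon table. The second routine, `is_known_codon`, is hypothetical and exists only to show two subs closing over the same private hash; the table is a subset for illustration.

```perl
use strict;
use warnings;

{
    my %code;
    # BEGIN fills the table at compile time, before any call can happen.
    BEGIN { %code = qw{ATG M CCC P GTA V} }   # subset, for illustration only

    sub dna_prot_map {
        my ($seq) = @_;
        my $codons = int( length($seq) / 3 );   # whole codons only
        return join '', map { $code{ substr $seq, $_ * 3, 3 } } 0 .. $codons - 1;
    }

    # Hypothetical second routine sharing the same private table.
    sub is_known_codon {
        my ($codon) = @_;
        return exists $code{$codon} ? 1 : 0;
    }
}

print dna_prot_map('ATGCCCGTAC'), "\n";                 # MPV
print is_known_codon('ANC') ? "known\n" : "unknown\n";  # unknown
```

Adding a third routine later only means placing another sub inside the same block; the table itself stays invisible to the rest of the program.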

    — Ken

Re: Translation Substring Error
by johngg (Canon) on Nov 09, 2017 at 23:10 UTC

    This is not addressing the problem you were having, rather it is a suggestion for a simpler way of initialising your %genetic_code hash that would save some typing. The glob function can be used to generate combinations of letters. Your hash contains 64 keys which are all possible 3-character combinations of A, C, G and T. These can be generated using glob like this ...

        johngg@shiraz:~/perl/Monks > perl -E 'say for glob q{{A,C,G,T}} x 3'
        AAA
        AAC
        AAG
        AAT
        ACA
        ACC
        ACG
        ACT
        ...
        TGA
        TGC
        TGG
        TGT
        TTA
        TTC
        TTG
        TTT

    Arranging the corresponding amino acid letters in an array allows us to map keys (genetic codes) and values (amino acids) shift'ed from the array together to create the hash lookup.

        my %genetic_code = do {
            my @amino_acids = qw{
                K N K N T T T T R S R S I I M I
                Q H Q H P P P P R R R R L L L L
                E D E D A A A A G G G G V V V V
                * Y * Y S S S S * C W C L F L F
            };
            map { $_ => shift @amino_acids } glob q{{A,C,G,T}} x 3;
        };
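Because this construction depends on the order of the amino-acid list matching the order glob produces, a quick sanity check may be worthwhile. The sketch below rebuilds the table the same way and spot-checks it; the particular checks and the `@stops` variable are suggestions, not part of the original post.

```perl
use strict;
use warnings;

my %genetic_code = do {
    my @amino_acids = qw{
        K N K N T T T T R S R S I I M I
        Q H Q H P P P P R R R R L L L L
        E D E D A A A A G G G G V V V V
        * Y * Y S S S S * C W C L F L F
    };
    # glob expands the three brace groups into all 64 codons, AAA .. TTT.
    map { $_ => shift @amino_acids } glob( '{A,C,G,T}' x 3 );
};

# Spot checks: table size, the start codon, and the three stop codons.
my @stops = sort grep { $genetic_code{$_} eq '*' } keys %genetic_code;

print scalar( keys %genetic_code ), " codons\n";   # 64 codons
print "ATG => $genetic_code{ATG}\n";               # ATG => M
print "stops: @stops\n";                           # stops: TAA TAG TGA
```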

    I hope this is of interest.

    Cheers,

    JohnGG

Re: Translation Substring Error
by Ultimatt (Acolyte) on Nov 14, 2017 at 13:53 UTC
    Unless this is homework/fun/practice, why not use BioPerl and Bio::Seq? For example, you aren't handling all the confusion characters that might be seen, you only support a single translation table, etc. If this is some academic work you're better off avoiding that particular chunk of maintenance. About the only good thing in BioPerl you can rely on is the data rep and IO code. I wouldn't trust most of the stats or other more algorithmic stuff, but all the basic format code is almost certainly more feature complete than what you will roll yourself.
