dmunoze has asked for the wisdom of the Perl Monks concerning the following question:

Good morning. First of all, I'd like to say that English is not my native language, so I apologize for any misunderstanding or spelling mistake. I have the code you'll see below, and I want to store the sequences from three different ORFs (reading frames; "marcos de lectura" in Spanish). I don't know how to do that, but I do know where my mistake is: the variable $protein only stores the last sequence of the loop, i.e., the third ORF. I need all of them on the screen. P.S.: The code is very long because, first, it's the genetic code of a mitochondrion and, second, I got it that way. I'd appreciate any kind of help. Thanks in advance.

for ($c=0; $c < 3; $c=$c+1){
    for ($position=$c; $position < length $DNA; $position=$position+1)
    {
        $codon=substr($DNA, $position, 3);
        #print "$codon\n";
        if($codon eq 'TTT' or $codon eq 'TTC'){
            #print "el codon $codon codifica para Fenilalanina";
            $position=$position+2;
            $protein=$protein."F";
        }
        elsif($codon eq 'GGT' or $codon eq 'GGC'or $codon eq 'GGA'or $codon eq 'GGG'){
            #print "el codon $codon codifica para Glicina";
            $position=$position+2;
            $protein=$protein."G";
        }
        elsif($codon eq 'GCT' or $codon eq 'GCC'or $codon eq 'GCA'or $codon eq 'GCG'){
            #print "el codon $codon codifica para Alanina";
            $position=$position+2;
            $protein=$protein."A";
        }
        elsif($codon eq 'TTA' or $codon eq 'TTG' or $codon eq 'CTT' or $codon eq 'CTC' or $codon eq 'CTG' or $codon eq 'CTA'){
            #print "el codon $codon codifica para Leucina";
            $position=$position+2;
            $protein=$protein."L";
        }
        elsif($codon eq 'GTT' or $codon eq 'GTC'or $codon eq 'GTA'or $codon eq 'GTG'){
            #print "el codon $codon codifica para Valina";
            $position=$position+2;
            $protein=$protein."V";
        }
        elsif($codon eq 'ATT' or $codon eq 'ATC'or $codon eq 'ATA'){
            #print "el codon $codon codifica para Isoleucina";
            $position=$position+2;
            $protein=$protein."I";
        }
        elsif($codon eq 'CCT' or $codon eq 'CCC'or $codon eq 'CCA'or $codon eq 'CCG'){
            #print "el codon $codon codifica para Prolina";
            $position=$position+2;
            $protein=$protein."P";
        }
        elsif($codon eq 'TCT' or $codon eq 'TCC'or $codon eq 'TCA'or $codon eq 'TCG'){
            #print "el codon $codon codifica para Serina";
            $position=$position+2;
            $protein=$protein."S";
        }
        elsif($codon eq 'ACT' or $codon eq 'ACC'or $codon eq 'ACA'or $codon eq 'ACG'){
            #print "el codon $codon codifica para Treonina";
            $position=$position+2;
            $protein=$protein."T";
        }
        elsif($codon eq 'TGT' or $codon eq 'TGC'){
            #print "el codon $codon codifica para Cisteina";
            $position=$position+2;
            $protein=$protein."C";
        }
        elsif($codon eq 'TAT' or $codon eq 'TAC'){
            #print "el codon $codon codifica para Tirosina";
            $position=$position+2;
            $protein=$protein."Y";
        }
        elsif($codon eq 'AAT' or $codon eq 'AAC'){
            #print "el codon $codon codifica para Asparagina";
            $position=$position+2;
            $protein=$protein."N";
        }
        elsif($codon eq 'CAA' or $codon eq 'CAG'){
            #print "el codon $codon codifica para Glutamina";
            $position=$position+2;
            $protein=$protein."Q";
        }
        elsif($codon eq 'GAT' or $codon eq 'GAC'){
            #print "el codon $codon codifica para Ácido aspártico";
            $position=$position+2;
            $protein=$protein."D";
        }
        elsif($codon eq 'GAA' or $codon eq 'GAG'){
            #print "el codon $codon codifica para Ácido glutámico";
            $position=$position+2;
            $protein=$protein."E";
        }
        elsif($codon eq 'CGT' or $codon eq 'CGC' or $codon eq 'CGA' or $codon eq 'CGG' or $codon eq 'AGA' or $codon eq 'AGG'){
            #print "el codon $codon codifica para Arginina";
            $position=$position+2;
            $protein=$protein."R";
        }
        elsif($codon eq 'AAA' or $codon eq 'AAG'){
            #print "el codon $codon codifica para Lisina";
            $position=$position+2;
            $protein=$protein."K";
        }
        elsif($codon eq 'CAT' or $codon eq 'CAC'){
            #print "el codon $codon codifica para Histidina";
            $position=$position+2;
            $protein=$protein."H";
        }
        elsif($codon eq 'TGG' or $codon eq 'TGA' ){
            #print "el codon $codon codifica para Triptofano";
            $position=$position+2;
            $protein=$protein."W";
        }
        elsif($codon eq 'ATG'){
            #print "el codon $codon codifica para Metionina";
            $position=$position+2;
            $protein=$protein."M";
            #++$M;
        }
        elsif($codon eq 'TAA' or $codon eq 'TAG'){
            #print "el codon $codon codifica para STOP";
            $position=$position+2;
            $protein=$protein."8";
            #++$ZZ;
        }
        else {
            #print "codon $codon no es reconocido \n";
            $protein=$protein."x";
            #++$x;
        }
    }
    @marcos=<,$protein>;
    foreach $mar(@marcos){
        print "La secuencia ",$c," es: ",$mar,"\n\n";
    }
}

Replies are listed 'Best First'.
Re: Storage of proteins from diferent ORF
by kennethk (Abbot) on Mar 31, 2014 at 16:53 UTC

    Welcome to the monastery.

    What do you expect the line @marcos=<,$protein>; to do? Perl will read that as a glob, but since its content is fixed (no wildcards), it is equivalent to @marcos=",$protein";, which will always yield a single value. If I run your code with $DNA equal to AGAAGAAGA, I get the output

    La secuencia 0 es: ,RRR
    La secuencia 1 es: ,RRREExx
    La secuencia 2 es: ,RRREExxKKx

    because your $protein variable persists across loops. When you are describing an issue, it's important to include sample inputs and desired outputs (as described in How do I post a question effectively?) so that language issues don't get in the way of understanding. If the output you were going for is more like:

    La secuencia 0 es: RRR
    La secuencia 1 es: EExx
    La secuencia 2 es: KKx

    Then your code should probably look more like:
    use strict;
    use warnings;

    chomp(my $DNA = <>);

    my %acid_map = (
        TTT => 'F', TTC => 'F',
        GGT => 'G', GGC => 'G', GGA => 'G', GGG => 'G',
        GCT => 'A', GCC => 'A', GCA => 'A', GCG => 'A',
        TTA => 'L', TTG => 'L', CTT => 'L', CTC => 'L', CTG => 'L', CTA => 'L',
        GTT => 'V', GTC => 'V', GTA => 'V', GTG => 'V',
        ATT => 'I', ATC => 'I', ATA => 'I',
        CCT => 'P', CCC => 'P', CCA => 'P', CCG => 'P',
        TCT => 'S', TCC => 'S', TCA => 'S', TCG => 'S',
        ACT => 'T', ACC => 'T', ACA => 'T', ACG => 'T',
        TGT => 'C', TGC => 'C',
        TAT => 'Y', TAC => 'Y',
        AAT => 'N', AAC => 'N',
        CAA => 'Q', CAG => 'Q',
        GAT => 'D', GAC => 'D',
        GAA => 'E', GAG => 'E',
        CGT => 'R', CGC => 'R', CGA => 'R', CGG => 'R', AGA => 'R', AGG => 'R',
        AAA => 'K', AAG => 'K',
        CAT => 'H', CAC => 'H',
        TGG => 'W', TGA => 'W',
        ATG => 'M',
        TAA => '8', TAG => '8',
    );

    foreach my $c (0 .. 2){
        my $protein = '';
        my $position=$c;
        while ($position < length $DNA) {
            my $codon=substr($DNA, $position, 3);
            if ($acid_map{$codon}) {
                $position += 3;
                $protein .= $acid_map{$codon};
            }
            else {
                $position++;
                $protein .= 'x';
            }
        }
        print "La secuencia $c es: $protein\n\n";
    }

    where I've made the following changes:
    1. I added strict and warnings; see Use strict warnings and diagnostics or die for some reasons why

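    2. Instead of a long list of if-elsif-elses, I've used a hash. It makes the algorithm more immediately legible and will run faster: a hash lookup takes roughly constant time per codon, whereas the original chain may perform around 60 string comparisons before it finds a match. (Both versions are still linear in the length of the sequence.)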

    3. I used a foreach loop instead of a C-style for loop for $c since you have a fixed series of numbers, and thus no need for complex logic.

    4. I swapped the inner loop to a while loop, since you have variable strides. I also used the opportunity to centralize the strides, so that $position is only changed once per iteration, hopefully improving clarity of intent.

    5. I swapped to compound assignment operators (Assignment Operators) since you always repeated the variable on both sides of the assignment. Less typing means fewer opportunities for typos, and it also makes the code easier to read.

    6. Since you were already using interpolating quotes (Quote and Quote like Operators), I removed the unnecessary splits into a list for your output. I also removed the no-op associated with the glob.

    7. Lastly, I removed the loop from around your print. You could also express this as a push to an array scoped outside the foreach loop, and then printing outside the loop.

    Please review this code, and ask me questions about how it works if anything is unclear.
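
    Point 7 above mentions pushing each result onto an array scoped outside the foreach loop and printing afterwards. A minimal sketch of that variant follows. The sub name translate_frames and the three-entry table are illustrative assumptions (a real run would use the full table), while the stride logic is the same as in the code above.

```perl
use strict;
use warnings;

# Abbreviated table for illustration only; substitute the full codon table.
my %acid_map = ( AGA => 'R', GAA => 'E', AAG => 'K' );

# Hypothetical helper: returns one protein string per reading frame (0, 1, 2).
sub translate_frames {
    my ($dna) = @_;
    my @proteins;                      # scoped outside the frame loop
    foreach my $frame ( 0 .. 2 ) {
        my $protein  = '';             # fresh string for every frame
        my $position = $frame;
        while ( $position < length $dna ) {
            my $codon = substr( $dna, $position, 3 );
            if ( $acid_map{$codon} ) {
                $position += 3;
                $protein .= $acid_map{$codon};
            }
            else {
                $position++;
                $protein .= 'x';
            }
        }
        push @proteins, $protein;      # keep every frame, not just the last
    }
    return @proteins;
}

my @marcos = translate_frames('AGAAGAAGA');
print "La secuencia $_ es: $marcos[$_]\n\n" for 0 .. $#marcos;
```

    With the sample input AGAAGAAGA this should print RRR, EExx and KKx for frames 0, 1 and 2, and all three strings remain available in @marcos for later use.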

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      I had to do some translation from DNA/RNA to amino acids and grabbed this codon table. I found, however, that there are a couple missing amino acids and one or two mistakes. Just want to drop a fixed version here in case people come looking in the future:
      my %codon_table = (
          AAA => 'K', AAC => 'N', AAG => 'K', AAT => 'N',
          ACA => 'T', ACC => 'T', ACG => 'T', ACT => 'T',
          AGA => 'R', AGC => 'S', AGG => 'R', AGT => 'S',
          ATA => 'I', ATC => 'I', ATG => 'M', ATT => 'I',
          CAA => 'Q', CAC => 'H', CAG => 'Q', CAT => 'H',
          CCA => 'P', CCC => 'P', CCG => 'P', CCT => 'P',
          CGA => 'R', CGC => 'R', CGG => 'R', CGT => 'R',
          CTA => 'L', CTC => 'L', CTG => 'L', CTT => 'L',
          GAA => 'E', GAC => 'D', GAG => 'E', GAT => 'D',
          GCA => 'A', GCC => 'A', GCG => 'A', GCT => 'A',
          GGA => 'G', GGC => 'G', GGG => 'G', GGT => 'G',
          GTA => 'V', GTC => 'V', GTG => 'V', GTT => 'V',
          TAA => '-', TAC => 'Y', TAG => '-', TAT => 'Y',
          TCA => 'S', TCC => 'S', TCG => 'S', TCT => 'S',
          TGA => '-', TGC => 'C', TGG => 'W', TGT => 'C',
          TTA => 'L', TTC => 'F', TTG => 'L', TTT => 'F',
      );

      EDIT: Here is one I made that I like more. It is structured like you normally see codon tables in books.

      my %codon_table = (
          TTT => 'F', TCT => 'S', TAT => 'Y', TGT => 'C',
          TTC => 'F', TCC => 'S', TAC => 'Y', TGC => 'C',
          TTA => 'L', TCA => 'S', TAA => '-', TGA => '-',
          TTG => 'L', TCG => 'S', TAG => '-', TGG => 'W',
          CTT => 'L', CCT => 'P', CAT => 'H', CGT => 'R',
          CTC => 'L', CCC => 'P', CAC => 'H', CGC => 'R',
          CTA => 'L', CCA => 'P', CAA => 'Q', CGA => 'R',
          CTG => 'L', CCG => 'P', CAG => 'Q', CGG => 'R',
          ATT => 'I', ACT => 'T', AAT => 'N', AGT => 'S',
          ATC => 'I', ACC => 'T', AAC => 'N', AGC => 'S',
          ATA => 'I', ACA => 'T', AAA => 'K', AGA => 'R',
          ATG => 'M', ACG => 'T', AAG => 'K', AGG => 'R',
          GTT => 'V', GCT => 'A', GAT => 'D', GGT => 'G',
          GTC => 'V', GCC => 'A', GAC => 'D', GGC => 'G',
          GTA => 'V', GCA => 'A', GAA => 'E', GGA => 'G',
          GTG => 'V', GCG => 'A', GAG => 'E', GGG => 'G',
      );
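
      Since missing entries are easy to overlook in a hand-typed table, here is a rough sketch of a completeness check. The sub names all_codons and missing_codons are hypothetical, and the 64-letter string is simply a compact way to rebuild the table above (stops written as '-', bases in T, C, A, G order). It would need adjusting for a different genetic code, such as the mitochondrial one in the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: all 64 codons, in T, C, A, G order for each position.
sub all_codons {
    my @bases = qw(T C A G);
    my @codons;
    for my $first (@bases) {
        for my $second (@bases) {
            for my $third (@bases) {
                push @codons, "$first$second$third";
            }
        }
    }
    return @codons;
}

# Hypothetical helper: codons that have no entry in the given table (hash ref).
sub missing_codons {
    my ($table) = @_;
    return grep { !exists $table->{$_} } all_codons();
}

# One amino-acid letter per codon, in the same order as all_codons().
my $amino_acids =
    'FFLLSSSSYY--CC-WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %full_table;
@full_table{ all_codons() } = split //, $amino_acids;

# Simulate a table where two serine codons were left out.
my %incomplete = %full_table;
delete @incomplete{qw(AGT AGC)};

my @missing = missing_codons( \%incomplete );
print "Missing codons: @missing\n";    # prints "Missing codons: AGT AGC"
```

      Running missing_codons against any hand-written table before using it is a cheap way to catch gaps like these.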
Re: Storage of proteins from diferent ORF
by frozenwithjoy (Priest) on Mar 31, 2014 at 16:35 UTC

    Edit: Looks like kennethk also rewrote your code in almost the same way as I was adding my last part, so it must be the right way to do it! :P

    Edit2: kennethk pointed out to me that my code is incrementing by three positions when there is an unknown codon, but yours only increments by one position. Is this by design? It seems to me like you would want to move on to the next codon regardless of there being a match or mismatch.
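
    To make the difference between the two stride policies concrete, here is a small sketch. The sub translate_frame, its $unknown_stride parameter and the two-entry table are assumptions made for illustration, and unlike the loops in this thread it only looks at complete codons.

```perl
use strict;
use warnings;

# Abbreviated table for illustration only.
my %codon_table = ( AGA => 'R', GAA => 'E' );

# Hypothetical helper: translate one reading frame. Known codons always
# advance by 3; unknown codons advance by $unknown_stride bases.
sub translate_frame {
    my ( $dna, $frame, $unknown_stride ) = @_;
    my $protein  = '';
    my $position = $frame;
    while ( $position <= length($dna) - 3 ) {    # complete codons only
        my $amino_acid = $codon_table{ substr( $dna, $position, 3 ) };
        if ( defined $amino_acid ) {
            $protein  .= $amino_acid;
            $position += 3;
        }
        else {
            $protein  .= 'x';
            $position += $unknown_stride;
        }
    }
    return $protein;
}

my $dna = 'AGANNAGAAGA';
print translate_frame( $dna, 0, 1 ), "\n";    # RxxRR: shifts into another frame
print translate_frame( $dna, 0, 3 ), "\n";    # RxE:   stays in frame 0
```

    Advancing by one base after an unknown codon lets the translation slide into a different reading frame, whereas advancing by three keeps every codon aligned with the frame it started in.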

    First, your code would be much simpler if you had your codon table in a hash and just looked up the codon in the hash to get the amino acid. Before you entered the loop, you would define the table:

    my %codon_table = (
        'TTT' => 'F', 'TTC' => 'F', 'TTA' => 'L', 'TTG' => 'L',
        'CTT' => 'L', 'CTC' => 'L', 'CTA' => 'L', 'CTG' => 'L',
        # AND SO ON
    );

    Second, I'm not really familiar with this usage:

    @marcos=<,$protein>;

    I think what you want instead of this line is:

    push @marcos, $protein;

    Then if you move your last little loop outside of the main loop, I think it should behave the way you expect. Here is a simplified version of that last loop:

    print "La secuencia $_ es: $marcos[$_]\n\n" for 0 .. $#marcos;
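
    As a quick illustration of why the array never held more than one sequence, the sketch below contrasts the glob-style assignment with push. The sample strings are made up for the demonstration.

```perl
use strict;
use warnings;

my $protein = 'RRR';

# <,$protein> is parsed as glob(",$protein"). With no wildcard characters the
# pattern itself comes back, so the assignment replaces the whole array with
# a single element.
my @from_glob = <,$protein>;
print scalar(@from_glob), " element: $from_glob[0]\n";

# push appends, so the array grows by one entry per reading frame.
my @marcos;
push @marcos, $_ for qw(RRR EExx KKx);
print scalar(@marcos), " elements: @marcos\n";
```

    The first print should report one element (",RRR") and the second three, which is the behaviour you need to keep all reading frames.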

    Third, I suspect that you aren't using use strict; use warnings;... Use these!

    I've combined all of these suggestions and made a few more changes to rewrite your code as:

    # $DNA is assumed to hold the sequence already
    my %codon_table = (
        'TTT' => 'F', 'TTC' => 'F', 'TTA' => 'L', 'TTG' => 'L',
        'CTT' => 'L', 'CTC' => 'L', 'CTA' => 'L', 'CTG' => 'L',
        # AND SO ON
    );

    my @marcos;
    for my $reading_frame ( 0 .. 2 ) {
        my $protein  = '';    # start each reading frame with an empty string
        my $position = $reading_frame;
        while ( $position < length($DNA) - 2 ) {
            my $codon      = substr( $DNA, $position, 3 );
            my $amino_acid = $codon_table{$codon};
            if ( defined $amino_acid ) {
                $protein .= $amino_acid;
            }
            else {
                $protein .= "x";
            }
            $position += 3;
        }
        push @marcos, $protein;
    }

    # $reading_frame is out of scope here, so use the array index instead
    print "La secuencia $_ es: $marcos[$_]\n\n" for 0 .. $#marcos;