Hello fellow monks, I hope you are all having a good day. I coded up a short script to translate an RNA sequence (consisting of the letters A, U, G and C) into an amino acid sequence (written with 20 different letters).
There is a "genetic code" that converts each group of three letters in the RNA sequence (a "codon") to an amino acid (represented as one letter). This code can be found HERE. My script is the following:
use strict;
use warnings;

open READER, '<', 'rna.txt';
chomp(my $rna = <READER>);
close READER;

my %gencode;
open GENCODE, '<', 'code.txt';
while (<GENCODE> =~ m/^([AUGC]{3}) (\w)?$/m) {
    my $codon = $1;
    my $aa;
    if (defined $2) {
        $aa = $2;
    }
    else {
        # $aa = 'STOP';
        $aa = '';
    }
    $gencode{$codon} = $aa;
}
close GENCODE;

my $protein = $rna;
while ($rna =~ m/([AUGC]{3})/g) {
    my $codon = $1;
    my $aa = $gencode{$codon};
    $protein =~ s/$codon/$aa/;
}

open RESULTS, '>', 'results.txt';
print RESULTS "$protein\n";
close RESULTS;
$rna holds a large RNA string (found here, if you are interested).
When running this script, I start off getting the correct sequence, until I reach position 629 in the protein sequence:
...NEWTAWFLNSPAAGPNQCQIVY... # Answer I should get
...NEWTAWFLNSPK PNQCQIVY... # My answer
Note that my answer does not have the space. I have included it in order to make the comparison between both strings easier.
Here is what I think has happened: in RNA, the codon AAG becomes K in a protein sequence. At the same time, A, A and G are each one-letter codes for amino acids in a protein (each of those letters could have originated from a three-letter codon in the original RNA sequence). When Perl was substituting the RNA three-letter codons with protein one-letter codes, it misinterpreted this AAG in the protein sequence as an RNA codon and substituted it again.
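To illustrate the suspected mechanism, here is a minimal sketch that reproduces the effect on a tiny input. The three-entry codon table and the twelve-letter RNA string are made up for the demonstration; they are not the real code.txt or rna.txt. The loop is the same as in the script above.

```perl
use strict;
use warnings;

# Hypothetical three-entry codon table, just enough to trigger the effect.
my %gencode = (GCU => 'A', GGU => 'G', AAG => 'K');

# Codons: GCU GCU GGU AAG, which should translate to "AAGK".
my $rna     = 'GCUGCUGGUAAG';
my $protein = $rna;

while ($rna =~ m/([AUGC]{3})/g) {
    my $codon = $1;
    # The m//g match tracks its position in $rna only. This s/// runs on
    # $protein and always replaces the leftmost match there, whether or
    # not that part of the string has already been translated.
    $protein =~ s/$codon/$gencode{$codon}/;
}

print "$protein\n";    # prints "KAAG", not the expected "AAGK"
```

After the first three codons, $protein is "AAGAAG": a translated "AAG" followed by the still-untranslated codon AAG. The final substitution hits the leftmost "AAG", which is the translated one.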
I am confused because I thought that the Perl while loop keeps track of where in the sequence it has carried out a substitution, but that does not seem to be the case. I know there are alternative ways of solving this problem (and I have solved it using a different method), but I would like to know what is going on with my script. Does anyone know how to prevent Perl from substituting a part of a string that has already been substituted, even if the new text matches the regex? The documentation tells me to use the "c" modifier after the regex, but that does not work.
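For comparison, a minimal sketch of one way to avoid re-substitution: never mix translated and untranslated text in the same string. The codon table below is again a hypothetical four-entry stand-in for code.txt, and '?' as a placeholder for unknown codons is my own assumption. As far as I understand it, the /c modifier only stops pos() from being reset when an m//g match fails, so it has no influence on where a separate s/// starts searching.

```perl
use strict;
use warnings;

# Hypothetical partial codon table; UAA stands in for a stop codon.
my %gencode = (GCU => 'A', GGU => 'G', AAG => 'K', UAA => '');

my $rna = 'GCUGCUGGUAAGUAA';

# Variant 1: append each amino acid to a fresh string, so translated
# letters are never searched again.
my $protein = '';
while ($rna =~ m/([AUGC]{3})/g) {
    $protein .= $gencode{$1} // '?';    # '?' marks a codon missing from the table
}

# Variant 2: a single s///g pass over a copy. With /g the search resumes
# after each replacement, so replaced text is not rescanned.
(my $protein2 = $rna) =~ s{([AUGC]{3})}{ $gencode{$1} // '?' }ge;

print "$protein\n";     # prints "AAGK"
print "$protein2\n";    # prints "AAGK"
```

Both variants read the codons from left to right exactly once, which is the position tracking the original loop seemed to promise but only applied to $rna.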