Hello fellow monks, I hope you are all having a good day. I wrote a short script to translate an RNA sequence (consisting of the letters A, U, G and C) into an amino acid sequence (written with 20 different one-letter codes).

There is a "genetic code" that can convert three letters from the RNA sequence (a "codon") to an amino acid (represented as one letter). This code can be found HERE. My script is the following:

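use strict;
use warnings;

# Read the RNA sequence (a single line) from rna.txt
open READER, '<', 'rna.txt' or die "Cannot open rna.txt: $!";
chomp(my $rna = <READER>);
close READER;

my %gencode;

# Build the codon => amino acid table from code.txt
open GENCODE, '<', 'code.txt' or die "Cannot open code.txt: $!";
while (my $line = <GENCODE>) {
    next unless $line =~ m/^([AUGC]{3}) (\w)?$/;
    my $codon = $1;
    # A codon with no letter after it is a stop codon
    # (could be 'STOP' instead of the empty string)
    my $aa = defined $2 ? $2 : '';
    $gencode{$codon} = $aa;
}
close GENCODE;

my $protein = $rna;

# Match codons in $rna, substitute each one in $protein
while ($rna =~ m/([AUGC]{3})/g) {
    my $codon = $1;
    my $aa    = $gencode{$codon};
    $protein =~ s/$codon/$aa/;
}

open RESULTS, '>', 'results.txt' or die "Cannot open results.txt: $!";
print RESULTS "$protein\n";
close RESULTS;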

$rna makes reference to a large RNA string (found here, if you are interested).

When running this script, I start off getting the correct sequence, until I reach position 629 in the protein sequence:

...NEWTAWFLNSPAAGPNQCQIVY...   # Answer I should get

...NEWTAWFLNSPK  PNQCQIVY...   # My answer

Note that my answer does not have the space. I have included it in order to make the comparison between both strings easier.

Here is what I think has happened: in RNA, the codon AAG becomes K in a protein sequence. At the same time, A, A and G are each one-letter codes for amino acids in a protein (each of those letters could have come from a three-letter codon in the original RNA sequence). When Perl was substituting the three-letter RNA codons with one-letter protein codes, it misinterpreted this AAG stretch of the protein sequence as an RNA codon and substituted it again.
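To check that explanation, the effect can be reproduced on a tiny input. The following is only a sketch: the four-entry `%gencode` table and the twelve-letter `$rna` string are made up for illustration and are not taken from the real files. It shows that the position remembered by `m//g` belongs to `$rna`, while each `s///` on `$protein` starts searching from the beginning of `$protein` again.

```perl
use strict;
use warnings;

# Hypothetical mini code table and input, chosen so that the translated
# amino acids spell "AAG" and a real AAG codon follows later.
my %gencode = (GCU => 'A', GCA => 'A', GGU => 'G', AAG => 'K');
my $rna     = 'GCUGCAGGUAAG';    # codons: GCU GCA GGU AAG

my $protein = $rna;
my @steps;

# Same loop shape as in the script: m//g advances pos($rna) only.
# Each s/// on $protein has no memory of earlier substitutions and
# replaces the FIRST occurrence of the codon, wherever it is.
while ($rna =~ m/([AUGC]{3})/g) {
    my $codon = $1;
    $protein =~ s/$codon/$gencode{$codon}/;
    push @steps, "$codon: $protein";
}

print "$_\n" for @steps;
print "final: $protein\n";
```

After the third codon the string reads `AAGAAG`: the first three letters are already amino acids, the last three are the untranslated codon. The fourth substitution then hits the leading `AAG`, so the result is `KAAG` rather than `AAGK`.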

I am confused because I thought that the Perl while loop keeps track of where in the sequence it has carried out a substitution, but that does not seem to be the case. I know there are alternative ways of solving this problem (I have already solved it using a different method), but I would like to know what is going on with my script. Does anyone know how to prevent Perl from substituting a part of a string that has already been substituted, even if the new text matches the regular expression? The documentation tells me to use the "c" modifier after the regex, but that does not work.
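One way to avoid touching already-translated text is to never search the output string at all. The sketch below is not the original script; the `translate` and `translate_subst` helper names and the small `%gencode` table are hypothetical. The first helper appends each amino acid to a fresh string while `\G` keeps the match anchored to the reading frame in `$rna`; the second does the whole job with one `s///g`, which scans the string once from left to right and does not rescan replacement text. As far as I understand it, the `/c` modifier only stops `pos()` from being reset after a failed `m//g` match, so it has no influence on a separate `s///`.

```perl
use strict;
use warnings;

# Hypothetical mini code table; the stop codon UAA maps to the empty
# string, as in the script above.
my %gencode = (GCU => 'A', GCA => 'A', GGU => 'G', AAG => 'K', UAA => '');

# Build the protein as a new string, one codon at a time.
sub translate {
    my ($rna, $code) = @_;
    my $protein = '';
    # \G anchors each match where the previous one ended,
    # so the reading frame cannot slip.
    while ($rna =~ /\G([AUGC]{3})/g) {
        my $aa = $code->{$1};
        die "Unknown codon $1\n" unless defined $aa;
        $protein .= $aa;
    }
    return $protein;
}

# Alternative: a single global substitution on a copy of the input.
sub translate_subst {
    my ($rna, $code) = @_;
    (my $protein = $rna) =~ s/([AUGC]{3})/$code->{$1}/g;
    return $protein;
}

print translate('GCUGCAGGUAAG', \%gencode), "\n";
print translate_subst('GCUGCAGGUAAG', \%gencode), "\n";
```

Both calls print `AAGK` for this input, because the translated letters are never matched against the codon pattern a second time.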

In reply to [SOLVED] Keeping track of substituted regions with a while loop by enderk
