Consider a function that, given an RNA sequence string, returns a string
representing the corresponding amino acids.
RNA is represented as a string of the letters A, C, G, and U,
representing the bases Adenine, Cytosine, Guanine,
and Uracil respectively. This differs from DNA in that
Uracil replaces Thymine, which is why the alphabet is ACGU instead
of the familiar ACGT (e.g. 'GATTACA').
Each amino acid is also
represented by a single letter.
As an example, the string 'UUCGAACACUGAG' would be transformed into 'FEH.' and returned.
RNA works such that each group of three "letters" (i.e. bases), known as a codon, corresponds to a particular amino acid, or to the STOP sequence, which is represented here as a period. If there are one or two extra letters at the end of the sequence, they should be ignored. All input to the function is assumed to contain only the letters A, C, G, and U, and nothing else, though the number of characters may be arbitrary.
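The grouping rule on its own, before any translation, can be sketched as follows. This is only an illustration; `codons` is a hypothetical helper name that does not appear in the challenge.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): split an RNA
# string into complete three-letter groups, dropping leftover letters.
sub codons {
    my ($rna) = @_;
    # In list context, a global match without capture groups returns
    # every complete match; a trailing one or two letters never match.
    return $rna =~ /.../g;
}

print join(' ', codons('UUCGAACACUGAG')), "\n";   # prints "UUC GAA CAC UGA"
```

The final 'G' of the example string is dropped, which is why the example yields four output characters from thirteen input letters.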
Below is a reference implementation that is not optimized, and includes comments for the curious:
sub f {
    my %g = (
        # . - Stop
        'UAA'=>'.','UAG'=>'.','UGA'=>'.',
        # A - Alanine
        'GCU'=>'A','GCC'=>'A','GCA'=>'A','GCG'=>'A',
        # C - Cysteine
        'UGU'=>'C','UGC'=>'C',
        # D - Aspartic Acid
        'GAU'=>'D','GAC'=>'D',
        # E - Glutamic Acid
        'GAA'=>'E','GAG'=>'E',
        # F - Phenylalanine
        'UUU'=>'F','UUC'=>'F',
        # G - Glycine
        'GGU'=>'G','GGC'=>'G','GGA'=>'G','GGG'=>'G',
        # H - Histidine
        'CAU'=>'H','CAC'=>'H',
        # I - Isoleucine
        'AUU'=>'I','AUC'=>'I','AUA'=>'I',
        # K - Lysine
        'AAA'=>'K','AAG'=>'K',
        # L - Leucine
        'CUU'=>'L','CUC'=>'L','CUA'=>'L','CUG'=>'L',
        'UUA'=>'L','UUG'=>'L',
        # M - Methionine
        'AUG'=>'M',
        # N - Asparagine
        'AAU'=>'N','AAC'=>'N',
        # P - Proline
        'CCU'=>'P','CCC'=>'P','CCA'=>'P','CCG'=>'P',
        # Q - Glutamine
        'CAA'=>'Q','CAG'=>'Q',
        # R - Arginine
        'CGU'=>'R','CGC'=>'R','CGA'=>'R','CGG'=>'R',
        'AGA'=>'R','AGG'=>'R',
        # S - Serine
        'UCU'=>'S','UCC'=>'S','UCA'=>'S','UCG'=>'S',
        'AGU'=>'S','AGC'=>'S',
        # T - Threonine
        'ACU'=>'T','ACC'=>'T','ACA'=>'T','ACG'=>'T',
        # V - Valine
        'GUU'=>'V','GUC'=>'V','GUA'=>'V','GUG'=>'V',
        # W - Tryptophan
        'UGG'=>'W',
        # Y - Tyrosine
        'UAU'=>'Y','UAC'=>'Y',
    );
    # Replace each codon with its letter. A trailing group of one or two
    # letters also matches, has no table entry, and so becomes empty.
    $_=pop;s/.{1,3}/$g{$&}/g;$_
}
print f("ACCCACAUUUCAUAAAUAUCCCCUGAGCGGCUCUGAGGGCAACUGUUCUAAUC");

Interesting Links: Genetic Code, Golf challenge: match U.S. State names
Update: Typo in the example 'GAG'->'CAC' fixed.
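For anyone looking for a starting point, here is a sketch of one possible variation on the reference implementation. It is not a golfed answer. It assumes the 64 codons are enumerated with the bases in the order U, C, A, G at each position, so the whole table fits in a single 64-character string. The names `translate`, `$aa`, and `@b` are made up for this sketch. It also stays quiet under `use warnings`, whereas the reference version looks up leftover letters that have no table entry.

```perl
use strict;
use warnings;

# Amino acids for all 64 codons, with bases ordered U, C, A, G at each
# position (UUU, UUC, UUA, UUG, UCU, ...). '.' marks a STOP codon.
my $aa = 'FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

# Build the same codon => letter hash as the reference implementation.
my @b = qw(U C A G);
my %g;
my $i = 0;
for my $x (@b) {
    for my $y (@b) {
        for my $z (@b) {
            $g{"$x$y$z"} = substr $aa, $i++, 1;
        }
    }
}

# translate() is a hypothetical name chosen for this sketch.
sub translate {
    my ($rna) = @_;
    # Only complete three-letter groups match, so a trailing one or two
    # letters are ignored without ever being looked up in %g.
    return join '', map { $g{$_} } $rna =~ /.../g;
}

print translate('UUCGAACACUGAG'), "\n";   # prints "FEH."
```

The nested loops trade the readable per-amino-acid comments of the reference table for brevity, which may or may not be a worthwhile trade depending on how the challenge is scored.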