(Golf) RNA Genetic Code Translator

Consider a function that, given an RNA sequence string, returns a string representing the corresponding amino acids. RNA is represented as a string of the letters A, C, G, and U, standing for the bases Adenine, Cytosine, Guanine, and Uracil respectively. This differs from DNA in that Uracil replaces Thymine, which is why the alphabet is ACGU instead of the familiar ACGT (e.g. 'GATTACA'). Each amino acid is also represented by a single letter.

As an example, the string 'UUCGAACACUGAG' would be transformed into 'FEH.' and returned.

RNA works such that each group of three "letters" (i.e. bases), known as a codon, corresponds to a particular amino acid or to the STOP sequence, which is represented here as a period. If there are one or two extra letters at the end of the sequence, they should be ignored. All input to the function is assumed to contain only the letters A, C, G, and U, though the number of characters may be arbitrary.
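To make the grouping rule concrete, here is a minimal sketch of just the splitting step: it cuts the input into complete three-letter groups and silently drops one or two leftover letters. The name `codons` is a hypothetical helper for illustration and is not part of the reference implementation.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return the complete
# three-letter groups of the input, discarding 1 or 2 trailing letters.
sub codons {
    my ($rna) = @_;
    # In list context, a /g match with one capture group returns every capture.
    my @groups = $rna =~ /(...)/g;
    return @groups;
}

print join(' ', codons('UUCGAACACUGAG')), "\n";   # UUC GAA CAC UGA
```

The trailing 'G' of the example string does not form a full group, so it never appears in the result.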

Below is a reference implementation that is not optimized, and includes comments for the curious:

sub f {
my %g = (
        # . - Stop
        'UAA'=>'.','UAG'=>'.','UGA'=>'.',
        # A - Alanine
        'GCU'=>'A','GCC'=>'A','GCA'=>'A','GCG'=>'A',
        # C - Cysteine
        'UGU'=>'C','UGC'=>'C',
        # D - Aspartic Acid
        'GAU'=>'D','GAC'=>'D',
        # E - Glutamic Acid
        'GAA'=>'E','GAG'=>'E',
        # F - Phenylalanine
        'UUU'=>'F','UUC'=>'F',
        # G - Glycine
        'GGU'=>'G','GGC'=>'G','GGA'=>'G','GGG'=>'G',
        # H - Histidine
        'CAU'=>'H','CAC'=>'H',
        # I - Isoleucine
        'AUU'=>'I','AUC'=>'I','AUA'=>'I',
        # K - Lysine
        'AAA'=>'K','AAG'=>'K',
        # L - Leucine
        'CUU'=>'L','CUC'=>'L','CUA'=>'L','CUG'=>'L',
        'UUA'=>'L','UUG'=>'L',
        # M - Methionine
        'AUG'=>'M',
        # N - Asparagine
        'AAU'=>'N','AAC'=>'N',
        # P - Proline
        'CCU'=>'P','CCC'=>'P','CCA'=>'P','CCG'=>'P',
        # Q - Glutamine
        'CAA'=>'Q','CAG'=>'Q',
        # R - Arginine
        'CGU'=>'R','CGC'=>'R','CGA'=>'R','CGG'=>'R',
        'AGA'=>'R','AGG'=>'R',
        # S - Serine
        'UCU'=>'S','UCC'=>'S','UCA'=>'S','UCG'=>'S',
        'AGU'=>'S','AGC'=>'S',
        # T - Threonine
        'ACU'=>'T','ACC'=>'T','ACA'=>'T','ACG'=>'T',
        # V - Valine
        'GUU'=>'V','GUC'=>'V','GUA'=>'V','GUG'=>'V',
        # W - Tryptophan
        'UGG'=>'W',
        # Y - Tyrosine
        'UAU'=>'Y','UAC'=>'Y',
);
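# Replace each chunk of up to three letters with its table entry.
# A trailing chunk of one or two letters has no entry, so it is
# replaced with the empty string (this relies on warnings being off).
$_=pop;s/.{1,3}/$g{$&}/g;$_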
}

print f("ACCCACAUUUCAUAAAUAUCCCCUGAGCGGCUCUGAGGGCAACUGUUCUAAUC");
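For readers who want something to check a golfed entry against under `strict` and `warnings`, the sketch below is one possible alternative to the lookup table above. It is not a golfed solution and not the author's method. It assumes the 64 codons are enumerated with the bases in the order U, C, A, G, so that a single 64-character string can list the amino acids in that order; the name `translate` and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): table-driven translation.
sub translate {
    my ($rna) = @_;
    my @bases = qw(U C A G);
    # Amino acids for UUU, UUC, UUA, UUG, UCU, ... GGG, with '.' for STOP.
    my $aminos = 'FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
    my %code;
    my $i = 0;
    for my $x (@bases) {
        for my $y (@bases) {
            for my $z (@bases) {
                $code{"$x$y$z"} = substr $aminos, $i++, 1;
            }
        }
    }
    # Only complete three-letter groups are matched, so leftovers are ignored.
    return join '', map { $code{$_} } $rna =~ /(...)/g;
}

print translate('UUCGAACACUGAG'), "\n";   # FEH.
```

Because only complete groups are ever looked up, no undefined hash values are interpolated and the code stays quiet under `warnings`.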

Interesting Links: Genetic Code, Golf challenge: match U.S. State names

Update: Typo in the example 'GAG'->'CAC' fixed.
