PerlMonks  

(Golf) RNA Genetic Code Translator

by tadman (Prior)
on Jul 06, 2001 at 00:36 UTC ( [id://94263]=perlmeditation )

Consider a function that, given an RNA sequence string, returns a string representing the corresponding amino acids. RNA is represented as a string of the letters A, C, G, and U, representing the bases Adenine, Cytosine, Guanine, and Uracil respectively. This differs from DNA in that Uracil replaces Thymine, which is why the alphabet is ACGU instead of the familiar ACGT (e.g. 'GATTACA'). Each amino acid is also represented by a single letter.

As an example, the string 'UUCGAACACUGAG' would be transformed into 'FEH.' and returned.

RNA works such that each group of three "letters" (i.e. bases), called a codon, corresponds to a particular amino acid or to the STOP sequence, which is represented here as a period. If there are one or two extra letters at the end of the sequence, these should be ignored. All input to the function is assumed to contain only the letters A, C, G, and U, and nothing else, though the number of characters may be arbitrary.
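As a small illustration of the grouping rule above, here is a minimal sketch that splits a sequence into complete three-letter groups and silently drops any leftover letters. The helper name `codons` is made up for this example and is not part of the challenge.

```perl
use strict;
use warnings;

# Hypothetical helper: split an RNA string into complete three-letter groups.
# A trailing group of one or two letters never matches /.../ and is dropped.
sub codons {
    my ($rna) = @_;
    return $rna =~ /.../g;   # list context: every non-overlapping match
}

my @c = codons('UUCGAACACUGAG');
print join(' ', @c), "\n";   # UUC GAA CAC UGA
```

The final `G` in the sample string is the leftover letter that the spec says to ignore.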


Below is a reference implementation that is not optimized, and includes comments for the curious:
sub f {
  my %g = (
    # . - Stop
    'UAA'=>'.','UAG'=>'.','UGA'=>'.',
    # A - Alanine
    'GCU'=>'A','GCC'=>'A','GCA'=>'A','GCG'=>'A',
    # C - Cysteine
    'UGU'=>'C','UGC'=>'C',
    # D - Aspartic Acid
    'GAU'=>'D','GAC'=>'D',
    # E - Glutamic Acid
    'GAA'=>'E','GAG'=>'E',
    # F - Phenylalanine
    'UUU'=>'F','UUC'=>'F',
    # G - Glycine
    'GGU'=>'G','GGC'=>'G','GGA'=>'G','GGG'=>'G',
    # H - Histidine
    'CAU'=>'H','CAC'=>'H',
    # I - Isoleucine
    'AUU'=>'I','AUC'=>'I','AUA'=>'I',
    # K - Lysine
    'AAA'=>'K','AAG'=>'K',
    # L - Leucine
    'CUU'=>'L','CUC'=>'L','CUA'=>'L','CUG'=>'L',
    'UUA'=>'L','UUG'=>'L',
    # M - Methionine
    'AUG'=>'M',
    # N - Asparagine
    'AAU'=>'N','AAC'=>'N',
    # P - Proline
    'CCU'=>'P','CCC'=>'P','CCA'=>'P','CCG'=>'P',
    # Q - Glutamine
    'CAA'=>'Q','CAG'=>'Q',
    # R - Arginine
    'CGU'=>'R','CGC'=>'R','CGA'=>'R','CGG'=>'R',
    'AGA'=>'R','AGG'=>'R',
    # S - Serine
    'UCU'=>'S','UCC'=>'S','UCA'=>'S','UCG'=>'S',
    'AGU'=>'S','AGC'=>'S',
    # T - Threonine
    'ACU'=>'T','ACC'=>'T','ACA'=>'T','ACG'=>'T',
    # V - Valine
    'GUU'=>'V','GUC'=>'V','GUA'=>'V','GUG'=>'V',
    # W - Tryptophan
    'UGG'=>'W',
    # Y - Tyrosine
    'UAU'=>'Y','UAC'=>'Y',
  );
  $_=pop;s/.{1,3}/$g{$&}/g;$_
}

print f("ACCCACAUUUCAUAAAUAUCCCCUGAGCGGCUCUGAGGGCAACUGUUCUAAUC");
Interesting Links: Genetic Code, Golf challange: match U.S. State names
Update: Typo in the example 'GAG'->'CAC' fixed.

Replies are listed 'Best First'.
Re: (Golf) RNA Genetic Code Translator
by MeowChow (Vicar) on Jul 06, 2001 at 02:45 UTC
    Here's a swing, strict at 222:
    sub f {
    my@r=qw(UA[AG]|UGA GC. - UG[UC] GA[UC] GA[AG] UU[UC] GG. CA[UC] AU[^G] - AA[AG] CU.|UU[AG] AUG AA[UC] - CC. CA[AG] CG.|AG[AG] UC.|AG[UC] AC. - GU. UGG - UA[UC]);
    ((my$t=pop)=~s|...|chr 64+(grep$&=~/$r[$_]/,0..25)[0]|eg);$t=~y/@/./;$t
    }
    update: I can't count, the one above is actually 232. And as no_slogan points out, it's sometimes helpful to read the spec. 238 chars:
    sub f {
    my@r=qw(UA[AG]|UGA GC. - UG[UC] GA[UC] GA[AG] UU[UC] GG. CA[UC] AU[^G] - AA[AG] CU.|UU[AG] AUG AA[UC] - CC. CA[AG] CG.|AG[AG] UC.|AG[UC] AC. - GU. UGG - UA[UC] ^);
    ((my$t=pop)=~s|..?.?|chr 64+(grep$&=~/$r[$_]/,0..26)[0]|eg);$t=~y/@Z/./d;$t
    }
       MeowChow                                   
                   s aamecha.s a..a\u$&owag.print
      That's a pretty nifty way to do the encoding. It doesn't eliminate trailing characters that aren't part of a group of three, though.
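For readers puzzling over that encoding, here is an ungolfed sketch of the idea as I read it: one pattern per alphabet position, with position 0 standing for STOP, and the amino acid letter recovered from the index of the first pattern that matches. The names `@pattern` and `translate` are made up for this sketch, and unlike the golfed version it uses `/.../g` so trailing letters are dropped.

```perl
use strict;
use warnings;

# Index 0 is STOP, index 1 is 'A', index 2 is 'B', and so on.
# A '-' marks letters that encode no amino acid; it can never match,
# because the input contains only A, C, G and U.
my @pattern = qw(
    UA[AG]|UGA GC. - UG[UC] GA[UC] GA[AG] UU[UC] GG. CA[UC] AU[^G] -
    AA[AG] CU.|UU[AG] AUG AA[UC] - CC. CA[AG] CG.|AG[AG] UC.|AG[UC]
    AC. - GU. UGG - UA[UC]
);

sub translate {
    my ($rna) = @_;
    my $out = '';
    for my $codon ($rna =~ /.../g) {
        # Index of the first pattern that matches the whole codon.
        my ($i) = grep { $codon =~ /^(?:$pattern[$_])$/ } 0 .. $#pattern;
        # Index 0 (STOP) becomes a period; input is assumed valid,
        # so $i is always defined here.
        $out .= $i ? chr(64 + $i) : '.';
    }
    return $out;
}

print translate('UUCGAACACUGAG'), "\n";   # FEH.
```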
Re: (Golf) RNA Genetic Code Translator
by japhy (Canon) on Jul 06, 2001 at 01:30 UTC
    There's no whitespace here, except for newlines (which I've decided not to count, although they've been placed to my advantage). It's 418 chars. I love my hash slices.
    sub B(){''}sub Z(){(B)x13}sub U(){(B)x31}sub O(){(B)x83}sub J(){(B)x343}sub b(){B,B,B}@g{AAA..UUU}=(K,B,N,b,K,Z,N,U,T,B,T,b,T,Z,T,O,R,B,S,b,R,Z,S,J,
    I,B,I,b,M,Z,I,(B)x811,Q,B,H,b,Q,Z,H,U,P,B,P,b,P,Z,P,O,R,B,R,b,R,Z,R,J,L,
    B,L,b,L,Z,L,(B)x2163,E,B,D,b,E,Z,D,U,A,B,A,b,A,Z,A,O,G,B,G,b,G,Z,G,J,V,B
    ,V,b,V,Z,V,(B)x8923,'.',B,Y,b,'.',Z,Y,U,S,B,S,b,S,Z,S,O,'.',B,C,b,W,Z,C,
    J,L,B,F,b,L,Z,F);sub f{$_=pop;s/..?.?/$g{$&}/g;$_}


    japhy -- Perl and Regex Hacker
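The hash-slice idea can be shown without the padding constants. The sketch below is not japhy's code: it is a readable variant that filters the `'AAA' .. 'UUU'` range down to the 64 real codons and assigns them in one slice. The 64-letter string `$aminos` was derived from the reference hash in alphabetical (A, C, G, U) codon order, and the variable names are made up for this example.

```perl
use strict;
use warnings;

# 'AAA' .. 'UUU' relies on Perl's magical string increment, so it yields
# every three-letter string from AAA to UUU, not only the 64 codons.
my @codons = grep { /^[ACGU]{3}$/ } 'AAA' .. 'UUU';

# Amino acids for the 64 codons in alphabetical (A, C, G, U) order.
my $aminos = 'KNKNTTTTRSRSIIMI' . 'QHQHPPPPRRRRLLLL'
           . 'EDEDAAAAGGGGVVVV' . '.Y.YSSSS.CWCLFLF';

my %g;
@g{@codons} = split //, $aminos;   # hash slice: 64 keys, 64 values

sub f {
    my ($rna) = @_;
    # A leftover group of one or two letters has no entry and becomes ''.
    $rna =~ s{..?.?}{$g{$&} // ''}ge;
    return $rna;
}

print f('UUCGAACACUGAG'), "\n";   # FEH.
```

The golfed version avoids the `grep` by padding the value list with empty strings for every non-codon key in the range, which is what the `(B)x13`-style constants are for.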
Re: (Golf) RNA Genetic Code Translator
by no_slogan (Deacon) on Jul 06, 2001 at 02:48 UTC
    130 characters:
    $_="KNNKtIIIMRSSRQHHQplr.YY.sLFFL.CCWEDDEavg";s/[a-z]/uc$&x4/eg;@x=/./g;join"",@x[map{$x=0;$x=$x*4|6&ord for/./g;$x/2}pop=~/.../g]
      I can shave 4 chars off of that:
      $_="KNNKtIIIMRSSRQHHQplr.YY.sLFFL.CCWEDDEavg";s/[a-z]/uc$&x4/eg; join"",(/./g)[map{$x=0;$x=$x*4|6&ord for/./g;$x/2}pop=~/.../g]

      The 15 year old, freshman programmer,
      Stephen Rawls
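An ungolfed sketch of the two tricks in the 130-character entry, as I read them: lowercase letters in the table stand for four copies of the uppercase letter, and `6 & ord` turns each base into an even number (A=0, C=2, U=4, G=6) that serves as a base-4 digit once halved. The names `base_digit` and `$table` are made up for this sketch.

```perl
use strict;
use warnings;

# Expand the compressed table: each lowercase letter stands for four
# copies of its uppercase form, giving 64 entries in A, C, U, G order.
(my $table = 'KNNKtIIIMRSSRQHHQplr.YY.sLFFL.CCWEDDEavg')
    =~ s/([a-z])/uc($1) x 4/ge;

# ord & 6 maps A=>0, C=>2, U=>4, G=>6; shifting right gives a digit 0..3.
sub base_digit { return (ord($_[0]) & 6) >> 1 }

sub f {
    my ($rna) = @_;
    my $out = '';
    for my $codon ($rna =~ /.../g) {
        my $index = 0;
        $index = $index * 4 + base_digit($_) for split //, $codon;
        $out .= substr $table, $index, 1;
    }
    return $out;
}

print f('UUCGAACACUGAG'), "\n";   # FEH.
```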

      Bending the spec a bit at 123 (regarding treatment of leftover base pairs):
      sub f { $_=pop;y/ACUG/0123/;s|(.)(.)(.)|(map{ord>91?uc:(),uc} 'KnKttiIMRsRQhQppllrr.y.ssLfL.cWEdEaavvgg'=~/./g)[$1*16+$2*4+$3]|eg;$_ }
         MeowChow                                   
                     s aamecha.s a..a\u$&owag.print
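The two compressed tables in this sub-thread look different but appear to encode the same 64-letter string: in one a lowercase letter means two copies, in the other four. A quick sketch to check that, with variable names made up for the example:

```perl
use strict;
use warnings;

# MeowChow's encoding: a lowercase letter expands to two uppercase copies.
my $two = join '', map { ord($_) > 91 ? uc($_) x 2 : $_ }
          split //, 'KnKttiIMRsRQhQppllrr.y.ssLfL.cWEdEaavvgg';

# no_slogan's encoding: a lowercase letter expands to four uppercase copies.
(my $four = 'KNNKtIIIMRSSRQHHQplr.YY.sLFFL.CCWEDDEavg')
    =~ s/([a-z])/uc($1) x 4/ge;

print "tables match\n" if $two eq $four;
print length($two), "\n";   # 64
```

Both strings are the same length (40 characters), so which encoding is shorter overall depends on the cost of the expansion code rather than the table itself.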
        Stunning, to say the least, but what is more stunning is the amateurish oversight that I made myself when posting my entry. How could I have not used the range feature of tr? I feel silly, but at least I'm not alone:
        sub f { $_=pop;y/ACUG/0-3/;s|(.)(.)(.)|(map{ord>91?uc:(),uc} 'KnKttiIMRsRQhQppllrr.y.ssLfL.cWEdEaavvgg'=~/./g)[$1*16+$2*4+$3]|eg;$_ }
        I was looking at my entry, trying to save a few strokes, motivated by scain's Benchmarks posted below. It was immediately obvious how to save a few strokes, now that I'm awake and caffeinated and all. Revised, mine ended up at 133, still a ways off of MeowChow at the new and improved 122 posted above:
        sub f{ $_=pop;y/UCAG/0-3/;s/(.)(.)(.)/substr "FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", $1<<4|$2*4|$3,1/ge;s/\d//g;$_ }
Re: (Golf) RNA Genetic Code Translator
by tadman (Prior) on Jul 06, 2001 at 13:54 UTC
    My first real attempt came in at 137 characters, not including cosmetic linebreaks:
    sub f{ $_=pop;y/UCAG/0123/;s/(.)(.)(.)/substr "FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG" ,$1<<4|$2<<2|$3,1/ge;y/0123//d;$_ }
    Which I thought was pretty decent, but it's a little behind the times. Strangely, when I use the simple but elegant compression technique introduced by no_slogan, this code expands. I'll have to look into that more.
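For anyone who wants the idea without the golf, here is a readable sketch of the same approach: turn the bases into the digits 0 to 3, treat each triplet as a base-4 number, and use it as an offset into a 64-letter table in U, C, A, G order. The table is the one from the entry above; the layout and variable names are only for illustration.

```perl
use strict;
use warnings;

# 64 amino acid letters indexed by a base-4 number in U, C, A, G order.
my $table = 'FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

sub f {
    my ($rna) = @_;
    $rna =~ tr/UCAG/0-3/;    # bases become digits
    # Each digit triplet is replaced by the table letter at that offset.
    $rna =~ s/(.)(.)(.)/substr $table, $1 * 16 + $2 * 4 + $3, 1/ge;
    $rna =~ tr/0-3//d;       # delete leftover digits (incomplete group)
    return $rna;
}

print f('UUCGAACACUGAG'), "\n";   # FEH.
```

The table contains no digits, so the final `tr` can only remove the one or two untranslated trailing bases.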
Re: (Golf) RNA Genetic Code Translator
by japhy (Canon) on Jul 06, 2001 at 00:48 UTC
    Are you asking us to golf the entire function, or just the substitution part? Here's a small savings:
    sub RNA {
      # hash here
      $_=pop;s/..?.?/$g{$&}/g;$_
    }
    The hash will take me some more time. I'll do that later.

    japhy -- Perl and Regex Hacker
      When I saw that there was already a reply I figured someone pulled a use RNA out of their back pocket.

      The reference function has the hash in it, so yes, any compatible one would also have to, though presumably in a more compact format, which you could obtain by using the reference function hash as input data. Saves typing it in yourself and all that.
Re: (Golf) RNA Genetic Code Translator
by scain (Curate) on Jul 06, 2001 at 17:50 UTC
    What would be more interesting than character golf in this instance is time golf, i.e., which of these implementations runs fastest? A function like this can be called often in the course of bioinformatics analysis, so you want it to be responsive.

    If I have time today (right :-) maybe I will do some benchmarks.

    Scott

Re: (Golf) RNA Genetic Code Translator
by tachyon (Chancellor) on Jul 06, 2001 at 03:01 UTC

    Here's mine at 310 strokes - a whole tournament and still over par!

    sub RNA {
    @_{'UAAUAGUGAGCUGCCGCAGCGUGUUGCGAUGACGAAGAGUUUUUCGGUGGCGGAGGGCAUCACAUUAUCAUAAAAAAGCUUCUCCUACUGUUAUUGAUGAAUAACCCUCCCCCACCGCAACAGCGUCGCCGACGGAGAAGGUCUUCCUCAUCGAGUAGCACUACCACAACGGUUGUCGUAGUGUGGUAUUAC'=~/(...)/g}=split//,'...AAAACCDDEEFFGGGGHHIIIKKLLLLLLMNNPPPPQQRRRRRRSSSSSSTTTTVVVVWYY';$_=pop;s/..?.?/$_{$&}/g;$_
    }

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
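A more readable sketch of the parallel-strings idea in that entry: one string holds all 64 codons back to back, a second holds the matching amino acid letters, and a single hash slice pairs them up. The codon string is the same data as above, only grouped per amino acid for legibility; the names `$codons`, `$aminos` and `%g` are made up for this sketch.

```perl
use strict;
use warnings;

# Codons grouped by amino acid, in the same order as the letters below.
my $codons = join '', qw(
    UAAUAGUGA GCUGCCGCAGCG UGUUGC GAUGAC GAAGAG UUUUUC GGUGGCGGAGGG
    CAUCAC AUUAUCAUA AAAAAG CUUCUCCUACUGUUAUUG AUG AAUAAC CCUCCCCCACCG
    CAACAG CGUCGCCGACGGAGAAGG UCUUCCUCAUCGAGUAGC ACUACCACAACG
    GUUGUCGUAGUG UGG UAUUAC
);
my $aminos = '...AAAACCDDEEFFGGGGHHIIIKKLLLLLLMNNPPPPQQRRRRRRSSSSSSTTTTVVVVWYY';

# Hash slice: the Nth three-letter group maps to the Nth amino acid letter.
my %g;
@g{ $codons =~ /.../g } = split //, $aminos;

sub f {
    my ($rna) = @_;
    return join '', map { $g{$_} } $rna =~ /.../g;
}

print f('UUCGAACACUGAG'), "\n";   # FEH.
```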

Re: (Golf) RNA Genetic Code Translator
by srawls (Friar) on Jul 06, 2001 at 03:45 UTC
    As an example, the string 'UUCGAAGAGUGAG' would be transformed into 'FEH.' and returned.

    I hope this is a typo, cause your hash makes it: 'FEE.'

    The 15 year old, freshman programmer,
    Stephen Rawls

Re: (Golf) RNA Genetic Code Translator
by scain (Curate) on Jul 06, 2001 at 21:04 UTC
    update: DNA, RNA what's the difference? My original code used the cDNA, not the mRNA. I changed it and reran it, and everyone's code now works except for japhy's.
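    For anyone making the same mix-up, converting between the two alphabets is a one-line transliteration. A minimal sketch, assuming the DNA is given as the coding (sense) strand so that only T needs to become U; `dna_to_rna` is a made-up helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite a coding-strand DNA string in the RNA
# alphabet by swapping T for U (input is upper-cased first).
sub dna_to_rna {
    my ($dna) = @_;
    (my $rna = uc $dna) =~ tr/T/U/;
    return $rna;
}

print dna_to_rna('ATGCAGAGGTCG'), "\n";   # AUGCAGAGGUCG
```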

    OK, this is going to be a long one...

    I was going to benchmark these golf examples to see which one was fastest, but there seems to be some cheating going on. Honestly, I don't really understand what any of these is doing, so I don't know if the cheating was intentional or not. To do the benchmarking, I was going to use the CFTR mRNA (CFTR is the protein that, when mutated, causes cystic fibrosis). The mRNA (with leading and trailing sequence removed) is in the __DATA__ section of the code. The correct translation looks like this:

    MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVDSADNLSEKLEREWDRELASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLLGRIIASYDPDNKEERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHHIGMQMRIAMFSLIYKKTLKLSSRVLDKISIGQLVSLLSNNLNKFDEGLALAHFVWIAPLQVALLMGLIWELLQASAFCGLGFLIVLALFQAGLGRMMMKYRDQRAGKISERLVITSEMIENIQSVKAYCWEEAMEKMIENLRQTELKLTRKAAYVRYFNSSAFFFSGFFVVFLSVLPYALIKGIILRKIFTTISFCIVLRMAVTRQFPWAVQTWYDSLGAINKIQDFLQKQEYKTLEYNLTTTEVVMENVTAFWEEGFGELFEKAKQNNNNRKTSNGDDSLFFSNFSLLGTPVLKDINFKIERGQLLAVAGSTGAGKTSLLMMIMGELEPSEGKIKHSGRISFCSQFSWIMPGTIKENIIFGVSYDEYRYRSVIKACQLEEDISKFAEKDNIVLGEGGITLSGGQRARISLARAVYKDADLYLLDSPFGYLDVLTEKEIFESCVCKLMANKTRILVTSKMEHLKKADKILILNEGSSYFYGTFSELQNLQPDFSSKLMGCDSFDQFSAERRNSILTETLHRFSLEGDAPVSWTETKKQSFKQTGEFGEKRKNSILNPINSIRKFSIVQKTPLQMNGIEEDSDEPLERRLSLVPDSEQGEAILPRISVISTGPTLQARRRQSVLNLMTHSVNQGQNIHRKTTASTRKVSLAPQANLTELDIYSRRLSQETGLEISEEINEEDLKECLFDDMESIPAVTTWNTYLRYITVHKSLIFVLIWCLVIFLAEVAASLVVLWLLGNTPLQDKGNSTHSRNNSYAVIITSTSSYYVFYIYVGVADTLLAMGFFRGLPLVHTLITVSKILHHKMLHSVLQAPMSTLNTLKAGGILNRFSKDIAILDDLLPLTIFDFIQLLLIVIGAIAVVAVLQPYIFVATVPVIVAFIMLRAYFLQTSQQLKQLESEGRSPIFTHLVTSLKGLWTLRAFGRQPYFETLFHKALNLHTANWFLYLSTLRWFQMRIEMIFVIFFIAVTFISILTTGEGEGRVGIILTLAMNIMSTLQWAVNSSIDVDSLMRSVSRVFKFIDMPTEGKPTKSTKPYKNGQLSKVMIIENSHVKKDDIWPSGGQMTVKDLTAKYTEGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLLNTEGEIQIDGVSWDSITLQQWRKAFGVIPQKVFIFSGTFRKNLDPYEQWSDQEIWKVADEVGLRSVIEQFPGKLDFVLVDGGCVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPVTYQIIRRTLKQAFADCTVILCEHRIEAMLECQQFLVIEENKVRQYDSIQKLLNERSLFRQAISPSDRVKLFPHRNSSKCKSKPQIAALKEETEEEVQDTRL.
    However, tachyon's, MeowChow's and tadman's original code all gave this:
    QRPEKASKSTRPRKGRQREDQPADEKEREREAKKPKARRRGGETKAQPGRADPNKEERAGGRTHPAGHGQRAKKTKSRKGQNNNKEGAAAPQAGEQAAGGAQAGGRKRQRAGKERTEEQKAEEAEKENRQTEKTRKAARSAGPAKGRKTTRATRQPAQTDGANKQQKQEKTENTTTEETAEEGGEEKAKQNNRKTGDSGTPKKERGQAAGTGAGKTGEEPEGKKHGRQPGTKEGERRSKAQEEDKAEKDGEGGTGGQRARARAKADPGTEKEESKANKTRTKEKKADKEGSSGTEQQPDSKGDQAERRTETHREGAPTETKKQKQTGEGEKRKPNRKQKTPQGEEEPERRPEQGEAPRSSTGPTQARRRQNTHNQGQNHRKTTATRKAPQANTERRQETGEEENEEDKEESPATTNTRTHKSAEAAGNTPQDKGTRNSATSTGADTAGRGPTTKHHKQAPTNTKAGGRKADPTDQGAAAQPATPARAQTQQKQEEGRPTTSKGTRAGRQPETHKATANTRQREATTTGEGEGRGTATQANSSRSRKDPTEGKPTKTKPKGQKEHKKDPGGQTKTAKTEGGAENPGQRGGRTGGKTARNTEGEQGTQQRKAGPQKGTRKNPEQQEKAEGREQPGKDGGSGHKQARKAKEPAPTQRRTKQAATEHREAEQQEENKRQQKNERSRQASPDRKPHRNSKKKPQAAKEETEEEQTR
    It is not at all clear to me why, and it is not at all related to CFTR. For that matter, it's not related to any protein in public databases. Congratulations, you did gene discovery; pharmaceutical companies spent billions of dollars to do that :-) Also, japhy's code returns nothing (except some line feeds apparently).

    So, can anyone point out the problems with these subs? I copied them directly from the html, and only removed "+" at the beginning of code wrapped lines, and changed the name of the subs. Here is my code:

    #!/usr/bin/perl

    while (<DATA>) {
      $cftr=$_;
    }

    print "tadman original\n".f0($cftr)."\n\n";
    print "japhy\n".f1($cftr)."\n\n";
    print "MeowChow\n".f2($cftr)."\n\n";
    print "no_slogan\n".f3($cftr)."\n\n";
    print "srawls\n".f4($cftr)."\n\n";
    print "tachyon\n".RNA($cftr)."\n\n";
    print "tadman golf\n".f5($cftr)."\n\n";

    sub f0 { # original by tadman
      my %g = (
        # . - Stop
        'UAA'=>'.','UAG'=>'.','UGA'=>'.',
        # A - Alanine
        'GCU'=>'A','GCC'=>'A','GCA'=>'A','GCG'=>'A',
        # C - Cysteine
        'UGU'=>'C','UGC'=>'C',
        # D - Aspartic Acid
        'GAU'=>'D','GAC'=>'D',
        # E - Glutamic Acid
        'GAA'=>'E','GAG'=>'E',
        # F - Phenylalanine
        'UUU'=>'F','UUC'=>'F',
        # G - Glycine
        'GGU'=>'G','GGC'=>'G','GGA'=>'G','GGG'=>'G',
        # H - Histidine
        'CAU'=>'H','CAC'=>'H',
        # I - Isoleucine
        'AUU'=>'I','AUC'=>'I','AUA'=>'I',
        # K - Lysine
        'AAA'=>'K','AAG'=>'K',
        # L - Leucine
        'CUU'=>'L','CUC'=>'L','CUA'=>'L','CUG'=>'L',
        'UUA'=>'L','UUG'=>'L',
        # M - Methionine
        'AUG'=>'M',
        # N - Asparagine
        'AAU'=>'N','AAC'=>'N',
        # P - Proline
        'CCU'=>'P','CCC'=>'P','CCA'=>'P','CCG'=>'P',
        # Q - Glutamine
        'CAA'=>'Q','CAG'=>'Q',
        # R - Arginine
        'CGU'=>'R','CGC'=>'R','CGA'=>'R','CGG'=>'R',
        'AGA'=>'R','AGG'=>'R',
        # S - Serine
        'UCU'=>'S','UCC'=>'S','UCA'=>'S','UCG'=>'S',
        'AGU'=>'S','AGC'=>'S',
        # T - Threonine
        'ACU'=>'T','ACC'=>'T','ACA'=>'T','ACG'=>'T',
        # V - Valine
        'GUU'=>'V','GUC'=>'V','GUA'=>'V','GUG'=>'V',
        # W - Tryptophan
        'UGG'=>'W',
        # Y - Tyrosine
        'UAU'=>'Y','UAC'=>'Y',
      );
      $_=pop;s/.{1,3}/$g{$&}/g;$_
    }

    sub #japhy
    B(){''}sub Z(){(B)x13}sub U(){(B)x31}sub O(){(B)x83}sub J(){(B)x343}sub b(){B,B,B}@g{AAA..UUU}=(K,B,N,b,K,Z,N,U,T,B,T,b,T,Z,T,O,R,B,S,b,R,Z,S,J,
    I,B,I,b,M,Z,I,(B)x811,Q,B,H,b,Q,Z,H,U,P,B,P,b,P,Z,P,O,R,B,R,b,R,Z,R,J,L,
    B,L,b,L,Z,L,(B)x2163,E,B,D,b,E,Z,D,U,A,B,A,b,A,Z,A,O,G,B,G,b,G,Z,G,J,V,B
    ,V,b,V,Z,V,(B)x8923,'.',B,Y,b,'.',Z,Y,U,S,B,S,b,S,Z,S,O,'.',B,C,b,W,Z,C,
    J,L,B,F,b,L,Z,F);sub f1{$_=pop;s/..?.?/$g{$&}/g;$_}

    sub f2{ #MeowChow
    my@r=qw(UA[AG]|UGA GC. - UG[UC] GA[UC] GA[AG] UU[UC] GG.
    CA[UC] AU[^G] - AA[AG] CU.|UU[AG] AUG AA[UC] - CC. CA[AG] CG.|AG[AG] UC.|AG[UC] AC. - GU. UGG - UA[UC] ^);
    ((my$t=pop)=~s|..?.?|chr 64+(grep$&=~/$r[$_]/,0..26)[0]|eg);$t=~y/@Z/./d;$t
    }

    sub f3 { #no_slogan
    $_="KNNKtIIIMRSSRQHHQplr.YY.sLFFL.CCWEDDEavg";s/[a-z]/uc$&x4/eg;@x=/./g;join"",@x[map{$x=0;$x=$x*4|6&ord for/./g;$x/2}pop=~/.../g]
    }

    sub f4 { #srawls
    $_="KNNKtIIIMRSSRQHHQplr.YY.sLFFL.CCWEDDEavg";s/[a-z]/uc$&x4/eg;
    join"",(/./g)[map{$x=0;$x=$x*4|6&ord for/./g;$x/2}pop=~/.../g]
    }

    sub RNA { #tachyon
    @_{'UAAUAGUGAGCUGCCGCAGCGUGUUGCGAUGACGAAGAGUUUUUCGGUGGCGGAGGGCAUCACAUUAUCAUAAAAAAGCUUCUCCUACUGUUAUUGAUGAAUAACCCUCCCCCACCGCAACAGCGUCGCCGACGGAGAAGG
    UCUUCCUCAUCGAGUAGCACUACCACAACGGUUGUCGUAGUGUGGUAUUAC'=~/(...)/g}=split//,'...AAAACCDDEEFFGGGGHHIIIKKLLLLLLMNNPPPPQQRRRRRRSSSSSSTTTTVVVVWYY';$_=pop
    ;s/..?.?/$_{$&}/g;$_
    }

    sub f5{ #tadman
    $_=pop;y/UCAG/0123/;s/(.)(.)(.)/substr
    "FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
    ,$1<<4|$2<<2|$3,1/ge;y/0123//d;$_
    }

    #>gi|6995995|ref|NM_000492.2| Homo sapiens cystic fibrosis transmembrane conductance regulator, ATP-binding cassette (sub-family C, member 7) (CFTR), mRNA
    __DATA__
    AUGCAGAGGUCGCCUCUGGAAAAGGCCAGCGUUGUCUCCAAACUUUUUUUCAGCUGGACCAGACCAAUUUUGAGGAAAGGAUACAGACAGCGCCUGGAAUUGUCAGACAUAUACCAAAUCCCUUCUGUUGAUUCUGCUGACAAUCUAUCUGAAAAAUUGGAAAGAGAAUGGGAUAGAGAGCUGGCUUCAAAGAAAAAUCCUAAACUCAUUAAUGCCCUUCGGCGAUGUUUUUUCUGGAGAUUUAUGUUCUAUGGAAUCUUUUUAUAUUUAGGGGAAGUCACCAAAGCAGUACAGCCUCUCUUACUGGGAAGAAUCAUAGCUUCCUAUGACCCGGAUAACAAGGAGGAACGCUCUAUCGCGAUUUAUCUAGGCAUAGGCUUAUGCCUUCUCUUUAUUGUGAGGACACUGCUCCUACACCCAGCCAUUUUUGGCCUUCAUCACAUUGGAAUGCAGAUGAGAAUAGCUAUGUUUAGUUUGAUUUAUAAGAAGACUUUAAAGCUGUCAAGCCGUGUUCUAGAUAAAAUAAGUAUUGGACAACUUGUUAGUCUCCUUUCCAACAACCUGAACAAAUUUGAUGAAGGACUUGCAUUGGCACAUUUCGUGUGGAUCGCUCCUUUGCAAGUGGCACUCCUCAUGGGGCUAAUCUGGGAGUUGUUACAGGCGUCUGCCUUCUGUGGACUUGGUUUCCUGAUAGUCCUUGCCCUUUUUCAGGCUGGGCUAGGGAGAAUGAUGAUGAAGUACAGAGAUCAGAGAGCUGGGAAGAUCAGUGAAAGACUUGUGAUUACCUCAGAAAUGAUUGAAAAUAUCCAAUCUGUUAAGGCAUACUGCUGGGAAGAAGCAAUGGAAAAAAUGAUUGAAAACUUAAGACAAACAGAACUGAAACUGACUCGGAAGGCAGCCUAUGUGAGAUACUUCAAUAGCUCAGCCUUCUUCUUCUCAGGGUUCUUUGUGGUGUUUUUAUCUGUGCUUCCCUAUGCACUAAUCAAAGGAAUCAUCCUCCGGAAAAUAUUCACCACCAUCUCAUUCUGCAUUGUUCUGCGCAUGGCGGUCACUCGGCAAUUUCCCUGGGCUGUACAAACAUGGUAUGACUCUCUUGGAGCAAUAAACAAAAUACAGGAUUUCUUACAAAAGCAAGAAUAUAAGACAUUGGAAUAUAACUUAACGACUACAGAAGUAGUGAUGGAGAAUGUAACAGCCUUCUGGGAGGAGGGAUUUGGGGAAUUAUUUGAGAAAGCAAAACAAAACAAUAACAAUAGAAAAACUUCUAAUGGUGAUGACAGCCUCUUCUUCAGUAAUUUCUCACUUCUUGGUACUCCUGUCCUGAAAGAUAUUAAUUUCAAGAUAGAAAGAGGACAGUUGUUGGCGGUUGCUGGAUCCACUGGAGCAGGCAAGACUUCACUUCUAAUGAUGAUUAUGGGAGAACUGGAGCCUUCAGAGGGUAAAAUUAAGCACAGUGGAAGAAUUUCAUUCUGUUCUCAGUUUUCCUGGAUUAUGCCUGGCACCAUUAAAGAAAAUAUCAUCUUUGGUGUUUCCUAUGAUGAAUAUAGAUACAGAAGCGUCAUCAAAGCAUGCCAACUAGAAGAGGACAUCUCCAAGUUUGCAGAGAAAGACAAUAUAGUUCUUGGAGAAGGUGGAAUCACACUGAGUGGAGGUCAACGAGCAAGAAUUUCUUUAGCAAGAGCAGUAUACAAAGAUGCUGAUUUGUAUUUAUUAGACUCUCCUUUUGGAUACCUAGAUGUUUUAACAGAAAAAGAAAUAUUUGAAAGCUGUGUCUGUAAACUGAUGGCUAACAAAACUAGGAUUUUGGUCACUUCUAAAAUGGAACAUUUAAAGAAAGCUGACAAAAUAUUAAUUUUGAAUGAAGGUAGCAGCUAUUUUUAUGGGACAUUUUCAGAACUCCAAAAUCUACAGCCAGACUUUAGCUCAAAACUCAUGGGAUGUGAUUCUUUCGACCAAUUUAGUGCAGAAAGAAGAAAUUCAAUCCUAACUGAGACCUUACACCGUUUCUCAUUAGAAGGAGAUGCUCCUGUCUCCUGGACAGAAACAAAAAAACAAUCUUUUAAACAGACUGGAGAGUUUGGGGAAAAAAGGAAGAAUUCUAUUCUCAAUCCAAUCAACUCUAUACGAAAAUUUUCCAUUGUGCAAAAGACUCCCUUACAAAUGAAUGGCAUCGAAGAGGAUUCUGAUGAGCCUUUAGAGAGAAGGCUGUCCUUAGUACCAGAUUCUGAGCAGGGAGAGGCGAUACUGCCUCGCAUCAGCGUGAUCAGCACUGGCCCCACGCUUCAGGCACGAAGGAGGCAGUCUGUCCUGAACCUGAUGACACACUCAGUUAACCAAGGUCAGAACAUUCACCGAAAGACAACAGCAUCCACACGAAAAGUGUCACUGGCCCCUCAGGCAAACUUGACUGAACUGGAUAUAUAUUCAAGAAGGUUAUCUCAAGAAACUGGCUUGGAAAUAAGUGAAGAAAUUAACGAAGAAGACUUAAAGGAGUGCCUUUUUGAUGAUAUGGAGAGCAUACCAGCAGUGACUACAUGGAACACAUACCUUCGAUAUAUUACUGUCCACAAGAGCUUAAUUUUUGUGCUAAUUUGGUGCUUAGUAAUUUUUCUGGCAGAGGUGGCUGCUUCUUUGGUUGUGCUGUGGCUCCUUGGAAACACUCCUCUUCAAGACAAAGGGAAUAGUACUCAUAGUAGAAAUAACAGCUAUGCAGUGAUUAUCACCAGCACCAGUUCGUAUUAUGUGUUUUACAUUUACGUGGGAGUAGCCGACACUUUGCUUGCUAUGGGAUUCUUCAGAGGUCUACCACUGGUGCAUACUCUAAUCACAGUGUCGAAAAUUUUACACCACAAAAUGUUACAUUCUGUUCUUCAAGCACCUAUGUCAACCCUCAACACGUUGAAAGCAGGUGGGAUUCUUAAUAGAUUCUCCAAAGAUAUAGCAAUUUUGGAUGACCUUCUGCCUCUUACCAUAUUUGACUUCAUCCAGUUGUUAUUAAUUGUGAUUGGAGCUAUAGCAGUUGUCGCAGUUUUACAACCCUACAUCUUUGUUGCAACAGUGCCAGUGAUAGUGGCUUUUAUUAUGUUGAGAGCAUAUUUCCUCCAAACCUCACAGCAACUCAAACAACUGGAAUCUGAAGGCAGGAGUCCAAUUUUCACUCAUCUUGUUACAAGCUUAAAAGGACUAUGGACACUUCGUGCCUUCGGACGGCAGCCUUACUUUGAAACUCUGUUCCACAAAGCUCUGAAUUUACAUACUGCCAACUGGUUCUUGUACCUGUCAACACUGCGCUGGUUCCAAAUGAGAAUAGAAAUGAUUUUUGUCAUCUUCUUCAUUGCUGUUACCUUCAUUUCCAUUUUAACAACAGGAGAAGGAGAAGGAAGAGUUGGUAUUAUCCUGACUUUAGCCAUGAAUAUCAUGAGUACAUUGCAGUGGGCUGUAAACUCCAGCAUAGAUGUGGAUAGCUUGAUGCGAUCUGUGAGCCGAGUCUUUAAGUUCAUUGACAUGCCAACAGAAGGUAAACCUACCAAGUCAACCAAACCAUACAAGAAUGGCCAACUCUCGAAAGUUAUGAUUAUUGAGAAUUCACACGUGAAGAAAGAUGACAUCUGGCCCUCAGGGGGCCAAAUGACUGUCAAAGAUCUCACAGCAAAAUACACAGAAGGUGGAAAUGCCAUAUUAGAGAACAUUUCCUUCUCAAUAAGUCCUGGCCAGAGGGUGGGCCUCUUGGGAAGAACUGGAUCAGGGAAGAGUACUUUGUUAUCAGCUUUUUUGAGACUACUGAACACUGAAGGAGAAAUCCAGAUCGAUGGUGUGUCUUGGGAUUCAAUAACUUUGCAACAGUGGAGGAAAGCCUUUGGAGUGAUACCACAGAAAGUAUUUAUUUUUUCUGGAACAUUUAGAAAAAACUUGGAUCCCUAUGAACAGUGGAGUGAUCAAGAAAUAUGGAAAGUUGCAGAUGAGGUUGGGCUCAGAUCUGUGAUAGAACAGUUUCCUGGGAAGCUUGACUUUGUCCUUGUGGAUGGGGGCUGUGUCCUAAGCCAUGGCCACAAGCAGUUGAUGUGCUUGGCUAGAUCUGUUCUCAGUAAGGCGAAGAUCUUGCUGCUUGAUGAACCCAGUGCUCAUUUGGAUCCAGUAACAUACCAAAUAAUUAGAAGAACUCUAAAACAAGCAUUUGCUGAUUGCACAGUAAUUCUCUGUGAACACAGGAUAGAAGCAAUGCUGGAAUGCCAACAAUUUUUGGUCAUAGAAGAGAACAAAGUGCGGCAGUACGAUUCCAUCCAGAAACUGCUGAACGAGAGGAGCCUCUUCCGGCAAGCCAUCAGCCCCUCCGACAGGGUGAAGCUCUUUCCCCACCGGAACUCAAGCAAGUGCAAGUCUAAGCCCCAGAUUGCUGCUCUGAAAGAGGAGACAGAAGAAGAGGUGCAAGAUACAAGGCUUUAG
    Happy debugging golfed code,
    Scott
      I'm not sure how you got those results, and the code you posted had some trouble running too. Apparently the __DATA__ wasn't being imported correctly.

      I changed that to a definition:
      $cftr="AUGCAGAGGUCGCCUCUGGAAA...";
      Everything ran fine after that, except that japhy just spins for a while and then outputs nothing. Otherwise, the results appear to be as expected.

      Update: With respect to scain's update, this update basically says that I didn't actually read his update, and so, this entire node is kind of pointless.
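      A sequence copied from a web page or a FASTA file often arrives wrapped over several lines, while a loop like `while (<DATA>) { $cftr=$_; }` keeps only the last line. The sketch below is one way to guard against that, not what either script in this thread does: it joins all lines and strips whitespace. `read_sequence` is a made-up name, and an in-memory filehandle stands in for DATA so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: read a sequence that may be wrapped over several
# lines and return it as one string with all whitespace removed.
sub read_sequence {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        next if $line =~ /^>/;    # skip a FASTA-style header line, if any
        $line =~ s/\s+//g;
        $seq .= $line;
    }
    return $seq;
}

# An in-memory handle stands in for DATA here.
my $text = ">example\nAUGCAGAGG\nUCGCCUCUG\n";
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
print read_sequence($fh), "\n";   # AUGCAGAGGUCGCCUCUG
close $fh;
```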
Re: (Golf) RNA Genetic Code Translator
by scain (Curate) on Jul 09, 2001 at 19:19 UTC
    So I got around to benchmarking today. I was struck by the large variability in the benchmarks--2 orders of magnitude difference. So can someone explain why tadman's and MeowChow's subs are so much faster?

    tadman original: timethis 100000: 70 wallclock secs (70.00 usr + 0.02 sys = 70.02 CPU) @ 1428.16/s (n=100000)
    MeowChow: timethis 100000: 13 wallclock secs (12.25 usr + 0.00 sys = 12.25 CPU) @ 8163.27/s (n=100000)
    no_slogan: timethis 100000: 81 wallclock secs (78.22 usr + 0.04 sys = 78.26 CPU) @ 1277.79/s (n=100000)
    srawls: timethis 100000: 56 wallclock secs (53.80 usr + 0.07 sys = 53.87 CPU) @ 1856.32/s (n=100000)
    tachyon: timethis 100000: 90 wallclock secs (89.07 usr + 0.06 sys = 89.13 CPU) @ 1121.96/s (n=100000)
    tadman golf: timethis 100000: 1 wallclock secs ( 0.83 usr + 0.00 sys = 0.83 CPU) @ 120481.93/s (n=100000)

    Scott


    Experimental method:

    I placed each sub in a holder script and executed it. The script looked like this:

    #!/usr/bin/perl

    use Benchmark;

    my $iter=100000;

    while (<DATA>){
    #tadman original
      timethis($iter, sub { $_=pop;y/UCAG/0123/;s/(.)(.)(.)/substr
      "FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
      ,$1<<4|$2<<2|$3,1/ge;y/0123//d;$_ });
    }
    __DATA__
    yada yada... the CFTR mRNA from above.
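    A variation on this holder script, sketched under a few assumptions: both candidates are compared in one run with `cmpthese` from the core Benchmark module, and the sequence is passed to each sub explicitly rather than picked up with `pop` inside the timed closure. The sub names `by_hash` and `by_table`, the short test sequence (the first 27 bases of the CFTR mRNA, repeated) and the iteration count are choices made for this example, not the setup used for the numbers above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# 64 amino acid letters indexed in U, C, A, G order (tadman's table).
my $table = 'FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

# Build a codon => letter hash from the same table.
my @base = qw(U C A G);
my (%g, $i);
$i = 0;
for my $x (@base) { for my $y (@base) { for my $z (@base) {
    $g{"$x$y$z"} = substr $table, $i++, 1;
} } }

sub by_hash {
    my ($rna) = @_;
    return join '', map { $g{$_} } $rna =~ /.../g;
}

sub by_table {
    my ($rna) = @_;
    $rna =~ tr/UCAG/0-3/;
    $rna =~ s/(.)(.)(.)/substr $table, $1 * 16 + $2 * 4 + $3, 1/ge;
    $rna =~ tr/0-3//d;
    return $rna;
}

# Test input: a 27-base fragment repeated 100 times (2700 bases).
my $seq = 'AUGCAGAGGUCGCCUCUGGAAAAGGCC' x 100;

cmpthese(2000, {
    hash  => sub { by_hash($seq) },
    table => sub { by_table($seq) },
});
```

Building the lookup structure once, outside the timed code, keeps setup cost out of the comparison; whether that matters for the figures above is not something this sketch can settle.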
      I think MeowChow's would be nearly as fast, except that for reasons of brevity it performs this crazy map operation on every base-pair triplet substitution.

      On this train of thought, is there such a thing as a Benchmark-type routine that will test performance on a variety of data sizes? So many times people benchmark a variety of routines with only one set of data, which has the result of being a 1-dimensional test where there are actually 2 independent variables (function and data set size).

      In line with Big-O Notation, is it possible and/or has someone written a Benchmark-type module which would estimate what kind of O(f(n)) function would best represent how the algorithm in question scales? Certainly not trivial by any means, but not impossible either.
        This is not exactly what you asked for, but it would be easy to place several sequences in the __DATA__ chunk, one to a line, and since the timethis is in a while loop, it will iterate over each sequence. If you wanted to be really anal, you could then pipe the output to another perl program to parse and do statistics.
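        A rough sketch of the "several data sizes" idea, using Time::HiRes instead of Benchmark so the elapsed time per size is easy to collect. Everything here is an assumption made for illustration: the sizes, the repeat count, the 27-base unit (the start of the CFTR mRNA) and the choice of the table-based translator as the function under test.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

my $table = 'FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

# Function under test: the table-based translator from this thread,
# written out in readable form.
sub f {
    my ($rna) = @_;
    $rna =~ tr/UCAG/0-3/;
    $rna =~ s/(.)(.)(.)/substr $table, $1 * 16 + $2 * 4 + $3, 1/ge;
    $rna =~ tr/0-3//d;
    return $rna;
}

my $unit = 'AUGCAGAGGUCGCCUCUGGAAAAGGCC';   # 27 bases
my %seconds;                                # input length => elapsed time

for my $copies (10, 100, 1000) {
    my $seq   = $unit x $copies;
    my $start = time;
    f($seq) for 1 .. 50;
    $seconds{ length($seq) } = time - $start;
}

printf "%6d bases: %.4f s\n", $_, $seconds{$_}
    for sort { $a <=> $b } keys %seconds;
```

Plotting or fitting the resulting pairs against the input length would give a first impression of how the routine scales, though a handful of sizes is far from a proper complexity estimate.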

        Scott
