Increasing the efficiency of a viral clonal expansion model

ZWcarp has asked for the wisdom of the Perl Monks concerning the following question:

I have created a draft script that represents a viral clonal expansion. It takes as input a seeder sequence of DNA, and then outputs a fasta file with labeled date. Right now I have it running 120 replication cycles, which represents "4 months" in the virtual timescale I have created. My issue is that the program is just too slow, and I was wondering if anyone had some suggestions for speeding it up. Essentially all it does is read in array, split by character, mutate based on probability, then pool all replications into a second array which is then fed into a hash (%allgen). There are parameters which effect 1, how many copies each individual sequence makes,2, how many replication cycles the virus goes through, and 3, how many sequences from the resulting pool of new viruses will be selected (at random) to see the following replciation cycle. Here is the script---->

#!/usr/bin/perl -w
use strict;
use List::Util 'shuffle';
use Data::Dumper qw(Dumper); 
#g=generation
# Current seeding sequence is HA
       my @g0= ("ATGAAGGCAATACTAGTAGTTCTGCTATATACATTTGCAACCGCAAATGCAGA
+CACATTATGTATAGGTTATCATGCGAACAATTCAACAGACACTGTAGACACAGTACTAGAAAAGAATGT
+AACAGTAACACACTCTGTTAACCTTCTAGAAGACAAGCATAACGGGAAACTATGCAAACTAAGAGGGGT
+AGCCCCATTGCATTTGGGTAAATGTAACATTGCTGGCTGGATCCTGGGAAATCCAGAGTGTGAATCACT
+CTCCACAGCAAGCTCATGGTCCTACATTGTGGAAACACCTAGTTCAGACAATGGAACGTGTTACCCAGG
+AGATTTCATCGATTATGAGGAGCTAAGAGAGCAATTGAGCTCAGTGTCATCATTTGAAAGGTTTGAGAT
+ATTCCCCAAGACAAGTTCATGGCCCAATCATGACTCGAACAAAGGTGTAACGGCAGCATGTCCTCATGC
+TGGAGCAAAAAGCTTCTACAAAAATTTAATATGGCTAGTTAAAAAAGGAAATTCATACCCAAAGCTCAG
+CAAATCCTACATTAATGATAAAGGGAAAGAAGTCCTCGTGCTATGGGGCATTCACCATCCATCTACTAG
+TGCTGACCAACAAAGTCTCTATCAGAATGCAGATACATATGTTTTTGTGGGGTCATCAAGATACAGCAA
+GAAGTTCAAGCCGGAAATAGCAATAAGACCCAAAGTGAGGGATCAAGAAGGGAGAATGAACTATTACTG
+GACACTAGTAGAGCCGGGAGACAAAATAACATTCGAAGCAACTGGAAATCTAGTGGTACCGAGATATGC
+ATTCGCAATGGAAAGAAATGCTGGATCTGGTATTATCATTTCAGATACACCAGTCCACGATTGCAATAC
+AACTTGTCAAACACCCAAGGGTGCTATAAACACCAGCCTCCCATTTCAGAATATACATCCGATCACAAT
+TGGAAAATGTCCAAAATATGTAAAAAGCACAAAATTGAGACTGGCCACAGGATTGAGGAATATCCCGTC
+TATTCAATCTAGAGGCCTATTTGGGGCCATTGCCGGTTTCATTGAAGGGGGGTGGACAGGGATGGTAGA
+TGGATGGTACGGTTATCACCATCAAAATGAGCAGGGGTCAGGATATGCAGCCGACCTGAAGAGCACACA
+GAATGCCATTGACGAGATTACTAACAAAGTAAATTCTGTTATTGAAAAGATGAATACACAGTTCACAGC
+AGTAGGTAAAGAGTTCAACCACCTGGAAAAAAGAATAGAGAATTTAAATAAAAAAGTTGATGATGGTTT
+CCTGGACATTTGGACTTACAATGCCGAACTGTTGGTTCTATTGGAAAATGAAAGAACTTTGGACTACCA
+CGATTCAAATGTGAAGAACTTATATGAAAAGGTAAGAAGCCAGCTAAAAAACAATGCCAAGGAAATTGG
+AAACGGCTGCTTTGAATTTTACCACAAATGCGATAACACGTGCATGGAAAGTGTCAAAAATGGGACTTA
+TGACTACCCAAAATACTCAGAGGAAGCAAAATTAAACAGAGAAGAAATAGATGGGGTAAAGCTGGAATC
+AACAAGGATTTACCAGATTTTGGCGATCTATTCAACTGTCGCCAGTTCATTGGTACTGGTAGTCTCCCT
+GGGGGCAATCAGTTTCTGGATGTGCTCTAATGGGTCTCTACAGTGTAGAATATGTATTTAA"); #Ini
+tial generation "seeding generation"
       
       
       
    #########
    ######### Paramters
    #########
    my $viralfitness=30; # The size of the range a random number is pi
+cked from that determines how many copies each viral sequence that ma
+kes it through a round will make
    my $replication_rounds=120; 
    my $portion_decimal=50; # this number (when divided by 100 below )
+ provides the range of fractions to select a random portion for the n
+ext gen so for example 50 will result in a random number being select
+ed from the range .1-.5, that number will then be mutiplied by the sc
+alar(@new_gen) to get the array coordinates to seed the next gen
    ###########
    ###########
    ###########
   my %allgen; #hash that holds sequences
   my $i;
   my $j;
   my @population_by_gen=(); # this array keeps track of all the virus
+es generated in each round including the ones that dont make it to th
+e next
   
   
   %allgen=(0 =>\@g0);
   
          for($i=1;$i<$replication_rounds;$i++) # This $i determines t
+he number of replications 
          {
             
             my @new_gen=();
             my @last_gen=@{$allgen{$i-1}};
             my @filtered_new_gen=();
              
              foreach my $seq_string(@last_gen)
              {
                  chomp $seq_string;
                    
                 my $vf_rand=int(rand($viralfitness))+1; #The random n
+umber selected from the range specified by $viralfitness;
                 
                 for($j=0;$j<($vf_rand);$j++) #This loop determines ho
+w many copies each sequence will make of itself 
                  {
                      my $new_string;
                      my @nucleotide_array = split(//,$seq_string);   
+     
                      foreach my $nucleotide(@nucleotide_array)
                      {
                          chomp $nucleotide;
                                  
                          my($T,$G,$C,$A);
                          #These mutations rates are just for testing 
+now they dont reflect accurate viral mutation rates yet
                              if ($nucleotide eq 'A'){
                                     # $T=.1;$G=.2;$C=.3;$A=.7;
                                      my $score = (int rand(100000))/1
+00000;
                                      if( $score>=00000 && $score<=.00
+033){ $nucleotide=~s/A/T/g;}                       
                                      if( $score>=.000334 && $score<=.
+00066){ $nucleotide=~s/A/G/g;} 
                                      if( $score>=.00067 && $score<=.0
+0099){ $nucleotide=~s/A/C/g;} 
                                     # if( $score>=.300 && $score<=1.0
+0){  Do NOTHING      } }
                                  }
                              elsif ($nucleotide eq 'T'){
                                     # $A=.1;$G=.2;$C=.3;$T=.7;
                                      my $score = (int rand(100000))/1
+00000;
                                      if( $score>=00000 && $score<=.00
+0333){ $nucleotide=~s/T/A/g;}                       
                                      if( $score>=.00034 && $score<=.0
+0066){ $nucleotide=~s/T/G/g;} 
                                      if( $score>=.00067 && $score<=.0
+0099){ $nucleotide=~s/T/C/g;} 
                                      #if( $score>=.300 && $score<=1.0
+0){  DO NOTHING      } }
                                  }
                              elsif ($nucleotide eq 'C'){
                                     # $T=.1;$G=.2;$A=.3;$C=.7;
                                      my $score = (int rand(100000))/1
+00000;
                                      if( $score>=00000 && $score<=.00
+033){ $nucleotide=~s/C/T/g;}                       
                                      if( $score>=.00034 && $score<=.0
+0066){ $nucleotide=~s/C/G/g;} 
                                      if( $score>=.00067 && $score<=.0
+0099){ $nucleotide=~s/C/A/g;} 
                                      #if( $score>=.300 && $score<=1.0
+0){  #No mutation} 
                                  }
                              elsif ($nucleotide eq 'G'){
                                     # $T=.1;$A=.2;$C=.3;$G=.7;
                                      my $score = (int rand(100000))/1
+00000;
                                      if( $score>=00000 && $score<=.00
+033){ $nucleotide=~s/G/T/g;}                       
                                      if( $score>=.00034 && $score<=.0
+0066){ $nucleotide=~s/G/A/g;} 
                                      if( $score>=.00067 && $score<=.0
+0099){ $nucleotide=~s/G/C/g;} 
                                      #if( $score>=.300 && $score<=1.0
+0){  #No mutation} 
                                  }
                          $new_string= $new_string . $nucleotide;
                      }
                      push(@new_gen,$new_string);
                  }         
              
              }       
         
          @new_gen=shuffle(@new_gen);
         
          $population_by_gen[0]=@g0;  # the input sequences
          $population_by_gen[$i]=@new_gen;  # The current total create
+d, including those that don't go on to replicate
  
  
            #these next conditionals prevent the new_gen from fizzling
+ out by making a larger range to select when the numbers are low    
            #The value to be changed is the +10(or whatever) that is b
+efore the )/100        
           
            # print "Sequences:".scalar(@new_gen)."\t";
                    
             if( (scalar(@new_gen)) < 10){
               my $v3=((int(rand($portion_decimal))+10)/100)*(scalar(@
+new_gen)); # the added number determins the minimum of the range that
+ a decimal can be selected from ie 20-50 =.20-.50
               
               for(my $k=0;$k<($v3);$k++){  $filtered_new_gen[$k]=$new
+_gen[$k];}
               
            # print "Survivors:$v3\n";
                }
             elsif( (scalar(@new_gen)) < 20){
               my $v3=((int(rand($portion_decimal))+10)/100)*(scalar(@
+new_gen)); # the added number determins the minimum of the range that
+ a decimal can be selected from ie 20-50 =.20-.50
               
               for(my $k=0;$k<($v3);$k++){  $filtered_new_gen[$k]=$new
+_gen[$k];}
               
            # print "Survivors:$v3\n";
                }
             else{
                   my $v3=((int(rand($portion_decimal))+10)/100)*(scal
+ar(@new_gen));
                   for(my $k=0;$k<($v3);$k++){  $filtered_new_gen[$k]=
+$new_gen[$k];}
                 #  print "Survivors: $v3\n";
                 }
             $allgen{$i}=\@filtered_new_gen;
        }
         
    ################## Print Block
    ################## 
           # print Dumper(\\%allgen); #Print this to view the sequence
+s in each generation
           # print Dumper(\\@population_by_gen);  #print this to view 
+the total progeny of each of the generations in the %allgen (inlcudin
+g the ones that didnt go on to populate the next gen     
              
            # print @{$allgen{1}} . "\n";
          
              for my$i(0..($replication_rounds-1))
            {
                 for my $j (0..(@{$allgen{$i}}))
                 {
                    my $month=(int($i/30)+1); # the counter divided by
+ 30 will give a decimal, the int rounds it down... therefore .2 -->0 
+(then the +1) yields a january
                    if ($month<10){$month="0".$month;} ## too add a ze
+ro which mat lab reads easier so 09 instead of 9 for sept
                    my $date="000".(int($month/12))."/".$month."/".($i
+-(($month-1)*30)); # year is set at 0000 for now.... change 000 to 20
+0 or whatever to make it a specific year
                  
                    print "\>RandomHA ". $date ." "."\n" . $allgen{$i}
+[$j-1] . "\n";
                  
                  # to print a file automatically in fasta format
                    #open(FILE,">>Rand_HA_.txt");
                    #print FILE "\>RandomHA ". $date ." " ."\n" . $all
+gen{$i}[$j-1] . "\n";
                    #close(FILE);
                }
            }
  ###################
[download]

I know I went a little overboard with the comments but I wanted to try to make things more clear. I know this is a doozy of a script so I very much appreciate anyone who takes the time to look. Biology suggestions also welcome for any of you with that background. This is a wonderful resource thanks for your time.

Comment on Increasing the efficiency of a viral clonal expansion model Download Code

Replies are listed 'Best First'.
Re: Increasing the efficiency of a viral clonal expansion model by Anonymous Monk on Jul 06, 2011 at 16:15 UTC
I know I went a little overboard with the comments but I wanted to try to make things more clear. I understand the feeling, but if your variable and function names are meaningful (and you use functions), you need less comments I'll give you an example, this part of your code Read more... (3 kB) I would rewrite as Read more... (517 Bytes) `for my $i ( 0 .. ( $replication_rounds - 1 ) ) { for my $j ( 0 .. $#{ $allgen{$i} } ) { my $date = BlahIjMeSomeDate( 200, $i, $j ); print "\>RandomHA $date \n" . $allgen{$i}[ $j ] ."\n"; } }` [download] See, doesn't that make more sense? Similarly `foreach my $nucleotide (@nucleotide_array) { ...` I would rewrite as `foreach my $nucleotide (@nucleotide_array) { chomp $nucleotide; my $score = ( int rand(100000) ) / 100000; ModifyNucleotideBasedOnScore( \$nucleotide, $score ); } sub ModifyNucleotideBasedOnScore { my( $nucleotide, $score ) = @_; ... $$nucleotide =~ s/A/T/g;` [download] What you should document is the overall algorithm Now I'm not sure about this bioinformatics thing, but if $nucleotide is one character long, you can speed up the program by NOT using substitution operator, use assignment. I let your program run for 10min and it grew to about 140mb in memory, at which point I killed it Changing =~ s/... to straight assignment, shaved off a few minutes off (it reached ~140mb in memory in about 8 min) To make sure you're squeezing the most speed out of the shuffle make sure you're using the XS version of List::Util (see About List::Util's pure Perl shuffle()) Since your program seems to be CPU/Memory bound and not IO bound, you could probably speed it up with threads, something like Re: Thread::Pool shutdown dies after abort or my $maxNumberOfThreads = 5; my $threadCount = 0; for ( $i = 1 ; $i < $replication_rounds ; $i++ ) { my $thr = threads->create( {'context' => 'list'}, \&ReturnSomeFilteredNewGen, ); $threadCount++; next if $threadCount < $maxNumberOfThreads ; JoinJoinables(\%allgen); } JoinJoinables(\%allgen); sub JoinJoinables { my( $allgen ) = @_; while( my @joinable = threads->list(threads::joinable) ){ for my $thr( @joinable ){ my( $i, @filtered_new_gen ) = $thr->join(); $$allgen{$i} = \@filtered_new_gen; } } } [download]	[reply] [d/l] [select]
Re^2: Increasing the efficiency of a viral clonal expansion model by ZWcarp (Beadle) on Jul 06, 2011 at 19:21 UTC
So you mean change it to this--> foreach my $nucleotide(@nucleotide_array) { chomp $nucleotide; my($T,$G,$C,$A); #These mutations rates reflect 10^-5 if ($nucleotide eq 'A'){ # $T=.1;$G=.2;$C=.3;$A=.7; my $score = (int rand(100000))/1 +00000; if( $score>=00000 && $score<=.00 +033){ $nucleotide='T';} if( $score>=.000334 && $score<=. +00066){ $nucleotide='G';} if( $score>=.00067 && $score<=.0 +0099){ $nucleotide='C';} # if( $score>=.300 && $score<=1.0 +0){ Do NOTHING } } } elsif ($nucleotide eq 'T'){ # $A=.1;$G=.2;$C=.3;$T=.7; my $score = (int rand(100000))/1 +00000; if( $score>=00000 && $score<=.00 +0333){ $nucleotide='A';} if( $score>=.00034 && $score<=.0 +0066){ $nucleotide='G';} if( $score>=.00067 && $score<=.0 +0099){ $nucleotide='C';} #if( $score>=.300 && $score<=1.0 +0){ DO NOTHING } } } elsif ($nucleotide eq 'C'){ # $T=.1;$G=.2;$A=.3;$C=.7; my $score = (int rand(100000))/1 +00000; if( $score>=00000 && $score<=.00 +033){ $nucleotide='T';} if( $score>=.00034 && $score<=.0 +0066){ $nucleotide='G';} if( $score>=.00067 && $score<=.0 +0099){ $nucleotide='A';} #if( $score>=.300 && $score<=1.0 +0){ #No mutation} } elsif ($nucleotide eq 'G'){ # $T=.1;$A=.2;$C=.3;$G=.7; my $score = (int rand(100000))/1 +00000; if( $score>=00000 && $score<=.00 +033){ $nucleotide='T';} if( $score>=.00034 && $score<=.0 +0066){ $nucleotide='A';} if( $score>=.00067 && $score<=.0 +0099){ $nucleotide='C';} #if( $score>=.300 && $score<=1.0 +0){ #No mutation} } $new_string= $new_string . $nucleotide; [download]	[reply] [d/l]
Re^3: Increasing the efficiency of a viral clonal expansion model by BrowserUk (Patriarch) on Jul 06, 2011 at 19:27 UTC
Why are you chomping individual characters? `my @nucleotide_array = split //, $seq_string; foreach my $nucleotide ( @nucleotide_array ) { chomp $nucleotide;` [download] It may do no harm to the algorithmm, but calling a function millions of times to do nothing is just silly. Especially when you are concerned with performance. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^4: Increasing the efficiency of a viral clonal expansion model by ZWcarp (Beadle) on Jul 06, 2011 at 19:34 UTC
Re^5: Increasing the efficiency of a viral clonal expansion model by BrowserUk (Patriarch) on Jul 06, 2011 at 19:42 UTC
Some notes below your chosen depth have not been shown here
Re^3: Increasing the efficiency of a viral clonal expansion model by Anonymous Monk on Jul 06, 2011 at 20:14 UTC
Yes, that is what I mean. Consider this benchmark Read more... (2 kB) And the results Read more... (1445 Bytes) For the length of string in your program (1702) using assignment (SplitAssign ) is faster than using regex (SplitRegex)	[reply] [d/l] [select]