Re: Increasing the efficiency of a viral clonal expansion model

I know I went a little overboard with the comments but I wanted to try to make things more clear.

I understand the feeling, but if your variable and function names are meaningful (and you use functions), you need less comments

I'll give you an example, this part of your code

          
              for my$i(0..($replication_rounds-1))
            {
                 for my $j (0..(@{$allgen{$i}}))
                 {
                    my $month=(int($i/30)+1); # the counter divided by
+ 30 will give a decimal, the int rounds it down... therefore .2 -->0 
+(then the +1) yields a january
                    if ($month<10){$month="0".$month;} ## too add a ze
+ro which mat lab reads easier so 09 instead of 9 for sept
                    my $date="000".(int($month/12))."/".$month."/".($i
+-(($month-1)*30)); # year is set at 0000 for now.... change 000 to 20
+0 or whatever to make it a specific year
                  
                    print "\>RandomHA ". $date ." "."\n" . $allgen{$i}
+[$j-1] . "\n";
                  
                  # to print a file automatically in fasta format
                    #open(FILE,">>Rand_HA_.txt");
                    #print FILE "\>RandomHA ". $date ." " ."\n" . $all
+gen{$i}[$j-1] . "\n";
                    #close(FILE);
                }
            }
[download]

Run through perltidy -csc -otr -opr -ce -nibc -i=4 and manually tweaked

for my $i ( 0 .. ( $replication_rounds - 1 ) ) {
    for my $j ( 0 .. ( @{ $allgen{$i} } ) ) {
# the counter divided by 30 will give a decimal,
#~ the int rounds it down...
#~ therefore .2 -->0 (then the +1) yields a january
        my $month = ( int( $i / 30 ) + 1 );
        if ( $month < 10 ) {
            $month = "0" . $month;
        }
## too add a zero which mat lab reads easier so 09 instead of 9 for se
+pt
        my $date = "000"
          . ( int( $month / 12 ) ) . "/"
          . $month . "/"
          . ( $i - ( ( $month - 1 ) * 30 ) ) ;
# year is set at 0000 for now....
# change 000 to 200 or whatever to make it a specific year

        print "\>RandomHA " . $date . " " . "\n" . $allgen{$i}[ $j - 1
+ ] . "\n";

    } ## end for my $j ( 0 .. ( @{ $allgen...}))
} ## end for my $i ( 0 .. ( $replication_rounds...))
###################
[download]

I would rewrite as


for my $i ( 0 .. ( $replication_rounds - 1 ) ) {
    for my $j ( 0 .. ( @{ $allgen{$i} } ) ) {
        my $date = BlahIjMeSomeDate( 200, $i, $j );
        print "\>RandomHA $date \n" . $allgen{$i}[ $j - 1 ] ."\n";
    }
}
[download]

Instead of $j - 1 and @{ $allgen{$i} you probably wanted

for my $i ( 0 .. ( $replication_rounds - 1 ) ) {
    for my $j ( 0 .. $#{ $allgen{$i} }  ) {
        my $date = BlahIjMeSomeDate( 200, $i, $j );
        print "\>RandomHA $date \n" . $allgen{$i}[ $j ] ."\n";
    }
}
[download]

See, doesn't that make more sense?

Similarly foreach my $nucleotide (@nucleotide_array) { ... I would rewrite as

foreach my $nucleotide (@nucleotide_array) {
    chomp $nucleotide;
    my $score = ( int rand(100000) ) / 100000;
    ModifyNucleotideBasedOnScore( \$nucleotide, $score );
}

sub ModifyNucleotideBasedOnScore {
   my( $nucleotide, $score ) = @_;
   ...
   $$nucleotide =~ s/A/T/g;
[download]

What you should document is the overall algorithm

Now I'm not sure about this bioinformatics thing, but if $nucleotide is one character long, you can speed up the program by NOT using substitution operator, use assignment.

I let your program run for 10min and it grew to about 140mb in memory, at which point I killed it

Changing =~ s/... to straight assignment, shaved off a few minutes off (it reached ~140mb in memory in about 8 min)

To make sure you're squeezing the most speed out of the shuffle make sure you're using the XS version of List::Util (see About List::Util's pure Perl shuffle())

Since your program seems to be CPU/Memory bound and not IO bound, you could probably speed it up with threads, something like Re: Thread::Pool shutdown dies after abort or

my $maxNumberOfThreads = 5;
my $threadCount = 0;
for ( $i = 1 ; $i < $replication_rounds ; $i++ ) {
    my $thr = threads->create(
        {'context' => 'list'},
        \&ReturnSomeFilteredNewGen,
    );
    $threadCount++;
    next if $threadCount < $maxNumberOfThreads ;
    JoinJoinables(\%allgen);
}
JoinJoinables(\%allgen);


sub JoinJoinables {
    my( $allgen ) = @_;
    while( my @joinable = threads->list(threads::joinable) ){
        for my $thr( @joinable ){
            my( $i, @filtered_new_gen ) = $thr->join();
            $$allgen{$i} = \@filtered_new_gen;
        }
    }
}
[download]

Comment on Re: Increasing the efficiency of a viral clonal expansion model Select or Download Code

Replies are listed 'Best First'.
Re^2: Increasing the efficiency of a viral clonal expansion model by ZWcarp (Beadle) on Jul 06, 2011 at 19:21 UTC
So you mean change it to this--> foreach my $nucleotide(@nucleotide_array) { chomp $nucleotide; my($T,$G,$C,$A); #These mutations rates reflect 10^-5 if ($nucleotide eq 'A'){ # $T=.1;$G=.2;$C=.3;$A=.7; my $score = (int rand(100000))/1 +00000; if( $score>=00000 && $score<=.00 +033){ $nucleotide='T';} if( $score>=.000334 && $score<=. +00066){ $nucleotide='G';} if( $score>=.00067 && $score<=.0 +0099){ $nucleotide='C';} # if( $score>=.300 && $score<=1.0 +0){ Do NOTHING } } } elsif ($nucleotide eq 'T'){ # $A=.1;$G=.2;$C=.3;$T=.7; my $score = (int rand(100000))/1 +00000; if( $score>=00000 && $score<=.00 +0333){ $nucleotide='A';} if( $score>=.00034 && $score<=.0 +0066){ $nucleotide='G';} if( $score>=.00067 && $score<=.0 +0099){ $nucleotide='C';} #if( $score>=.300 && $score<=1.0 +0){ DO NOTHING } } } elsif ($nucleotide eq 'C'){ # $T=.1;$G=.2;$A=.3;$C=.7; my $score = (int rand(100000))/1 +00000; if( $score>=00000 && $score<=.00 +033){ $nucleotide='T';} if( $score>=.00034 && $score<=.0 +0066){ $nucleotide='G';} if( $score>=.00067 && $score<=.0 +0099){ $nucleotide='A';} #if( $score>=.300 && $score<=1.0 +0){ #No mutation} } elsif ($nucleotide eq 'G'){ # $T=.1;$A=.2;$C=.3;$G=.7; my $score = (int rand(100000))/1 +00000; if( $score>=00000 && $score<=.00 +033){ $nucleotide='T';} if( $score>=.00034 && $score<=.0 +0066){ $nucleotide='A';} if( $score>=.00067 && $score<=.0 +0099){ $nucleotide='C';} #if( $score>=.300 && $score<=1.0 +0){ #No mutation} } $new_string= $new_string . $nucleotide; [download]	[reply] [d/l]
Re^3: Increasing the efficiency of a viral clonal expansion model by BrowserUk (Patriarch) on Jul 06, 2011 at 19:27 UTC
Why are you chomping individual characters? `my @nucleotide_array = split //, $seq_string; foreach my $nucleotide ( @nucleotide_array ) { chomp $nucleotide;` [download] It may do no harm to the algorithmm, but calling a function millions of times to do nothing is just silly. Especially when you are concerned with performance. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^4: Increasing the efficiency of a viral clonal expansion model by ZWcarp (Beadle) on Jul 06, 2011 at 19:34 UTC
Yeah I suppose you're right. I just type it by habit. Good point!	[reply]
Re^5: Increasing the efficiency of a viral clonal expansion model by BrowserUk (Patriarch) on Jul 06, 2011 at 19:42 UTC
Re^6: Increasing the efficiency of a viral clonal expansion model by ZWcarp (Beadle) on Jul 08, 2011 at 16:31 UTC
Some notes below your chosen depth have not been shown here
Re^3: Increasing the efficiency of a viral clonal expansion model by Anonymous Monk on Jul 06, 2011 at 20:14 UTC
Yes, that is what I mean. Consider this benchmark Read more... (2 kB) And the results Read more... (1445 Bytes) For the length of string in your program (1702) using assignment (SplitAssign ) is faster than using regex (SplitRegex)	[reply] [d/l] [select]