in reply to Re^3: finding open reading frames
in thread finding open reading frames

Your "improvements" do nothing to improve the speed of the script. The problem is that the nested loops can be very slow. This data cannot be processed one line at a time, because the interesting sequences may span lines.
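To illustrate the last point, here is a minimal sketch with made-up data (the two lines below are hypothetical, not taken from the poster's sequence.fa). The start codon is split across the line break, so a per-line scan never sees it, while the joined sequence does:

```perl
use strict;
use warnings;

# Hypothetical FASTA body lines: "ATG" straddles the line break ("...AT" + "G...").
my @lines = ("CCCAT", "GAAATAGCC");

# Per-line view: count the lines that contain a start codon.
my $per_line = grep { /ATG/ } @lines;

# Whole-sequence view: join the lines first, then search.
my $joined = join '', @lines;
my $pos    = index($joined, 'ATG');

print "lines containing ATG: $per_line\n";
print "offset of ATG in joined sequence: $pos\n";
```

The per-line count is 0, yet the joined sequence has a start codon at offset 3, so the two approaches do not produce the same result.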

Re^5: finding open reading frames
by thanos1983 (Parson) on Jun 06, 2017 at 16:17 UTC

    Hello Anonymous Monk

    Maybe you are right. Let's benchmark them and see:

    I know that the while loop outperforms the foreach loop, but in my case it did not make much difference; I was mainly aiming to remove unnecessary lines.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark::Forking qw( timethese cmpthese );   # UnixOS
    # use Benchmark qw(:all);                          # WindowsOS

    my @starts;

    sub previous {
        open (FASTA, "sequence.fa") || die "Cannot open file: $!.\n";
        chomp (my @seq = <FASTA>);
        close FASTA;
        shift @seq;                         # drop the FASTA header line
        my $sequence = join ('', @seq);
        @seq = split ('', $sequence);

        for (my $i=0; $i<=$#seq-5; $i++){   ## the -5 could be omitted
            # start codon: ATG
            # stop codon: TAA, TGA, TAG
            # multiple of 3 between start and stop
            if ($seq[$i] eq 'A' && $seq[$i+1] eq 'T' && $seq[$i+2] eq 'G') {
                push (@starts, $i);
                for (my $j=$i+3; $j<=$#seq-2; $j=$j+3){
                    if (   ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq 'A')
                        || ($seq[$j] eq 'T' && $seq[$j+1] eq 'G' && $seq[$j+2] eq 'A')
                        || ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq 'G') ) {
                        # print "ORF: $i-", ($j+2), "\n";
                        last;   ## lasts the j loop
                    }
                }
            }
        }
        return;
    }

    sub update {
        open my $fh, "sequence.fa"
            or die "Could not open file: $!";
        while (defined( $_ = <$fh>)) {
            chomp;
            next if $. < 2;   # Skip first line
            my @seq = split '', $_;
            for (0..$#seq-5){
                if ($seq[$_] eq 'A' && $seq[$_+1] eq 'T' && $seq[$_+2] eq 'G') {
                    push (@starts, $_);
                    for (my $j=$_+3; $j<=$#seq-2; $j=$j+3){
                        if (   ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq 'A')
                            || ($seq[$j] eq 'T' && $seq[$j+1] eq 'G' && $seq[$j+2] eq 'A')
                            || ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq 'G') ) {
                            # print "ORF: $_-", ($j+2), "\n";
                            last;   ## lasts the j loop
                        }
                    }
                }
            }
        }
        continue {
            # close ARGV if eof;   # reset $.
        }
        close $fh
            or die "Could not close file: $!";
        return;
    }

    my $results = timethese(1000000, { Previous => \&previous,
                                       Updated  => \&update }, 'none');
    cmpthese( $results );

    __END__
    $ perl bio_test.pl
               Rate Previous  Updated
    Previous 2224/s       --     -14%
    Updated  2601/s      17%       --

    See also my updated proposed solution; the second update should resolve the question and should also be faster.

    Seeking for Perl wisdom...on the process of learning...not there...yet!
      The O(n**2) nested-loop performance is going to kill you on some datasets. For a really pathological one, try:
      $sequence = 'ATG' x 1e6;
      I estimate that your code would take about a day and a half to process that. My code handles it in just over a second.

      The human genome is around 3e9 base-pairs long. That's small enough to fit it all in memory, but large enough that you need to use efficient algorithms on it.
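The code referred to above is not shown in this post, so the following is only a minimal sketch of one way to avoid the quadratic scan, not necessarily the approach that was actually used. It assumes the whole sequence is already in a single string and that, as in the benchmarked subs, each ATG is paired with the first in-frame stop codon after it. The name `find_orfs` is hypothetical. The idea is to walk the string once from right to left, remembering for each of the three reading frames where the nearest stop codon to the right begins; every start codon can then be resolved in constant time.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [start, end] pairs (0-based, inclusive),
# one per ATG that has an in-frame stop codon somewhere to its right.
sub find_orfs {
    my ($seq) = @_;
    my %is_stop = map { $_ => 1 } qw(TAA TGA TAG);
    my @next_stop;   # per frame (0..2): start index of the nearest stop to the right
    my @orfs;

    # Single right-to-left pass: O(n) instead of O(n**2).
    for (my $i = length($seq) - 3; $i >= 0; $i--) {
        my $codon = substr($seq, $i, 3);
        my $frame = $i % 3;
        if ($is_stop{$codon}) {
            $next_stop[$frame] = $i;
        }
        elsif ($codon eq 'ATG' && defined $next_stop[$frame]) {
            push @orfs, [ $i, $next_stop[$frame] + 2 ];
        }
    }

    @orfs = reverse @orfs;   # report in left-to-right order
    return @orfs;
}

# Small usage example: one ORF, from the ATG at 0 to the end of the TAG at 6..8.
for my $orf (find_orfs('ATGAAATAG')) {
    print "ORF: $orf->[0]-$orf->[1]\n";
}
```

On the pathological input `'ATG' x 1e6`, the loop body runs once per position and reports no ORFs, because the reading frame that holds the start codons contains no stop codon; the nested-loop version instead rescans the rest of the sequence for every one of the million start codons.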