Re^4: finding open reading frames

Re^5: finding open reading frames by thanos1983 (Parson) on Jun 06, 2017 at 16:17 UTC
Hello Anonymous Monk,

Maybe you are right; let's benchmark them and see. I know that a while loop usually outperforms a foreach loop. In my case it did not make much difference, but I was mainly aiming to remove unnecessary lines.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark::Forking qw( timethese cmpthese ); # UnixOS
    # use Benchmark qw(:all);                        # WindowsOS

    my @starts;

    sub previous {
        open (FASTA, "sequence.fa") || die "Cannot open file: $!.\n";
        chomp (my @seq = <FASTA>);
        close FASTA;
        shift @seq;                       # drop the FASTA header line
        my $sequence = join ('', @seq);
        @seq = split ('', $sequence);

        for (my $i=0; $i<=$#seq-5; $i++){ ## the -5 could be omitted
            # start codon: ATG
            # stop codon: TAA, TGA, TAG
            # multiple of 3 between start and stop
            if ($seq[$i] eq 'A' && $seq[$i+1] eq 'T' && $seq[$i+2] eq 'G') {
                push (@starts, $i);
                for (my $j=$i+3; $j<=$#seq-2; $j=$j+3){
                    if ( ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq 'A') ||
                         ($seq[$j] eq 'T' && $seq[$j+1] eq 'G' && $seq[$j+2] eq 'A') ||
                         ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq 'G') ) {
                        # print "ORF: $i-", ($j+2), "\n";
                        last; ## exits the j loop
                    }
                }
            }
        }
        return;
    }

    sub update {
        open my $fh, '<', "sequence.fa" or die "Could not open file: $!";
        while (defined( $_ = <$fh>)) {
            chomp;
            next if $. < 2; # Skip first line
            my @seq = split '', $_;
            for (0..$#seq-5){
                if ($seq[$_] eq 'A' && $seq[$_+1] eq 'T' && $seq[$_+2] eq 'G') {
                    push (@starts, $_);
                    for (my $j=$_+3; $j<=$#seq-2; $j=$j+3){
                        if ( ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq 'A') ||
                             ($seq[$j] eq 'T' && $seq[$j+1] eq 'G' && $seq[$j+2] eq 'A') ||
                             ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq 'G') ) {
                            # print "ORF: $_-", ($j+2), "\n";
                            last; ## exits the j loop
                        }
                    }
                }
            }
        }
        continue {
            # close ARGV if eof; # reset $.
        }
        close $fh or die "Could not close file: $!";
        return;
    }

    my $results = timethese(1000000, { Previous => \&previous,
                                       Updated  => \&update }, 'none');
    cmpthese( $results );

    __END__
    $ perl bio_test.pl
               Rate Previous  Updated
    Previous 2224/s       --     -14%
    Updated  2601/s      17%       --

See also my updated proposed solution; the second update should resolve the question and should also be faster.

Seeking for Perl wisdom...on the process of learning...not there...yet!
Re^6: finding open reading frames by Anonymous Monk on Jun 06, 2017 at 17:46 UTC
The O(n**2) nested-loop performance is going to kill you on some datasets. For a really pathological one, try:

`$sequence = 'ATG' x 1e6;`

I estimate that your code would take about a day and a half to process that. My code handles it in just over a second.

The human genome is around 3e9 base-pairs long. That's small enough to fit it all in memory, but large enough that you need to use efficient algorithms on it.
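To make the complexity point concrete, here is a minimal sketch of one way to get linear time. It is not the code referred to above (that code is not shown in this node), and the name `find_orfs` is a hypothetical helper invented for illustration. The idea is to scan the sequence once from right to left and remember, for each of the three reading frames, where the nearest stop codon begins, so every ATG finds its in-frame stop with a single lookup instead of an inner loop.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the thread): returns an
# array ref of [start, end] pairs, 0-based and inclusive, pairing each ATG
# with the first in-frame stop codon after it. One pass, so O(n) overall.
sub find_orfs {
    my ($seq) = @_;
    my @next_stop = (-1, -1, -1);   # start of nearest stop codon, per frame
    my @orfs;

    # Walk right to left so $next_stop[$frame] always holds the closest
    # stop codon to the right of position $i in the same frame.
    for (my $i = length($seq) - 3; $i >= 0; $i--) {
        my $codon = substr($seq, $i, 3);
        my $frame = $i % 3;
        if ($codon eq 'TAA' || $codon eq 'TGA' || $codon eq 'TAG') {
            $next_stop[$frame] = $i;
        }
        elsif ($codon eq 'ATG' && $next_stop[$frame] >= 0) {
            push @orfs, [ $i, $next_stop[$frame] + 2 ];
        }
    }
    return [ reverse @orfs ];       # report in left-to-right order
}

# Usage: same "ORF: start-end" format as the commented-out print above.
printf "ORF: %d-%d\n", @$_ for @{ find_orfs('ATGAAATAGATGTGA') };
```

On the pathological `'ATG' x 1e6` input this does one `substr` and a few comparisons per position and reports no ORFs, because no stop codon ever appears. Working on the string with `substr` also avoids splitting the sequence into a per-character array, which matters for memory at genome scale.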