ic23oluk has asked for the wisdom of the Perl Monks concerning the following question:
Dear PerlMonks, I try to find all possible open reading frames in a sequence and print them out. I have already written a working program, but I want an alternative that is more efficient (I currently have two loops). The idea is to store the position of starts, store the position of the stops and then look for pairs that are in frame. The latter point is what makes it difficult to me. How can I look for in-frame pairs? here's the code for my first solution that works:
my $sequence = "sequence.fa"; my @starts; open (FASTA, "$sequence") || die "Cannot open $sequence: $!.\n"; chomp (my @seq = <FASTA>); close FASTA; shift @seq; $sequence = join ('', @seq); @seq = split ('', $sequence); for (my $i=0; $i<=$#seq-5; $i++){ ## -5 könnte man weglassen # start codon: ATG # stopp codon: TAG, TAA, TGA # multiple of 3 between start and stop if ($seq[$i] eq 'A' && $seq[$i+1] eq 'T' && $seq[$i+2] eq 'G') +{ push (@starts, $i); for (my $j=$i+3; $j<=$#seq-2; $j=$j+3){ if (($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2 +] eq 'A') || ($seq[$j] eq 'T' && $seq[$j+1] eq 'G' && $seq[$j+2] eq + 'A') || ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq + 'G')){ print "ORF: $i-", ($j+2), "\n"; last; ##lasts the j loop } } } }
I would be very pleased if anyone wrote an alternative algorithm according to my aforementioned demands. Thanks in advance!
|
---|
Replies are listed 'Best First'. | |
---|---|
Re: finding open reading frames
by tybalt89 (Monsignor) on Jun 05, 2017 at 15:16 UTC | |
Lacking a test case, I'm guessing a little...
I'm also printing a little extra for checking purposes. | [reply] [d/l] |
Re: finding open reading frames
by thanos1983 (Parson) on Jun 05, 2017 at 14:53 UTC | |
Hello ic23oluk, Can we have a sample of your data (couple of lines) not the whole file so we can replicate the script and see how we can improve it? Update: I see from your script $sequence = join ('', @seq); @seq = split ('', $sequence); it looks you are joining and then splitting the array. I am not sure what are you doing there this is why I need to see your input. Update2: Another possible approach instead of the complex if conditions that you have you could use grep but I need to practice with your code to see what is the expected output as you said that the script works. Update3: From a quick search online I think you should be looking into getting familiar with BioPerl. Update4: I think you should be looking for something like this Bio::SeqIO? You can also find some relevant information from similar question Parse DNA fasta file. Update5: I noticed that you have not provided us yet with a sample of input data and desired output, so I can not really try tackle and help you. But as I was thinking about your problem, I think that you might be interested in using the index function. It will return the sub string position withing your string of data. I case you want to check for multiple sub strings just use the last location and after as a position. Read more about it on the documentation. Looking forward to your update. Hope this helps.
Seeking for Perl wisdom...on the process of learning...not there...yet!
| [reply] [d/l] [select] |
by ic23oluk (Sexton) on Jun 06, 2017 at 07:32 UTC | |
the input file in Fasta format has always a one line header followed by a sequence, i.e.: >NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA ATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTG TCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTA GATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATC TGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTT TCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTT CGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCT TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCG TCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTAC GGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTA CGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAG TGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACT AAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGC CCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCA TCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC TCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTA CTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAA the join and split operations provide an array in which every Nucleotide (letter) is an element (instead of every line).
| [reply] |
by thanos1983 (Parson) on Jun 06, 2017 at 11:45 UTC | |
Hello ic23oluk, I do not have that much time to modify and experiment with your code but a few minor modifications can make your script faster:
Update: Another possible way, this is how I would approach it. This is 3/4 of the solution, I do not want to give you the full solution as this looks like school work. It should not be really difficult to continue from here. As I said before use index for multiple occurrences in the string.
Notice that the index is going back to zero, keep track of it and it should give you the same results as your primary script. Let us know if there is something that is not clear. Update2: If the space in between the sequences in normal you do not need to split and you can loop through like this:
Now you just need to apply an if condition and voila. :D Hope this helps.
Seeking for Perl wisdom...on the process of learning...not there...yet!
| [reply] [d/l] [select] |
by tybalt89 (Monsignor) on Jun 06, 2017 at 14:48 UTC | |
by Anonymous Monk on Jun 06, 2017 at 14:18 UTC | |
by thanos1983 (Parson) on Jun 06, 2017 at 16:17 UTC | |
| |
Re: finding open reading frames
by Anonymous Monk on Jun 05, 2017 at 15:20 UTC | |
Output: Explanation of the output:
| [reply] [d/l] [select] |
Re: finding open reading frames
by glasswalk3r (Friar) on Jun 06, 2017 at 13:57 UTC | |
I'll not even try to understand what your code does (and other already provided some material for you on that), but here are some general guidelines:
Alceu Rodrigues de Freitas Junior --------------------------------- "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill | [reply] |
Re: finding open reading frames
by johngg (Canon) on Jun 06, 2017 at 18:30 UTC | |
A one loop solution that uses regexen to find the positions of all starters and stoppers then, in a loop over starters, uses List::Util::first() to find the next stopper that fits the criteria.
The output:
I hope this is of interest. Cheers, JohnGG | [reply] [d/l] [select] |
by Anonymous Monk on Jun 06, 2017 at 18:40 UTC | |
| [reply] [d/l] |
Re: finding open reading frames
by thanos1983 (Parson) on Jun 06, 2017 at 23:58 UTC | |
Hello again ic23oluk, I spend some time and I got even more confused. I got different results from your scrip:
At this point I do not know if your script is giving you the correct output. I am not sure if mine also works perfectly but I tried to confirm that by using grep and the tags e.g. TAA to check if I should get an output as my script does and yours skips. For example:
Based on that I can tell that I am getting confident that your script is not calculating something correctly. Update: I took a look again on the bash search (remember to remove first line from file!):
The output of my script:
Update2: If you check also the ATG
Now check one by one the index and the final output of my script:
Observe that the script is searching after the ATG as you requested (check indexes). It looks correct to me. Take a look and give it a try to verify it or not. Hope this helps instead of making things more confusing.
Seeking for Perl wisdom...on the process of learning...not there...yet!
| [reply] [d/l] [select] |
Re: finding open reading frames
by thanos1983 (Parson) on Jun 07, 2017 at 09:27 UTC | |
Hello ic23oluk, Ok this is the final and correct solution all the rest of my solutions are wrong.
Why this is the correct one? Because:
As you can see from the sample above, we have 24 matches with the key word ATG. But why we get 23 results from the script above? Simply because there is nothing matching from the data file after the last ATG. Update2: Since you said you want maximum speed you can replace the foreach loop with a while loop.
Hope this helps and that you are still following your own question :D
Seeking for Perl wisdom...on the process of learning...not there...yet!
| [reply] [d/l] [select] |
by ic23oluk (Sexton) on Jun 08, 2017 at 12:16 UTC | |
first of all: thanks you spend that much effort on this. I need some time to go through this :D | [reply] |
Re: finding open reading frames
by Anonymous Monk on Jun 05, 2017 at 14:43 UTC | |
| [reply] |