in reply to finding open reading frames

Hello ic23oluk,

Can we have a sample of your data (couple of lines) not the whole file so we can replicate the script and see how we can improve it?

Update: I see from your script $sequence = join ('', @seq); @seq = split ('', $sequence); it looks you are joining and then splitting the array. I am not sure what are you doing there this is why I need to see your input.

Update2: Another possible approach instead of the complex if conditions that you have you could use grep but I need to practice with your code to see what is the expected output as you said that the script works.

Update3: From a quick search online I think you should be looking into getting familiar with BioPerl.

Update4: I think you should be looking for something like this Bio::SeqIO? You can also find some relevant information from similar question Parse DNA fasta file.

Update5: I noticed that you have not provided us yet with a sample of input data and desired output, so I can not really try tackle and help you. But as I was thinking about your problem, I think that you might be interested in using the index function. It will return the sub string position withing your string of data. I case you want to check for multiple sub strings just use the last location and after as a position. Read more about it on the documentation.

Looking forward to your update.

Hope this helps.

Seeking for Perl wisdom...on the process of learning...not there...yet!

Replies are listed 'Best First'.
Re^2: finding open reading frames
by ic23oluk (Sexton) on Jun 06, 2017 at 07:32 UTC

    the input file in Fasta format has always a one line header followed by a sequence, i.e.:

    >NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA ATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTG TCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTA GATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATC TGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTT TCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTT CGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCT TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCG TCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTAC GGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTA CGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAG TGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACT AAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGC CCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCA TCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC TCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTA CTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAA

    the join and split operations provide an array in which every Nucleotide (letter) is an element (instead of every line).
    thanks

      Hello ic23oluk,

      I do not have that much time to modify and experiment with your code but a few minor modifications can make your script faster:

      #!usr/bin/perl use strict; use warnings; my @starts = (); while (<>) { chomp; next if $. < 2; # Skip first line my @seq = split '', $_; for (0..$#seq-5){ if ($seq[$_] eq 'A' && $seq[$_+1] eq 'T' && $seq[$_+2] eq 'G') { push (@starts, $_); for (my $j=$_+3; $j<=$#seq-2; $j=$j+3){ if ( ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq 'A +') || ($seq[$j] eq 'T' && $seq[$j+1] eq 'G' && $seq[$j+2] eq 'A +') || ($seq[$j] eq 'T' && $seq[$j+1] eq 'A' && $seq[$j+2] eq 'G +') ) { print "ORF: $_-", ($j+2), "\n"; last; ##lasts the j loop } } } } } continue { close ARGV if eof; # reset $. } __END__ $ perl bio_test.pl sequence.fa ORF: 16-75 ORF: 50-136 ORF: 133-357 ORF: 232-357 ORF: 252-290 ORF: 287-334 ORF: 305-334 ORF: 363-380 ORF: 394-450 ORF: 489-623 ORF: 575-724 ORF: 651-797 ORF: 689-724 ORF: 724-936 ORF: 854-859 ORF: 859-936 ORF: 954-959 ORF: 1051-1200 ORF: 1145-1255 ORF: 1228-1275 ORF: 1249-1275

      Update: Another possible way, this is how I would approach it. This is 3/4 of the solution, I do not want to give you the full solution as this looks like school work. It should not be really difficult to continue from here.

      As I said before use index for multiple occurrences in the string.

      #!usr/bin/perl use strict; use warnings; use Data::Dumper; my $substring = 'ATG'; while (<>) { chomp; next if $. < 2; # Skip first line my @seq = split ' ', $_; # print Dumper \@seq; # Remove hashtag and see how the input looks for (@seq) { my $found = index($_, $substring); while ($found != -1) { # check for multiple occurrences in the sam +e string print "Found $substring at $found\n"; # Do your calculations h +ere my $offset = $found + 1; # keep track of offset in your counte +r as you do on your primary script and should have it $found = index($_, $substring, $offset); } } } continue { close ARGV if eof; # reset $. } __END__ $ perl bio_test.pl sequence.fa Found 'ATG' at 16 Found 'ATG' at 50 Found 'ATG' at 62 Found 'ATG' at 19 Found 'ATG' at 39 Found 'ATG' at 3 Found 'ATG' at 21 Found 'ATG' at 8 Found 'ATG' at 39 Found 'ATG' at 63 Found 'ATG' at 7 Found 'ATG' at 12 Found 'ATG' at 50 Found 'ATG' at 14 Found 'ATG' at 2 Found 'ATG' at 7 Found 'ATG' at 31 Found 'ATG' at 20 Found 'ATG' at 50 Found 'ATG' at 57 Found 'ATG' at 9 Found 'ATG' at 21 Found 'ATG' at 42 Found 'ATG' at 65

      Notice that the index is going back to zero, keep track of it and it should give you the same results as your primary script. Let us know if there is something that is not clear.

      Update2: If the space in between the sequences in normal you do not need to split and you can loop through like this:

      #!usr/bin/perl use strict; use warnings; my $substring = 'ATG'; while (<>) { chomp; next if $. < 2; # Skip first line my $found = index($_, $substring); while ($found != -1) { # check for multiple occurrences in the sam +e string print "Found $substring at $found\n"; # Do your calculations here my $offset = $found + 1; # keep track of offset in your counter as + you do on your primary script and should have it $found = index($_, $substring, $offset); } } continue { close ARGV if eof; # reset $. } __END__ $ perl bio.pl sequence.fa Found ATG at 16 Found ATG at 50 Found ATG at 133 Found ATG at 232 Found ATG at 252 Found ATG at 287 Found ATG at 305 Found ATG at 363 Found ATG at 394 Found ATG at 489 Found ATG at 575 Found ATG at 651 Found ATG at 689 Found ATG at 724 Found ATG at 854 Found ATG at 859 Found ATG at 954 Found ATG at 1014 Found ATG at 1044 Found ATG at 1051 Found ATG at 1145 Found ATG at 1228 Found ATG at 1249 Found ATG at 1272

      Now you just need to apply an if condition and voila. :D

      Hope this helps.

      Seeking for Perl wisdom...on the process of learning...not there...yet!

        I think you left spaces in your file when you copy/pasted it. Note that the OP did not enclose file in code tags when it was posted.

        Your "improvements" do nothing to improve the speed of the script. The problem is that the nested loops can be very slow. This data cannot be processed one line at a time, because the interesting sequences may span lines.