Then you run your code against a different file and fail to realise that you have read only fractions of the real sequence and all your results are invalid.
Actually, my code above handles FASTA files, including multi-line sequences--that's why I posted it--but I do take your point about incomplete solutions.
However, the trouble with starting newbies off with something as big and slow as Bio::*, is that it puts them into the mindset that they have to use it or similar. And that means that when they do encounter its limitations--which given the prevalence of huge datasets in genomics, will be sooner rather than later--they are locked into that mindset and then start looking for SMP or clustering solutions to try and get their work done in a timely fashion.
The last time I checked, reading a 200MB/1million sequence fasta file took 11 seconds with my code, and "over half an hour" using Bio::SeqIO. Scale that up to the 2GB+ files mentioned by the OP in that other thread, and you have 5 hours versus 1 minute! And that's just for reading the file let alone actually doing something useful with it.
If you compare the code above, with the code in that other thread, you'll see that this code is just a slight tweak of that other code. That tweak is to make this version store the sequences in a hash so that they can be iterated over per this OPs requirements.
Yes. That could also be done using Bio::SeqIO. But, every time you need to access a sequence, you have to go through (from memory) 3 or 4 layers of function/method call. Each layer of subroutine call effectively doubles the time taken:
@a = 1 .. 1e6;;
sub get{ return $a[ $_[0] ] };;
sub get1{ get( @_ ) };;
sub get2{ get1( @_ ) };;
cmpthese -1, { a=>q[ my $x; $x = $a[$_] for 0 .. $#a], b=>q[my $x; $x
+= get2($_) for 0 .. $#a ] };;
(warning: too few iterations for a reliable count)
Rate b a
b 1.29/s -- -75%
a 5.27/s 307% --
3 times slower, and they are very efficient, do nothing subroutines. The real ones are far, far slower. And remember, for the OPs task, he needs to process all against all, thus squaring the performance difference.
If newbies learn those few lines of code above up front--which are IMO easier to learn than installing Bio::* and then wading through the docs to work out the appropriate incantations--then they are better placed to see how easily it is adapted to other requirements.
The typical newbie is likely to try and address the speed2 problem by copying the sequences into a local array. But then they'll be strong everything twice--once in their code and once inside Bio::SeqIO--and will quickly hit the limits of memory.
Of course, there is probably a non-storing FASTA file reading solution--or 3--somewhere in the Bio-mass. But finding it (them) and deciding which one to use will take far longer than working out how to tweak the above code, and it still doesn't address the (method call x 3)2 problem.
The best solution would of course be to reduce the overhead of the Bio::SeqIO hierarchy, but that hasn't happened.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
|