It depends - if you know that this code will only ever run with a file where you now the format exactly then that's certainly true. Otherwise, it's quite easy, especially for the beginner, to use assumptions that my not always be met, e.g. that the sequence is always in a single line. Then you run your code against a different file and fail to realise that you have read only fractions of the real sequence and all your results are invalid. I would say that it is better to use a tried and tested module like BIo::SeqIO unless you are dealing with Gigabyte-sized files and speed is crucial. | [reply] |
Then you run your code against a different file and fail to realise that you have read only fractions of the real sequence and all your results are invalid.
Actually, my code above handles FASTA files, including multi-line sequences--that's why I posted it--but I do take your point about incomplete solutions.
However, the trouble with starting newbies off with something as big and slow as Bio::*, is that it puts them into the mindset that they have to use it or similar. And that means that when they do encounter its limitations--which given the prevalence of huge datasets in genomics, will be sooner rather than later--they are locked into that mindset and then start looking for SMP or clustering solutions to try and get their work done in a timely fashion.
The last time I checked, reading a 200MB/1million sequence fasta file took 11 seconds with my code, and "over half an hour" using Bio::SeqIO. Scale that up to the 2GB+ files mentioned by the OP in that other thread, and you have 5 hours versus 1 minute! And that's just for reading the file let alone actually doing something useful with it.
If you compare the code above, with the code in that other thread, you'll see that this code is just a slight tweak of that other code. That tweak is to make this version store the sequences in a hash so that they can be iterated over per this OPs requirements.
Yes. That could also be done using Bio::SeqIO. But, every time you need to access a sequence, you have to go through (from memory) 3 or 4 layers of function/method call. Each layer of subroutine call effectively doubles the time taken:
@a = 1 .. 1e6;;
sub get{ return $a[ $_[0] ] };;
sub get1{ get( @_ ) };;
sub get2{ get1( @_ ) };;
cmpthese -1, { a=>q[ my $x; $x = $a[$_] for 0 .. $#a], b=>q[my $x; $x
+= get2($_) for 0 .. $#a ] };;
(warning: too few iterations for a reliable count)
Rate b a
b 1.29/s -- -75%
a 5.27/s 307% --
3 times slower, and they are very efficient, do nothing subroutines. The real ones are far, far slower. And remember, for the OPs task, he needs to process all against all, thus squaring the performance difference.
If newbies learn those few lines of code above up front--which are IMO easier to learn than installing Bio::* and then wading through the docs to work out the appropriate incantations--then they are better placed to see how easily it is adapted to other requirements.
The typical newbie is likely to try and address the speed2 problem by copying the sequences into a local array. But then they'll be strong everything twice--once in their code and once inside Bio::SeqIO--and will quickly hit the limits of memory.
Of course, there is probably a non-storing FASTA file reading solution--or 3--somewhere in the Bio-mass. But finding it (them) and deciding which one to use will take far longer than working out how to tweak the above code, and it still doesn't address the (method call x 3)2 problem.
The best solution would of course be to reduce the overhead of the Bio::SeqIO hierarchy, but that hasn't happened.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
The last time I checked ...
Last time you checked? You didn't check, you took the reported and extremely slow value ("half hour") at face value, instead of comparing your version with a bioperl-version.
In that same thread I showed the difference between bioperl or non-bioperl was much smaller than "half hour" vs "11 seconds".
Now to advocate avoiding a heavy framework for speed is fair enough. But you shouldn't repeat extreme slowness unchecked & unvalidated (it's a bit similar to this slanderous assertion about database - though, of course, that was worse! :) ).
| [reply] |
Yes, of course you are absolutely right about the overhead of Bio modules and I also tend to use a simple parser myself if I know that this is only ever going to read from FASTA files. I'm just saying that trying to implement a FASTA parser yourself can be more tricky for the beginner than one might expect and trip you in unexpected ways.
Anyway, I think in this case it is the comparison of sequences rather than the parsing that is the main bottleneck.
| [reply] |