in reply to Re^3: Bioinformatic task
in thread Bioinformatic task

Then you run your code against a different file and fail to realise that you have read only fractions of the real sequence and all your results are invalid.

Actually, my code above handles FASTA files, including multi-line sequences--that's why I posted it--but I do take your point about incomplete solutions.

However, the trouble with starting newbies off with something as big and slow as Bio::*, is that it puts them into the mindset that they have to use it or similar. And that means that when they do encounter its limitations--which given the prevalence of huge datasets in genomics, will be sooner rather than later--they are locked into that mindset and then start looking for SMP or clustering solutions to try and get their work done in a timely fashion.

The last time I checked, reading a 200MB/1million sequence fasta file took 11 seconds with my code, and "over half an hour" using Bio::SeqIO. Scale that up to the 2GB+ files mentioned by the OP in that other thread, and you have 5 hours versus 1 minute! And that's just for reading the file let alone actually doing something useful with it.

If you compare the code above, with the code in that other thread, you'll see that this code is just a slight tweak of that other code. That tweak is to make this version store the sequences in a hash so that they can be iterated over per this OPs requirements.

Yes. That could also be done using Bio::SeqIO. But, every time you need to access a sequence, you have to go through (from memory) 3 or 4 layers of function/method call. Each layer of subroutine call effectively doubles the time taken:

@a = 1 .. 1e6;; sub get{ return $a[ $_[0] ] };; sub get1{ get( @_ ) };; sub get2{ get1( @_ ) };; cmpthese -1, { a=>q[ my $x; $x = $a[$_] for 0 .. $#a], b=>q[my $x; $x += get2($_) for 0 .. $#a ] };; (warning: too few iterations for a reliable count) Rate b a b 1.29/s -- -75% a 5.27/s 307% --

3 times slower, and they are very efficient, do nothing subroutines. The real ones are far, far slower. And remember, for the OPs task, he needs to process all against all, thus squaring the performance difference.

If newbies learn those few lines of code above up front--which are IMO easier to learn than installing Bio::* and then wading through the docs to work out the appropriate incantations--then they are better placed to see how easily it is adapted to other requirements.

The typical newbie is likely to try and address the speed2 problem by copying the sequences into a local array. But then they'll be strong everything twice--once in their code and once inside Bio::SeqIO--and will quickly hit the limits of memory.

Of course, there is probably a non-storing FASTA file reading solution--or 3--somewhere in the Bio-mass. But finding it (them) and deciding which one to use will take far longer than working out how to tweak the above code, and it still doesn't address the (method call x 3)2 problem.

The best solution would of course be to reduce the overhead of the Bio::SeqIO hierarchy, but that hasn't happened.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Replies are listed 'Best First'.
Re^5: Bioinformatic task
by erix (Prior) on Nov 08, 2010 at 13:57 UTC
    The last time I checked ...

    Last time you checked? You didn't check, you took the reported and extremely slow value ("half hour") at face value, instead of comparing your version with a bioperl-version.

    In that same thread I showed the difference between bioperl or non-bioperl was much smaller than "half hour" vs "11 seconds".

    Now to advocate avoiding a heavy framework for speed is fair enough. But you shouldn't repeat extreme slowness unchecked & unvalidated (it's a bit similar to this slanderous assertion about database - though, of course, that was worse! :) ).

      True. I took the OP at face value. Just as I did with you and your timings in the db/sort case.

      However, after the I first became aware of the performance problem, I did eventually get (parts of) BioPerl installed on my own (old) machine and verified that the problem existed. I also tried to hunt down the cause.

      One possibility that was muted for this was the use of $`, $& or $'. Although I suggested to the OP that he try Devel::SawAmpersand, there was no follow up, so I hacked my way to installing Bio::SeqIO::fasta and all the required support packages, manually. (The full package has never installed for me.) I didn't detect that particular problem, but saw the machinations the module went through (back then; I see quite a lot of it is now commented out), to read a text file, and attributed the (at that point, quite extreme) slowness to that. I also attempted to track the module changes for a while, but that went out the window when I got my current machine.

      Long story short, when the OP reports "half an hour" for 200MB, that fit with my prior experiences, and I took him at his word. As your reply in that thread was not to me, I probably never saw it. So, if you are responsible for the performance improvements in the module, I congratulate you.

      All of that said, performance was only one of the issues I raised above.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re^5: Bioinformatic task
by tospo (Hermit) on Nov 08, 2010 at 12:20 UTC
    Yes, of course you are absolutely right about the overhead of Bio modules and I also tend to use a simple parser myself if I know that this is only ever going to read from FASTA files. I'm just saying that trying to implement a FASTA parser yourself can be more tricky for the beginner than one might expect and trip you in unexpected ways. Anyway, I think in this case it is the comparison of sequences rather than the parsing that is the main bottleneck.
      in this case it is the comparison of sequences rather than the parsing that is the main bottleneck.

      Agreed, but that's no reason to add an unnecessary extra half an hour to the process :)


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        unless you want to go and get a coffee... :-)