
The biggest challenge in sorting FASTA format files is that they consist of variable-length, multi-line records, which most system sorts cannot handle.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^5: When the input file is huge !!!
by tilly (Archbishop) on Jan 07, 2009 at 03:01 UTC
    Surely it should be easy to have one pass that turns it into a single-line format, and a second that turns it back into the original format? Do a sort between them and you are good to go.

    Or you can write the sort in Perl. :-)

      Writing pre- and post-filters that convert between FASTA and single-line records isn't hard, and it can be relatively fast, so long as you don't use Bio::*.
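
      A rough sketch of what such a pair of filters might look like follows. The sub names fasta_to_lines and lines_to_fasta are made up for illustration, and the code assumes that headers contain no tab characters, so that a tab can separate the header from the joined sequence.

```perl
use strict;
use warnings;

# Hypothetical pre-filter: read FASTA from $in and write one record per
# line to $out as "header<TAB>sequence".
# Assumption: headers never contain a tab character.
sub fasta_to_lines {
    my ($in, $out) = @_;
    my $seen_header = 0;
    while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;            # strip the line ending
        next unless length $line;        # skip blank lines
        if ($line =~ /^>/) {
            print {$out} "\n" if $seen_header;   # terminate previous record
            print {$out} $line, "\t";
            $seen_header = 1;
        }
        elsif ($seen_header) {
            print {$out} $line;          # append sequence data to the record
        }
    }
    print {$out} "\n" if $seen_header;   # terminate the last record
}

# Hypothetical post-filter: turn "header<TAB>sequence" lines back into
# FASTA, wrapping the sequence at $width columns (60 if not given).
sub lines_to_fasta {
    my ($in, $out, $width) = @_;
    $width ||= 60;
    while (my $line = <$in>) {
        chomp $line;
        my ($header, $seq) = split /\t/, $line, 2;
        print {$out} $header, "\n";
        next unless defined $seq && length $seq;
        print {$out} $1, "\n" while $seq =~ /(.{1,$width})/g;
    }
}
```

      Called as fasta_to_lines(\*STDIN, \*STDOUT) in one script and lines_to_fasta(\*STDIN, \*STDOUT) in another, these could sit on either side of a line-oriented sort. Note that the original wrap width is not recorded, so the output matches the input layout only if $width happens to equal the width the file was written with.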

      The problem then is that some of the sequences can be so long that some system sort utilities cannot handle the line length. Sad but true.

      Doing a sort in Perl--pure Perl--that goes beyond a few tens of millions of records is a complete waste of time. It requires so much memory per item that it almost always results in either swapping or 'Out of memory'.


        No problem, you just don't do it all in RAM. I have written on-disk merge sorts in Perl that worked quite well. I even had to write one once where the dataset I was dealing with was, uncompressed, larger than the hard drive of the machine I was working on! The only painful part was the "processing took several days" bit, but considering that it was for a one-time backfill, that was acceptable. (More painful was the process of iterating through and tracking down discrepancies with the ongoing job I was trying to backfill. Every bug found required redoing large portions of the load from scratch. That was a painful month.)
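
        For readers who have not seen one, here is a minimal sketch of the general on-disk merge sort idea. It is not the code described above: the sub name external_sort and the $max_lines parameter are invented for illustration. It assumes one record per line and plain string comparison of whole lines.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: sort the lines read from $in and write them to
# $out, keeping at most $max_lines lines in memory at any time.
sub external_sort {
    my ($in, $out, $max_lines) = @_;
    $max_lines ||= 1_000_000;    # assumed default; tune to available RAM

    # Pass 1: cut the input into sorted runs, each spilled to a temp file.
    my (@runs, @chunk);
    my $flush = sub {
        return unless @chunk;
        my $fh = tempfile();     # scalar context: anonymous temp file handle
        my @sorted = sort @chunk;
        print {$fh} @sorted;
        seek($fh, 0, 0) or die "seek failed: $!";   # rewind for reading
        push @runs, $fh;
        @chunk = ();
    };
    while (my $line = <$in>) {
        $line .= "\n" unless $line =~ /\n\z/;   # last line may lack a newline
        push @chunk, $line;
        $flush->() if @chunk >= $max_lines;
    }
    $flush->();

    # Pass 2: k-way merge. Hold one pending line per run and repeatedly
    # emit the smallest, then refill from the run it came from.
    my @heads = map { scalar readline($_) } @runs;
    while (1) {
        my $min;
        for my $i (0 .. $#heads) {
            next unless defined $heads[$i];     # this run is exhausted
            $min = $i if !defined $min || $heads[$i] lt $heads[$min];
        }
        last unless defined $min;               # all runs exhausted
        print {$out} $heads[$min];
        $heads[$min] = readline($runs[$min]);
    }
}
```

        The linear scan over the run heads is only sensible for a modest number of runs. With many runs, a heap, or several merge passes that each combine a limited number of files, would be the usual refinement, and the number of simultaneously open temp files is bounded by the operating system's file handle limit.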