in reply to Re^5: When the input file is huge !!!
in thread When the input file is huge !!!

Writing pre- and post-filters that convert between FASTA and single-line records isn't hard, and can be relatively fast (so long as you don't use Bio::*).
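
A minimal sketch of what such filters might look like (my own illustration, not code from the thread; the names fasta2line and line2fasta are just placeholders), assuming each record is flattened to one tab-separated "header TAB sequence" line so a line-oriented sort can handle it:

    #!/usr/bin/perl
    # fasta2line: flatten each FASTA record to a single "header\tsequence" line.
    use strict;
    use warnings;

    my ( $header, $seq );
    while ( my $line = <STDIN> ) {
        chomp $line;
        if ( $line =~ /^>/ ) {
            print "$header\t$seq\n" if defined $header;
            ( $header, $seq ) = ( $line, '' );
        }
        else {
            $seq .= $line;
        }
    }
    print "$header\t$seq\n" if defined $header;

    #!/usr/bin/perl
    # line2fasta: restore FASTA from the flattened form, re-wrapping at 60 columns.
    use strict;
    use warnings;

    while ( my $line = <STDIN> ) {
        chomp $line;
        my ( $header, $seq ) = split /\t/, $line, 2;
        print "$header\n";
        print "$1\n" while $seq =~ /(.{1,60})/g;
    }

With those in place the system sort sits in the middle of a pipeline: fasta2line < in.fa | sort | line2fasta > out.fa.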

The problem then is that some of the sequences can be so long that some system sort utilities cannot handle the line length. Sad but true.

Doing a sort in Perl--pure Perl--that goes beyond a few tens of millions of records is a complete waste of time. Perl needs so much memory per item that it almost always results in either swapping or an 'Out of memory' error.
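
To put rough numbers on that, here is one way to measure the per-record overhead yourself (a sketch using the CPAN module Devel::Size, not something from the original post):

    #!/usr/bin/perl
    # Measure the approximate memory cost per record when a million flattened
    # records sit in a Perl array. The printed figure is typically several
    # times the raw string length, and it scales linearly with record count.
    use strict;
    use warnings;
    use Devel::Size qw( total_size );

    my @records = map { sprintf ">seq%08d\t%s", $_, 'ACGT' x 25 } 1 .. 1_000_000;
    printf "%.0f bytes per %d-byte record\n",
        total_size( \@records ) / @records,
        length $records[0];

Multiply the per-record figure by a few hundred million records and the swapping or 'Out of memory' outcome is easy to predict.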


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^7: When the input file is huge !!!
by tilly (Archbishop) on Jan 07, 2009 at 21:08 UTC
    No problem; you just don't do it all in RAM. I have written on-disk merge sorts in Perl that worked quite well. I even had to write one once where the dataset I was dealing with was, uncompressed, larger than the hard drive of the machine I was working on! The only painful part was the "processing took several days" bit, but considering that it was for a one-time backfill, that was acceptable. (More painful was the process of iterating through and tracking down discrepancies with the ongoing job I was trying to backfill. Every bug found required redoing large portions of the load from scratch. That was a painful month.)
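
    A minimal sketch of the shape such an on-disk merge sort can take (my illustration, not tilly's code; it assumes one record per line and plain lexical ordering): sort fixed-size chunks in memory, spill each sorted run to a temp file, then merge the runs.

        #!/usr/bin/perl
        # External merge sort: reads lines on STDIN, writes sorted lines to STDOUT,
        # never holding more than $CHUNK records in memory at once.
        use strict;
        use warnings;
        use File::Temp qw( tempfile );

        my $CHUNK = 1_000_000;    # records per in-memory chunk; tune to available RAM

        # Pass 1: sort chunks in memory and spill each to its own temp file ("run").
        my ( @runs, @buffer );
        while ( my $line = <STDIN> ) {
            push @buffer, $line;
            if ( @buffer >= $CHUNK ) {
                push @runs, spill( \@buffer );
                @buffer = ();
            }
        }
        push @runs, spill( \@buffer ) if @buffer;

        # Pass 2: k-way merge -- repeatedly emit the smallest head line among the runs.
        my @heads = map { scalar readline $_ } @runs;
        while ( grep { defined } @heads ) {
            my $min;
            for my $i ( 0 .. $#heads ) {
                next unless defined $heads[$i];
                $min = $i if !defined $min or $heads[$i] lt $heads[$min];
            }
            print $heads[$min];
            $heads[$min] = readline $runs[$min];
        }

        sub spill {
            my ($buffer) = @_;
            my $fh = tempfile( UNLINK => 1 );    # read-write handle, auto-deleted
            print {$fh} sort @$buffer;
            seek $fh, 0, 0;                      # rewind so the merge pass can read it
            return $fh;
        }

    With many runs, a heap keyed on the head lines beats the linear scan in the merge pass, but for a sketch the scan keeps the logic easy to follow.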