Re^4: When the input file is huge !!!

The biggest challenge in sorting FASTA format files is that they consist of variable-length, multi-line records, which most system sorts cannot handle.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^5: When the input file is huge !!! by tilly (Archbishop) on Jan 07, 2009 at 03:01 UTC
Surely it should be easy to have one pass that turns it into a single-line format, and a second that turns it back into the original format? Do a sort between them and you are good to go. Or you can write the sort in Perl. :-)
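As a rough illustration of the two-pass idea above, here is a minimal sketch in Python (the same structure should carry over to Perl). The function names, the tab separator, and the 60-column wrap width are all assumptions made for this example, not anything from the thread. It also assumes headers contain no tab characters, and the in-memory `sorted()` call merely stands in for whatever sort runs between the two filters.

```python
import io

def fasta_to_lines(fasta_in, out, sep="\t"):
    """Pass 1: write one 'header<sep>sequence' line per FASTA record."""
    header, chunks = None, []
    for line in fasta_in:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                out.write(header + sep + "".join(chunks) + "\n")
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)  # sequence data may span many lines
    if header is not None:
        out.write(header + sep + "".join(chunks) + "\n")

def lines_to_fasta(lines_in, out, sep="\t", width=60):
    """Pass 2: turn 'header<sep>sequence' lines back into wrapped FASTA."""
    for line in lines_in:
        header, _, seq = line.rstrip("\n").partition(sep)
        out.write(">" + header + "\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")

def sort_fasta_text(text, width=60):
    """Demo only: flatten, sort by header in memory, then unflatten."""
    flat = io.StringIO()
    fasta_to_lines(io.StringIO(text), flat)
    ordered = sorted(flat.getvalue().splitlines(keepends=True))
    out = io.StringIO()
    lines_to_fasta(ordered, out, width=width)
    return out.getvalue()
```

Note that the round trip only reproduces the original file byte for byte if the original sequences were wrapped at the same width used in the second pass.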
Re^6: When the input file is huge !!! by BrowserUk (Patriarch) on Jan 07, 2009 at 04:03 UTC
Writing pre- and post-filters that convert between FASTA and single-line records isn't hard, and can be relatively fast, so long as you don't use Bio::*. The problem then is that some of the sequences can be so long that some system sort utilities cannot handle the line length. Sad but true. Doing a sort in Perl--pure Perl--that goes beyond a few tens of millions of records is a complete waste of time. It requires so much memory per item that it almost always results in either swapping or 'Out of memory'.
Re^7: When the input file is huge !!! by tilly (Archbishop) on Jan 07, 2009 at 21:08 UTC
No problem, you just don't do it all in RAM. I have written on-disk merge sorts in Perl that worked quite well. I even had to write one once where the dataset I was dealing with was, uncompressed, larger than the hard drive of the machine I was working on! The only painful part was the "processing took several days" bit, but considering that it was for a one-time backfill, that was acceptable. (More painful was the process of iterating through and tracking down discrepancies with the ongoing job I was trying to backfill. Every bug found required redoing large portions of the load from scratch. That was a painful month.)
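For readers unfamiliar with the technique, an on-disk merge sort can be sketched roughly as follows. This is not tilly's code: it is a minimal Python illustration (the approach maps directly onto Perl), and the name `external_sort` and the `max_lines` parameter are made up for this example. It sorts fixed-size chunks in memory, writes each as a sorted run to a temporary file, then streams a k-way merge of the runs into the output.

```python
import heapq
import itertools
import os
import tempfile

def external_sort(in_path, out_path, max_lines=100_000):
    """Sort the lines of in_path into out_path via sorted runs on disk."""
    run_paths = []
    try:
        # Phase 1: read bounded chunks, sort each in RAM, spill to disk.
        with open(in_path) as src:
            while True:
                chunk = list(itertools.islice(src, max_lines))
                if not chunk:
                    break
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"  # keep merged lines from fusing
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".run", text=True)
                with os.fdopen(fd, "w") as run:
                    run.writelines(chunk)
                run_paths.append(path)
        # Phase 2: k-way merge; only one line per run is held in memory.
        runs = [open(p) for p in run_paths]
        try:
            with open(out_path, "w") as dst:
                dst.writelines(heapq.merge(*runs))
        finally:
            for r in runs:
                r.close()
    finally:
        for p in run_paths:
            os.remove(p)
```

This sketch opens every run at once, so a very large input with a small `max_lines` could hit the operating system's open-file limit; a fuller version would merge in several passes. It also does nothing about the larger-than-disk case described above, which would presumably need compressed runs.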