in reply to Re^2: When the input file is huge !!!
in thread When the input file is huge !!!

FYI a simple merge-sort will take log(n)/log(2) (that is, log2(n)) passes through the data. You are describing a somewhat more intelligent merge that combines several chunks per pass and therefore takes fewer passes. Plus you are eliminating most of the passes up front by building the initial chunks with a different, in-memory sort.
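For concreteness, here is a small sketch of that pass-count arithmetic for a k-way merge (my own illustration, not code from this thread; merge_passes is a made-up name):

    # Illustration only: estimate how many merge passes an external sort
    # needs, given the number of initial sorted chunks and the merge
    # fan-in (how many chunks get merged together in each pass).
    use strict;
    use warnings;
    use POSIX qw(ceil);

    sub merge_passes {
        my ($chunks, $fan_in) = @_;
        return 0 if $chunks <= 1;
        return ceil( log($chunks) / log($fan_in) );
    }

    printf "2-way merge of 16 chunks: %d passes\n", merge_passes(16, 2);   # 4
    printf "3-way merge of 16 chunks: %d passes\n", merge_passes(16, 3);   # 3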

The merge is likely somewhat slower than you describe. The problem is how many parallel read/write streams the drive can handle while still streaming rather than seeking. My memory says that the limit is 4, which means that you can at most read 3 chunks and write 1. If you try to do more, then your disk starts to thrash. If that is right, then it can handle 16 chunks in 3 merge passes through the dataset, where each pass has to both read and write all of the data. That is in addition to the original pass that sorts the data into chunks. That means the dataset has to be streamed 8 times (4 times from disk, 4 times to it). If streaming the dataset takes 1 minute in each direction (which is what the OP said about his dataset), that would be about 8 minutes. That is substantially faster than what I said, but I like to be conservative when I tell people that things will be slow, because I don't want them to give up prematurely if it is slower than they expect.
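As a back-of-the-envelope check of that estimate, using the same figures (my arithmetic, not from the thread):

    # Illustration only: reproduce the 8-minute estimate above
    # (16 chunks, 3-way merge, about 1 minute to stream the dataset once).
    use strict;
    use warnings;

    my $merge_passes = 3;                    # ceil(log(16)/log(3))
    my $total_passes = 1 + $merge_passes;    # initial chunk-sorting pass + merges
    my $streamings   = 2 * $total_passes;    # each pass reads and writes everything
    my $minutes      = $streamings * 1;      # roughly 1 minute per streaming
    print "$total_passes passes, $streamings streamings, about $minutes minutes\n";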

Also note that a pure Perl implementation will probably have trouble sorting 500 MB chunks of data in RAM because of the memory overhead of Perl's internal data structures. However, if you are reading from and writing to the filesystem, you can wind up creating, writing, reading and deleting a file with all operations happening in RAM (in the operating system's filesystem cache) without anything ever getting flushed to disk. Used intelligently, this fact can greatly increase performance because most passes never touch the disk. This can make a fairly naive merge-sort perform reasonably close to a much more optimized one.
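To make that concrete, here is a minimal sketch of such a chunked sort-then-merge in pure Perl (my own illustration, not code from this thread; the sub names are made up). It sorts fixed-size batches of lines in RAM into temporary files, then does a simple k-way merge; for modest datasets those temp files may never leave the filesystem cache:

    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Phase 1: read fixed-size chunks of lines, sort each chunk in RAM,
    # and write it out to its own temporary file.
    sub sort_into_chunks {
        my ($in_fh, $lines_per_chunk) = @_;
        my @chunk_fhs;
        while (!eof($in_fh)) {
            my @lines;
            while (@lines < $lines_per_chunk and defined(my $line = <$in_fh>)) {
                push @lines, $line;
            }
            my $fh = tempfile();               # removed automatically at exit
            print {$fh} sort @lines;
            seek $fh, 0, 0;
            push @chunk_fhs, $fh;
        }
        return @chunk_fhs;
    }

    # Phase 2: k-way merge of the sorted chunk files. A linear scan for the
    # minimum is fine for a handful of chunks; use a heap if there are many.
    sub merge_chunks {
        my ($out_fh, @fhs) = @_;
        my @current = map { scalar readline($_) } @fhs;
        while (grep { defined } @current) {
            my $min;
            for my $i (0 .. $#current) {
                next unless defined $current[$i];
                $min = $i if !defined $min or $current[$i] lt $current[$min];
            }
            print {$out_fh} $current[$min];
            $current[$min] = readline($fhs[$min]);
        }
    }

    my @chunks = sort_into_chunks(\*STDIN, 1_000_000);
    merge_chunks(\*STDOUT, @chunks);

Whether those temporary files ever hit the platters then depends mostly on how much free RAM the OS has left over for its cache.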

Re^4: When the input file is huge !!!
by BrowserUk (Patriarch) on Jan 06, 2009 at 21:46 UTC
      Surely it should be easy to have one pass that turns it into a single-line format, and a second that turns it back into the original format? Do a sort between them and you are good to go.

      Or you can write the sort in Perl. :-)

        Writing pre and post filters that convert from/to FASTA/single-line records isn't hard, and can be relatively fast, so long as you don't use Bio::*.
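        For illustration, a pair of such filters might look roughly like this (my own sketch, not BrowserUk's code; the script names flatten.pl and unflatten.pl are made up, and headers are assumed not to contain tabs):

            # --- flatten.pl (hypothetical): one FASTA record per output line ---
            use strict;
            use warnings;

            my ($header, $seq) = (undef, '');
            while (my $line = <STDIN>) {
                chomp $line;
                if ($line =~ /^>/) {                     # start of a new record
                    print "$header\t$seq\n" if defined $header;
                    ($header, $seq) = ($line, '');
                }
                else {
                    $seq .= $line;                       # accumulate the sequence
                }
            }
            print "$header\t$seq\n" if defined $header;  # flush the last record

            # --- unflatten.pl (hypothetical): back to wrapped FASTA ---
            use strict;
            use warnings;

            while (my $line = <STDIN>) {
                chomp $line;
                my ($header, $seq) = split /\t/, $line, 2;
                print "$header\n";
                print "$_\n" for unpack('(A70)*', $seq); # re-wrap at 70 columns
            }

        The whole job is then roughly flatten.pl < in.fa | sort | unflatten.pl > out.fa, which is exactly where the line-length problem below comes in.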

        The problem then is that some of the sequences can be so long that some system sort utilities cannot handle the line length. Sad but true.

        Doing a sort in Perl--pure Perl--that goes beyond a few tens of millions of records is a complete waste of time. It requires so much memory per item that it almost always results in either swapping or an 'Out of memory' error.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.