Writing pre- and post-filters that convert from/to FASTA/single-line records isn't hard, and can be relatively fast, so long as you don't use Bio::*.
The problem then is that some sequences can be so long that some system sort utilities cannot handle the line length. Sad but true.
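For illustration, a minimal pair of such filters might look something like the following. This is a sketch, not battle-tested code: it assumes a tab never occurs in the headers or sequence data, and the "header<TAB>sequence" layout is just one possible choice.

    #!/usr/bin/perl
    # Pre-filter (sketch): flatten FASTA into one "header<TAB>sequence"
    # record per line so a line-oriented sort can operate on it.
    use strict;
    use warnings;

    my ( $header, $seq );
    while ( my $line = <STDIN> ) {
        chomp $line;
        if ( $line =~ /^>/ ) {
            print "$header\t$seq\n" if defined $header;
            ( $header, $seq ) = ( $line, '' );
        }
        else {
            $seq .= $line;    # rejoin wrapped sequence lines
        }
    }
    print "$header\t$seq\n" if defined $header;

    #!/usr/bin/perl
    # Post-filter (sketch): restore FASTA from the flattened form,
    # re-wrapping the sequence at 60 columns.
    use strict;
    use warnings;

    while ( my $line = <STDIN> ) {
        chomp $line;
        my ( $header, $seq ) = split /\t/, $line, 2;
        print "$header\n";
        print "$1\n" while $seq =~ /(.{1,60})/g;
    }

Of course, the flattened lines are exactly what trips up sort utilities with fixed line-length limits, so check your platform's sort before relying on this.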
Doing a sort in Perl--pure Perl--that goes beyond a few tens of millions of records is a complete waste of time. It requires so much memory per item that it almost always results in either swapping or 'Out of memory'.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
No problem; you just don't do it all in RAM. I have written on-disk merge sorts in Perl that worked quite well. I even had to write one where the dataset I was dealing with was, uncompressed, larger than the hard drive of the machine I was working on! The only painful part was the "processing took several days" bit, but considering that it was for a one-time backfill, that was acceptable. (More painful was the process of iterating through and tracking down discrepancies with the ongoing job I was trying to backfill: every bug found required redoing large portions of the load from scratch. That was a painful month.)
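In case it helps anyone, the shape of such an on-disk merge sort is roughly the following. A sketch, not the code I actually used: the chunk size is a made-up number to tune against your RAM, and File::Temp handles the spill files.

    #!/usr/bin/perl
    # External merge sort (sketch): sort more lines than fit in RAM.
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Phase 1: read stdin in RAM-sized chunks, sort each chunk, and
    # spill it to its own temporary file (a "run").
    my $chunk = 1_000_000;    # lines per run; tune to available memory
    my ( @runs, @buffer );
    while ( my $line = <STDIN> ) {
        push @buffer, $line;
        if ( @buffer >= $chunk ) {
            push @runs, spill( \@buffer );
            @buffer = ();
        }
    }
    push @runs, spill( \@buffer ) if @buffer;

    # Phase 2: k-way merge; repeatedly emit the smallest head line.
    my @heads = map { scalar readline $_ } @runs;
    while ( grep { defined } @heads ) {
        my $min;
        for my $i ( 0 .. $#heads ) {
            next unless defined $heads[$i];
            $min = $i if !defined $min or $heads[$i] lt $heads[$min];
        }
        print $heads[$min];
        $heads[$min] = readline $runs[$min];    # refill from that run
    }

    sub spill {
        my ($buf) = @_;
        my $fh = tempfile();    # auto-deleted when closed
        print {$fh} sort @$buf;
        seek $fh, 0, 0;         # rewind so the merge can read it
        return $fh;
    }

With hundreds of runs the linear scan across the heads gets slow; a heap, or merging the runs in pairs, is the usual fix. But even the naive version beats watching a pure in-RAM sort swap itself to death.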