in reply to sorting very large text files
Check out the Samtools project - I think it is hosted over on SourceForge. Depending on the aligner you are using, either it already produces a SAM or BAM file or there are tools to munge its output into one. SAM is the generic alignment format for short reads; BAM is its binary equivalent. Once you have the file in SAM or BAM format, you can use Samtools to sort the short reads and create an index. From there you can use the index to pull out chromosome-specific reads, etc. You'll then need something like a BAM->BED converter to use downstream tools or to display the reads in the genome browser of your choice. I typically find that converting a SAM or read file into BAM reduces the file to about 10% of the text size. Of course you then need to be able to read the BAM file, and that is where the downstream tools like Picard, Bio-SamTools (Perl), Pysam and cl-sam come in.
If you want to roll your own, follow the good monks' advice - split the file by parsing the aligned reads into a separate file per chromosome, then sort each chromosome file on its own. You can concatenate it all back together in the end.
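The split, sort and concatenate steps above could look roughly like the sketch below. It is only an illustration under assumptions that are not from the original question: the input is tab-delimited text with the chromosome name in the first column and a numeric position in the second, and each per-chromosome file fits in memory. The function name `sort_reads_by_chromosome` and its parameters are made up for this example.

```python
import os
import tempfile


def sort_reads_by_chromosome(in_path, out_path, chrom_col=0, pos_col=1, sep="\t"):
    """Split reads into per-chromosome files, sort each, then concatenate.

    Assumed (hypothetical) layout: one read per line, delimited by `sep`,
    chromosome name in column `chrom_col`, integer position in `pos_col`.
    """
    tmp_dir = tempfile.mkdtemp(prefix="chrom_split_")
    handles = {}

    # Pass 1: stream the big file once, writing each line to its chromosome's file.
    with open(in_path) as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            chrom = line.rstrip("\n").split(sep)[chrom_col]
            if chrom not in handles:
                part = os.path.join(tmp_dir, f"part_{len(handles)}.txt")
                handles[chrom] = open(part, "w")
            handles[chrom].write(line if line.endswith("\n") else line + "\n")
    for fh in handles.values():
        fh.close()

    # Pass 2: sort each (much smaller) chromosome file by position and append
    # it to the output. Chromosomes are emitted in plain string order, so
    # "chr10" comes before "chr2".
    with open(out_path, "w") as out:
        for chrom in sorted(handles):
            part = handles[chrom].name
            with open(part) as fh:
                rows = fh.readlines()
            rows.sort(key=lambda r: int(r.rstrip("\n").split(sep)[pos_col]))
            out.writelines(rows)
            os.remove(part)
    os.rmdir(tmp_dir)
```

Calling `sort_reads_by_chromosome("reads.txt", "reads.sorted.txt")` (file names are placeholders) would leave the sorted reads in the second file. This keeps one file handle open per chromosome, which is fine for a couple of dozen chromosomes but may hit the open-file limit on an assembly with thousands of scaffolds.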
Good luck!
MadraghRua
yet another biologist hacking perl....