hello rnaeye (nice pun)

Check out the SAMtools project - I think it is over on SourceForge. Depending upon the aligner you are using, either there are tools to munge its output into a SAM or BAM file, or your aligner may already produce one. SAM is the generic alignment format for short reads - BAM is the binary equivalent. Once you have the file in SAM or BAM format, you can use SAMtools to sort the short reads and create an index of them. From there you can use the index to pull out chromosome-specific reads, etc. You'll then need something like a BAM->BED converter to use downstream tools or to display the reads in the genome browser of your choice.

I typically find that converting a SAM or read file into BAM reduces the file to about 10% of its text size. Of course you then need to be able to read the BAM file, and that is where libraries like Picard, Bio-SamTools (Perl), Pysam and cl-sam come in.
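For what it's worth, the convert / sort / index / extract steps might look roughly like the sketch below. It assumes a samtools 1.x binary on your PATH and a SAM file that includes its header. The function name, the file naming scheme and the `run` switch are made up for illustration. By default it only builds the command lines so you can inspect them before running anything.

```python
import subprocess

def samtools_pipeline(sam_path, prefix, chrom, run=False):
    """Build (and optionally run) samtools commands for one SAM file.

    Hypothetical helper: the output file names are an arbitrary choice.
    """
    bam = prefix + ".bam"
    sorted_bam = prefix + ".sorted.bam"
    chrom_bam = prefix + "." + chrom + ".bam"
    commands = [
        # SAM (text) -> BAM (binary)
        ["samtools", "view", "-b", "-o", bam, sam_path],
        # sort by coordinate, which indexing requires
        ["samtools", "sort", "-o", sorted_bam, bam],
        # writes the index next to the sorted BAM
        ["samtools", "index", sorted_bam],
        # use the index to pull out the reads for one chromosome
        ["samtools", "view", "-b", "-o", chrom_bam, sorted_bam, chrom],
    ]
    if run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failure
    return commands

if __name__ == "__main__":
    for cmd in samtools_pipeline("reads.sam", "reads", "chr1"):
        print(" ".join(cmd))
```

Pass `run=True` once the printed commands look right for your data.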

If you want to roll your own, follow the good monks' advice - split the file, parsing the aligned reads into a separate file per chromosome, and then sort each chromosome file on its own. You can concatenate it all back together at the end.
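A minimal sketch of that split / sort / concatenate idea, in Python rather than Perl. I don't know your exact file layout, so this assumes tab-delimited lines with the chromosome name in the first column and an integer position in the second. The function name and column defaults are hypothetical. It also assumes each single chromosome fits in memory and that chromosome names are safe to use as file names.

```python
import os
import tempfile

def sort_by_chromosome(in_path, out_path, chrom_col=0, pos_col=1, sep="\t"):
    """Split a large read file per chromosome, sort each part, concatenate."""
    tmp_dir = tempfile.mkdtemp(prefix="chrom_split_")
    handles = {}

    # Pass 1: stream the big file once, appending each line to the
    # temporary file for its chromosome (one open handle per chromosome).
    with open(in_path) as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            chrom = line.rstrip("\n").split(sep)[chrom_col]
            if chrom not in handles:
                part_path = os.path.join(tmp_dir, chrom + ".txt")
                handles[chrom] = open(part_path, "w")
            handles[chrom].write(line if line.endswith("\n") else line + "\n")
    for fh in handles.values():
        fh.close()

    # Pass 2: load one chromosome at a time, sort it numerically by
    # position, and append it to the final output.
    with open(out_path, "w") as out:
        for chrom in sorted(handles):  # plain string order: chr1, chr10, chr2
            part_path = os.path.join(tmp_dir, chrom + ".txt")
            with open(part_path) as fh:
                rows = fh.readlines()
            rows.sort(key=lambda r: int(r.rstrip("\n").split(sep)[pos_col]))
            out.writelines(rows)
            os.remove(part_path)
    os.rmdir(tmp_dir)
```

Note the chromosomes come out in plain string order, so chr10 lands before chr2. If you need natural order you would have to supply your own sort key for the chromosome names.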

Good luck!

MadraghRua
yet another biologist hacking perl....


In reply to Re: sorting very large text files by MadraghRua
in thread sorting very large text files by rnaeye
