I have similar issues in tackling next-generation sequencing technology outputs. Typically we're looking at short sequence reads in the range of 35 or so characters. Depending upon the technology, they are either predominantly letter-based or number-based. As a learning project I've been looking at repeat sequences from the human genome and trying to come up with an indexed set - the idea being that I can simply remove these from the original reads and concentrate on working with the non-repeat sequence reads.

To date the best thing I've found is breaking the reads down based upon sequence complexity and producing several sorted indices - using BerkeleyDB, DB_File or my own creations. Perl's hashes can also be a useful tool here, with the caveat that a hash has no defined key order, so you have to sort the keys yourself (or use something like DB_File's BTREE format, which keeps its keys in sorted order). So far I've been finding preindexing is key, though I've yet to find a really satisfactory way to do it more efficiently in Perl.
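To make the "break down by complexity, then keep sorted indices" idea concrete, here is a minimal sketch. The complexity measure (number of distinct bases in a read) and the sub names are my own assumptions for illustration, not anything from a module; a real project would probably want a better measure and an on-disk index.

```perl
use strict;
use warnings;

# Hypothetical complexity measure: how many distinct bases a read contains.
sub complexity {
    my ($read) = @_;
    my %seen;
    $seen{$_}++ for split //, $read;
    return scalar keys %seen;
}

# Build one sorted array of repeat sequences per complexity class.
sub build_indices {
    my (@repeats) = @_;
    my %bucket;
    push @{ $bucket{ complexity($_) } }, $_ for @repeats;
    for my $c (keys %bucket) {
        $bucket{$c} = [ sort @{ $bucket{$c} } ];
    }
    return \%bucket;
}

# Binary search within the bucket that matches the read's complexity.
sub in_index {
    my ($index, $read) = @_;
    my $sorted = $index->{ complexity($read) } or return 0;
    my ($lo, $hi) = (0, $#$sorted);
    while ($lo <= $hi) {
        my $mid = int( ($lo + $hi) / 2 );
        my $cmp = $sorted->[$mid] cmp $read;
        return 1 if $cmp == 0;
        if ($cmp < 0) { $lo = $mid + 1 } else { $hi = $mid - 1 }
    }
    return 0;
}

# Made-up repeat sequences and reads, just to exercise the code.
my $index = build_indices(qw(AAAAAAAA ACACACAC ACGTACGT));
my @kept  = grep { !in_index($index, $_) } qw(ACGTTTGA AAAAAAAA GGCATCAG);
print "@kept\n";
```

The point of the buckets is only that each sorted index stays smaller, so each lookup touches less data.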

You might want to look into a new tool called Bowtie. It builds a Burrows-Wheeler index of the reference genome and then uses it for fast lookup when aligning short reads. It is reported to be really fast at aligning reads against genomic data.
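If the Burrows-Wheeler idea is new to you, the toy sketch below shows the general technique: build the transform of a reference, then find exact matches by "backward search" over it. This is not Bowtie's code, and the sub names and the naive suffix sort are my own choices for illustration; real tools use far more compact structures and handle mismatches.

```perl
use strict;
use warnings;

# Build suffix array, BWT and the C table for a small reference string.
sub build_index {
    my ($text) = @_;
    $text .= '$';    # sentinel; sorts before A, C, G and T
    my @sa  = sort { substr($text, $a) cmp substr($text, $b) } 0 .. length($text) - 1;
    my $bwt = join '', map { substr($text, $_ - 1, 1) } @sa;
    my (%freq, %C);
    $freq{$_}++ for split //, $bwt;
    my $total = 0;
    for my $c (sort keys %freq) {
        $C{$c} = $total;          # number of characters sorting before $c
        $total += $freq{$c};
    }
    return { bwt => $bwt, sa => \@sa, C => \%C };
}

# Number of occurrences of $c in the first $i characters of the BWT.
sub occ {
    my ($bwt, $c, $i) = @_;
    my $count = () = substr($bwt, 0, $i) =~ /\Q$c\E/g;
    return $count;
}

# Backward search: returns the 0-based start positions of exact matches.
sub find_positions {
    my ($idx, $pattern) = @_;
    my ($lo, $hi) = (0, length $idx->{bwt});    # half-open interval
    for my $c (reverse split //, $pattern) {
        return () unless exists $idx->{C}{$c};
        $lo = $idx->{C}{$c} + occ($idx->{bwt}, $c, $lo);
        $hi = $idx->{C}{$c} + occ($idx->{bwt}, $c, $hi);
        return () if $lo >= $hi;
    }
    return sort { $a <=> $b } @{ $idx->{sa} }[ $lo .. $hi - 1 ];
}

my $idx  = build_index('ACGTACGTTTACG');    # made-up reference
my @hits = find_positions($idx, 'ACG');
print "@hits\n";
```

The search cost depends on the length of the read rather than the length of the reference, which is where the speed comes from. The occ() here rescans the string on every call, so this version is only good for seeing how the pieces fit.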

Another alternative is to look at Genomatix GmbH - a bioinformatics company based in Germany. They also have a proprietary indexing scheme that permits fast sequence alignment. Unfortunately the algorithm is not published for this one, but their approach is to preprocess the genome of interest into kmers ranging from 8mers to million mers and provide their indexes with their software.
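Since their algorithm isn't published I can't show it, but the generic idea of preprocessing a sequence into a kmer index is easy to sketch in Perl. Everything below (the sub name, the tiny sequence, k = 4) is a made-up illustration of a plain in-memory kmer-to-positions hash, not a reconstruction of the Genomatix scheme.

```perl
use strict;
use warnings;

# Map every kmer in a sequence to the 0-based positions where it starts.
sub kmer_index {
    my ($seq, $k) = @_;
    my %index;
    for my $pos (0 .. length($seq) - $k) {
        push @{ $index{ substr($seq, $pos, $k) } }, $pos;
    }
    return \%index;
}

my $kmers = kmer_index('ACGTACGTTTACG', 4);

# Look up where a given 4mer occurs; an absent key means no occurrence.
my @where = @{ $kmers->{ACGT} || [] };
print "ACGT at: @where\n";
```

For a whole genome this hash would be far too large to hold in memory, which is where tied on-disk storage such as DB_File comes back in.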

A final suggestion is to check out Ewan Birney's Dynamite.

I've been finding that for smaller genome projects (< 5 reference sequence tags, each 35 bases in length) hashes in Perl work quite well. Keep in mind that Perl does not store hash keys in any particular order, so sort them explicitly if the order matters. If you have access to a server farm with clustering software like Gluster and load sharing, you could simply distribute the analysis over many nodes and run it in parallel.
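For the small case, the hash approach can be as simple as the sketch below. The tags and reads are invented for the example (and shortened to 8 bases to keep it readable). One caveat worth spelling out: the order in which keys come back from a hash isn't guaranteed, so the sketch sorts explicitly where order matters.

```perl
use strict;
use warnings;

# Hypothetical reference tags, held as hash keys for exact-match lookup.
my @tags   = qw(ACGTACGT TTTTACGA GGGCATCA);
my %is_tag = map { $_ => 1 } @tags;

# Keep only the reads that do not match any reference tag.
my @reads     = qw(ACGTACGT CCCCAAAA GGGCATCA);
my @unmatched = grep { !exists $is_tag{$_} } @reads;
print "unmatched: @unmatched\n";

# Hash key order is not defined, so sort when you need ordered output.
my @ordered = sort keys %is_tag;
print "tags in order: @ordered\n";
```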

So sorry - no specific module recommendations but perhaps looking at Bowtie or Genomatix might spark some ideas for you.

MadraghRua
yet another biologist hacking perl....


In reply to Re: fast+generic interface for range-based indexed retrieval by MadraghRua
in thread fast+generic interface for range-based indexed retrieval by jae_63
