Could you please elaborate on why adding BerkeleyDB to the mix would be worse than DB_File? Let me give a for-instance.
I have a bacterial genome of 5 million bases. I want to break this up into kmers of various sizes. I need to pay attention to both DNA strands, so I record the orientation in which I see each kmer. For each kmer I check whether it has already been seen. If so, I increment the number of times the kmer was seen and record where it was seen and in which orientation. Then I go through the kmers and record which ones have low sequence complexity: lots of repeats or other characteristics that might make identifying overlapping kmers difficult, for instance.
So now I have a number of different hashes, usually keyed on the kmer sequence, which I will use in the next part of my project. I will probably also sort all the kmers in my hash/database to speed up the search process in the next steps. I may even precompute a series of such kmer databases ahead of time for different sizes, simply to help with processing the data.
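To make the indexing step concrete, here is a minimal sketch of what I mean, written in Python purely for illustration. The function names, the choice of the lexicographically smaller strand as the key, and the 0.8 threshold for low complexity are all my own assumptions, not an established method. It also assumes upper-case A/C/G/T input and keeps everything in memory, which is exactly the part an on-disk hash would replace.

```python
from collections import defaultdict

# Translation table for complementing upper-case DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def index_kmers(genome, k):
    """Map each kmer to a list of (position, strand) hits.

    Both strands share one key: the lexicographically smaller of the
    kmer and its reverse complement (an assumed convention). The strand
    is "+" when the genome shows the key itself, "-" otherwise.
    """
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        kmer = genome[pos:pos + k]
        rc = reverse_complement(kmer)
        if kmer <= rc:
            index[kmer].append((pos, "+"))
        else:
            index[rc].append((pos, "-"))
    return index


def is_low_complexity(kmer, max_fraction=0.8):
    """Crude, hypothetical filter: one base dominates the kmer."""
    most_common = max(kmer.count(base) for base in "ACGT")
    return most_common / len(kmer) > max_fraction
```

The number of times a kmer was seen is simply the length of its hit list, so count, position and orientation all live under one key.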
Now take a set of 140 million kmers from a next-generation sequencing platform, where the population covers both strands of the DNA. The first question is how to quickly identify how many times each kmer from the reference genome was covered by a kmer from the next-gen sequencing data. Are all the reference kmers represented, or are some of them over- or under-represented?
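The coverage question could be sketched like this, again in Python and again only as an illustration under my own assumptions: the names are hypothetical, both strands are folded onto one key as the lexicographically smaller form, and the reads are streamed one kmer at a time so that only the reference kmers are held as keys.

```python
# Translation table for complementing upper-case DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a kmer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return kmer if kmer <= rc else rc


def reference_coverage(reference_kmers, read_kmers):
    """Count how often each reference kmer is hit by a read kmer.

    Returns (coverage, unmatched): coverage maps every canonical
    reference kmer to its hit count (0 if never seen), and unmatched
    is the number of read kmers that hit no reference kmer at all.
    """
    coverage = {canonical(kmer): 0 for kmer in reference_kmers}
    unmatched = 0
    for kmer in read_kmers:  # may be a generator over a large file
        key = canonical(kmer)
        if key in coverage:
            coverage[key] += 1
        else:
            unmatched += 1
    return coverage, unmatched
```

Reference kmers left at zero are the unrepresented ones, and the read kmers counted as unmatched are the candidates for base changes, deletions or insertions.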
Now we look for differences in the remaining kmers: do these represent base changes, base deletions or base insertions as compared to the reference genome? Again, you're doing a lot of hashing, counting and inferring based on this data.
Finally, you get to write this information out in a standard file format for display in a series of genome browsers.
My original thought had been that BerkeleyDB would be more robust for this type of large-scale data processing project. Can you provide more information on why DB_File is more effective for this approach than BerkeleyDB?
MadraghRua
yet another biologist hacking perl....