fast+generic interface for range-based indexed retrieval

jae_63 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: fast+generic interface for range-based indexed retrieval by perrin (Chancellor) on Dec 10, 2008 at 18:50 UTC
BerkeleyDB cursors handle this exact issue. You just need to use the BTree option so that the records will be sorted.	[reply]
Re^2: fast+generic interface for range-based indexed retrieval by Zarchne (Novice) on Dec 12, 2008 at 06:12 UTC
Aha, with DB_SET_RANGE. Otherwise, I was wondering if storing a bitmap of the keys used would be useful. Probably not.	[reply]
Re^3: fast+generic interface for range-based indexed retrieval by perrin (Chancellor) on Dec 12, 2008 at 08:18 UTC
No, you don't need DB_SET_RANGE. You get the cursor on the key you want with DB_GET and then keep calling DB_NEXT until you're done.	[reply]
Re^2: fast+generic interface for range-based indexed retrieval by jae_63 (Beadle) on Dec 12, 2008 at 14:38 UTC
Thanks for all the responses. In fact I am using BerkeleyDB's BTrees (and cursors) to get good performance, it's just that my code winds up being specialized for each database and rather ugly. My keys (to date) are usually floating point numbers which I format like `%012.7f` since these numbers will never exceed 9999. I perform the one-time database load using db_load. It sounds as though I can't do much better than my current solution and still obtain good performance. One thing that I can do to clean things up is use a Perl iterator. Perhaps I can set this up to include a few parameters and hooks to define the format of the query string. This would yield the generic interface that I'm looking for, and perhaps would be worth releasing as a module. The Bowtie remark was interesting; I'm well-acquainted with that relatively obscure package (as well as the equally obscure Dynamite), but don't think that the very cool and underutilized Burrows-Wheeler transform is applicable here. Thanks again to all ...	[reply]
Re^3: fast+generic interface for range-based indexed retrieval by BrowserUk (Patriarch) on Dec 12, 2008 at 17:21 UTC
My keys (to date) are usually floating point numbers which I format like %012.7f since these numbers will never exceed 9999. Do you need 7 decimal places of accuracy? If your application could get by with 6, you could multiply your floats by 1e6 and store them as integers--which would probably reduce space and speed things up a lot whatever DB you use. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^4: fast+generic interface for range-based indexed retrieval by jae_63 (Beadle) on Dec 14, 2008 at 05:11 UTC
Re^5: fast+generic interface for range-based indexed retrieval by BrowserUk (Patriarch) on Dec 14, 2008 at 09:17 UTC
Some notes below your chosen depth have not been shown here
Re: fast+generic interface for range-based indexed retrieval by mr_mischief (Monsignor) on Dec 10, 2008 at 17:51 UTC
The way I understand your problem, two possible solutions come to mind that use neither BerkelekyDB nor an RDBMS. One is LDAP, which as a directory service is meant for frequent fast reads and slower writes. It handles searches on the server and such very easily. The other is fixed-length record files. If every record is of a fixed length, the place to find a record is simply (record length * record number).	[reply]
Re^2: fast+generic interface for range-based indexed retrieval by CountZero (Bishop) on Dec 11, 2008 at 11:00 UTC
If every record is of a fixed length, the place to find a record is simply (record length record number)* Indeed, but that still leaves you with the problem of compiling the list of the records you need to retrieve. Only if the "sequence" of records can be directly translated into a record number, is a fixed-length record scheme beneficial. Otherwise you will still need to build some kind of index which translates between the "sequence" and the record numbers. It is my experience that one leaves such things best to the database server. Any well-built database server should be optimized to make indexing and retrieval efficient and fast. I do not think any perl-based solution can compete with that in any other than the most trivial of cases. Just consider the time needed to save and load an index (hash?) of several "hundreds of millions" of keys and record-numbers. A modest estimation of say 25 bytes per key / value pair, gives you an index on disk (in YAML or JSON or Data::Dumper format) of several GigaBytes, which you will be hard pressed to keep in memory as a hash. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply]
Re^3: fast+generic interface for range-based indexed retrieval by mr_mischief (Monsignor) on Dec 11, 2008 at 17:22 UTC
It's true that a database server is optimized for complex queries, and that it's difficult to compete with that. The question I was answering had to do with a more generic non-DBMS way to handle the data, which is something jae_63 asked about. The most generic way that does not involve a DBMS is to make fixed-length entries and index into them. That this reinvents a very basic database I think is obvious. That a packaged database server is preferred when someone asks specifically how not to use one is open to debate I suppose. I'll not be part of that debate, as I offered an option and you are free to offer yours.	[reply]
Re^4: fast+generic interface for range-based indexed retrieval by James Board (Beadle) on Dec 12, 2008 at 01:42 UTC
Re: fast+generic interface for range-based indexed retrieval by CountZero (Bishop) on Dec 10, 2008 at 19:42 UTC
How is your range of indexed keys defined? Can it be reduced to a sequence (numeric or alphabetical). SQL could then do the trick: `SELECT * FROM TABLE WHERE KEY_FIELD >= 'sequence_begin' AND KEY_FIELD +<= 'sequence_end'` [download] or if the number of indexed keys is not too big: `SELECT * FROM TABLE WHERE KEY_FIELD IN (key1, key2, key3, ... , keyN)` [download] CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply] [d/l] [select]
Re^2: fast+generic interface for range-based indexed retrieval by duckyd (Hermit) on Dec 10, 2008 at 23:22 UTC
And if SQL can do the trick, but you want to avoid a full DBMS, you might look at DBD::SQLite	[reply]
Re^3: fast+generic interface for range-based indexed retrieval by CountZero (Bishop) on Dec 11, 2008 at 10:38 UTC
A valid comment. However, as the OP was speaking of hundreds of millions of records, I wonder whether SQLite would scale well for such large databases? I have only used it for small projects, but if it can handle in an efficient way such large volumes, I might consider it in the future. Do you have any experience in that respect? CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply]
Re^4: fast+generic interface for range-based indexed retrieval by Anonymous Monk on Dec 12, 2008 at 01:27 UTC
Re: fast+generic interface for range-based indexed retrieval by BrowserUk (Patriarch) on Dec 10, 2008 at 17:56 UTC
range of indexed keys As in `654321 .. 700000` or `DE* .. EF*` (or both)? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re: fast+generic interface for range-based indexed retrieval by MadraghRua (Vicar) on Dec 11, 2008 at 18:18 UTC
I have similar issues in tackling next generation sequencing technology outputs. Typically we're looking at short sequence reads in the range of 35 or so characters. Depending upon the technology, they are either predominantly letter based or number based. As a learning project I've been looking at repeat sequences from the human genome and trying to come up with an indexed set - the idea being I can simply remove these from the original reads and concentrate on working with non repeat sequence reads. To date the best thing I've found is breaking the reads down based upon sequence complexity and producing several sorted indices - using BerkelyDB, DB_File or my own creations.You can also use Perl's hashes to sort keys in this fashion, which can be a useful tool. So far I've been finding preindexing is key, though I've yet to find a really satisfactory way to do it more efficiently in Perl You might want to look into a new tool called Bowtie. It uses a Burrows-Wheeler index to index the reads and then provide fast look up to perform alignments. It is reported to have a really fast assembly time for genomic data. Another alternative is to look at Genomatix GMBH - a bioinformatics company based in Germany. They also have a proprietary indexing scheme that permits fast sequence alignment. Unfortunately the algorithm is not published for this one, but their approach is tp preprocess the genome of interest into kmers ranging from 8mers to million mers and provide theri indexes with their software. A final suggestion is check out Ewan Birney's Dynamite. I've been finding that for smaller genome projects (< 5 reference sequence tags, each 35 base in length) that hashes in Perl work quite well. Perl sorts the hashes based upon key complexity, as far as I can see. If you have access to a server farm and a clustering software like Gluster and load sharing that you could simply distribute the analysis over many nodes and perform the analysis in parallel. So sorry - no specific module recommendations but perhaps looking at Bowtie or Genomatix might spark some ideas for you. MadraghRua yet another biologist hacking perl....	[reply]