in reply to fast+generic interface for range-based indexed retrieval

The way I understand your problem, two possible solutions come to mind that use neither BerkeleyDB nor an RDBMS.

One is LDAP, which as a directory service is meant for frequent, fast reads and slower writes. It handles searching on the server side very easily.
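For instance, a minimal Net::LDAP sketch of a server-side range search (the host, base DN, and the "seq" attribute are made-up placeholders, and a range filter like this assumes the attribute has an ordering matching rule):

    use strict;
    use warnings;
    use Net::LDAP;

    my $ldap = Net::LDAP->new('ldap.example.com') or die "$@";
    $ldap->bind;    # anonymous bind

    # Ask the server for every entry whose "seq" falls in [1000, 2000].
    my $result = $ldap->search(
        base   => 'ou=records,dc=example,dc=com',
        filter => '(&(seq>=1000)(seq<=2000))',
        attrs  => [ 'seq', 'payload' ],
    );
    die $result->error if $result->code;

    for my $entry ($result->entries) {
        printf "%s => %s\n", $entry->get_value('seq'), $entry->get_value('payload');
    }
    $ldap->unbind;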

The other is fixed-length record files. If every record is of a fixed length, the byte offset of a record is simply (record length * record number), so any record can be read directly with a seek.
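A minimal sketch of reading such a file, assuming a made-up layout of a 16-byte key followed by a 64-byte value (80 bytes per record):

    use strict;
    use warnings;

    my $REC_LEN = 80;    # fixed record length in bytes

    open my $fh, '<:raw', 'records.dat' or die "open: $!";

    # Jump straight to record N; no index needed.
    sub get_record {
        my ($fh, $recno) = @_;
        seek $fh, $recno * $REC_LEN, 0 or die "seek: $!";
        my $got = read $fh, my $buf, $REC_LEN;
        return unless defined $got && $got == $REC_LEN;
        return unpack 'A16 A64', $buf;    # (key, value)
    }

    my ($key, $value) = get_record($fh, 12_345);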


Re^2: fast+generic interface for range-based indexed retrieval
by CountZero (Bishop) on Dec 11, 2008 at 11:00 UTC
    If every record is of a fixed length, the byte offset of a record is simply (record length * record number)
    Indeed, but that still leaves you with the problem of compiling the list of the records you need to retrieve. A fixed-length record scheme is beneficial only if the "sequence" of records can be directly translated into a record number. Otherwise you will still need to build some kind of index which translates between the "sequence" and the record numbers.

    In my experience, such things are best left to the database server. Any well-built database server should be optimized to make indexing and retrieval efficient and fast, and I do not think any Perl-based solution can compete with that in anything but the most trivial of cases. Just consider the time needed to save and load an index (a hash?) of several "hundreds of millions" of keys and record numbers. A modest estimate of, say, 25 bytes per key/value pair puts the index on disk (in YAML, JSON, or Data::Dumper format) at several gigabytes (200 million pairs at 25 bytes each is already 5 GB), which you will be hard pressed to keep in memory as a hash.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      It's true that a database server is optimized for complex queries, and that it's difficult to compete with one.

      The question I was answering had to do with a more generic non-DBMS way to handle the data, which is what jae_63 asked about. The most generic way that does not involve a DBMS is to make fixed-length entries and index into them. That this reinvents a very basic database is, I think, obvious.

      That a packaged database server is preferable when someone asks specifically how not to use one is open to debate, I suppose. I'll not be part of that debate; I offered an option, and you are free to offer yours.

        How about some sort of reverse index? Maybe slice the range into N contiguous sub-ranges and maintain a list of all keys that fall within each sub-range. Then, when you need to find all keys between A and B, look at which sub-ranges intersect the interval (A, B), and select keys from them.

        This may or may not be feasible for you, depending on your application, but if it is, it could reduce the query time dramatically.
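        A rough sketch of that idea, with a made-up bucket width and stand-in data:

            use strict;
            use warnings;

            my $WIDTH = 1_000;    # width of each contiguous sub-range

            # Stand-in keys; in practice these come from your data set.
            my @all_keys = map { int rand 1_000_000 } 1 .. 10_000;

            # Reverse index: bucket number => keys falling in that sub-range.
            my %bucket;
            push @{ $bucket{ int($_ / $WIDTH) } }, $_ for @all_keys;

            # All keys in [$lo, $hi]: scan only the buckets the interval touches.
            sub range_query {
                my ($lo, $hi) = @_;
                return grep { $_ >= $lo && $_ <= $hi }
                       map  { @{ $bucket{$_} || [] } }
                       int($lo / $WIDTH) .. int($hi / $WIDTH);
            }

            my @hits = range_query(42_000, 43_500);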

        Jim