mab has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone have a faster way than using a cursor as in the example below?

If testing, note that the first time you run the script it creates the db, so that run's timing isn't a legitimate case.

One thing I noticed is that the time to build the array is dependent on the data size. When I change "x 1000" to "x 4000" the time to build the array goes from 0.8 seconds to 1.6. I guess that surprises me because I imagined I wouldn't be hitting the data pages in the db, only the index pages. But I suppose only the hashed key is kept there and not the actual key.
    use BerkeleyDB;

    $create = ! -f 'data.dbm';
    $db = new BerkeleyDB::Btree(
        -Filename => 'data.dbm',
        -Flags    => DB_CREATE,
    ) or die "Cannot open file: $!";

    if ($create) {
        for (my $i = 1; $i <= 100000; $i++) {
            $db->db_put($i, "data" x 1000);
        }
    }

    $cursor = $db->db_cursor();
    @keys   = ();
    $key    = 0;
    $value  = 0;

    for ($status = $cursor->c_get($key, $value, DB_FIRST);
         $status == 0;
         $status = $cursor->c_get($key, $value, DB_NEXT))
    {
        push @keys, $key;
    }

    $cursor->c_close();
    $db->db_close();

    print ( $#keys + 1 ) . "\n";

Replies are listed 'Best First'.
Re: What is a fast way to get keys of a Berkeley DB into an array?
by zwon (Abbot) on Jun 09, 2010 at 18:45 UTC

    You should use strict and warnings. Particularly

    print ( $#keys + 1 ) . "\n";
    is not what you want.
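
    A minimal illustration of why: the parentheses bind to print itself, so print takes ( $#keys + 1 ) as its entire argument list, and the . "\n" is concatenated with print's return value and discarded. Two ways to fix it (the @keys data here is just a stand-in for illustration):

    ```perl
    use strict;
    use warnings;

    my @keys = (1 .. 5);

    # Buggy form: print(...) consumes the parenthesized list, so the
    # trailing . "\n" applies to print's return value and is thrown away:
    #   print ( $#keys + 1 ) . "\n";

    # Correct alternatives:
    print( ($#keys + 1) . "\n" );   # extra parens keep the expression together
    print scalar(@keys), "\n";      # or just count the array directly
    ```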

    Concerning your question: you're certainly hitting data pages, as you are retrieving data values from the database with c_get. Sorry, I have no idea how to avoid this.

    PS it looks like the following is slightly faster:

    my $db = tie my %hash, 'BerkeleyDB::Btree', -Filename => 'data.dbm';
    my @keys = keys %hash;
    print 0+@keys, "\n";
      Thanks - that gives me a 20% improvement for my single data point. I was holding out hope for an order of magnitude improvement but perhaps that isn't going to be achieved without maintaining some sort of key-only database.
Re: What is a fast way to get keys of a Berkeley DB into an array?
by BrowserUk (Patriarch) on Jun 09, 2010 at 20:45 UTC
    I imagined I wouldn't be hitting the data pages in the db, only the index pages. But I suppose only the hashed key is kept there and not the actual key.

    Your example creates a BTree type not a Hash type DB, so there is no "hashed key".

    Iterating a cursor in order to collect all the keys effectively means iterating the entire database. The point of cursors is to allow you to iterate the DB without creating such a collection first.

    I realise that there are legitimate reasons for collecting the keys (i.e. other than for use in subsequently iterating the DB via that collection), but if that is a regular part of your application's requirements, you would do well to keep an "index node" yourself.

    This might be as simple as having a recognisable key that will not be a part of your regular keyset, with the associated data item containing the array of keys. That data item might for example be a Storable frozen array which you read once at the beginning of the application, maintain yourself internally as the DB is used and then write back before closing.

    Of course, you would be responsible for ensuring that index stayed coherent. Perhaps by including within it some kind of integrity check.
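
    A rough sketch of the idea, using Storable's freeze/thaw and a plain hash standing in for the BerkeleyDB tied hash (the sentinel key name is made up for illustration; with the OP's all-numeric keys, any non-numeric key is safe from collision):

    ```perl
    use strict;
    use warnings;
    use Storable qw(freeze thaw);

    # Hypothetical sentinel key that can never clash with real keys.
    my $INDEX_KEY = "\0__key_index__";

    my %db;    # stand-in for the BerkeleyDB tied hash / db_put calls

    # Regular writes, maintaining the key list in memory alongside:
    my @keys;
    for my $k (1 .. 3) {
        $db{$k} = "data" x 10;
        push @keys, $k;
    }

    # Before closing, write the frozen key list back as a single record:
    $db{$INDEX_KEY} = freeze(\@keys);

    # On the next startup, one read recovers the whole key list:
    my $restored = thaw($db{$INDEX_KEY});
    print scalar(@$restored), "\n";    # 3
    ```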


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I guess I must be missing something, but for a btree to work you need to be able to compare the key you are looking for against the values in the internal index nodes of the tree. If the values in the index nodes are actual keys and not some analyzed version of them (which is what I was incorrectly calling the hashed key) then it seems there would be a way to construct the keys in the db by only visiting the index nodes. Since the number and size of these nodes shouldn't be affected by the amount of data in each row, I was a little surprised to see the performance change that I did. Obviously I'm missing something about the implementation, but that was my reasoning and where my comment came from.

      Yes, your solution is the one I have in mind. The concern is maintaining coherency and that's why I was hoping to avoid it.
        If the values in the index nodes are actual keys and not some analyzed version of them (which is what I was incorrectly calling the hashed key) then it seems there would be a way to construct the keys in the db by only visiting the index nodes.

        Berkeley's BTree DBs are actually B+Trees, where all the values(*) are kept in the leaf nodes along with their respective keys. In-order traversal is done by locating the leaf block for the 'lowest' key, and then following peer-level pointers to move from leaf to leaf. Therefore, traversing the entire key-space will mean reading all the keys & values from disk.

        (*)Unless the data values are too large to store in the chosen page-size, in which case 'overflow records' are used.

        The concern is maintaining coherency and that's why I was hoping to avoid it.

        If you use the tied hash interface, you could override the STORE() and DELETE() methods to maintain the key list entry. You could also use the fact that BDB allows you to maintain more than one DB in a single file to store that list as the only record in a second DB. That ought to ensure that single record is perpetually cached.
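
        The overriding idea can be sketched with a plain Tie::StdHash subclass standing in for the BerkeleyDB tie (the class name and tracked_keys helper are made up for illustration; real code would subclass the BerkeleyDB tied interface and keep the key set per-object):

        ```perl
        use strict;
        use warnings;

        package KeyTrackingHash;
        require Tie::Hash;                  # provides Tie::StdHash
        our @ISA = ('Tie::StdHash');

        my %keyset;    # illustration only; real code would store this per-object

        sub STORE {
            my ($self, $key, $value) = @_;
            $keyset{$key} = 1;              # maintain the index on every write
            $self->SUPER::STORE($key, $value);
        }

        sub DELETE {
            my ($self, $key) = @_;
            delete $keyset{$key};           # and on every delete
            $self->SUPER::DELETE($key);
        }

        sub tracked_keys { return keys %keyset }

        package main;
        tie my %hash, 'KeyTrackingHash';
        $hash{$_} = "data" for 1 .. 3;
        delete $hash{2};
        print join(",", sort { $a <=> $b } KeyTrackingHash::tracked_keys()), "\n";  # 1,3
        ```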


Re: What is a fast way to get keys of a Berkeley DB into an array?
by almut (Canon) on Jun 09, 2010 at 19:05 UTC

    As the NEXTKEY implementation in BerkeleyDB.xs also uses c_get, it looks like there is no way to retrieve keys only (without retrieving values).  NEXTKEY is the function called under the hood by BerkeleyDB's tied-hash interface when you call keys %hash.

      It looks like the Berkeley DB C API does support getting multiple keys at once using the DB_MULTIPLE_KEY flag:

      http://pybsddb.sourceforge.net/api_c/dbc_get.html

      It's not apparent to me though that this functionality is accessible through the Berkeley DB Perl module.

        As I understand it, DB_MULTIPLE_KEY would still retrieve complete entries, i.e. keys plus values, so any performance improvement (if the functionality were available, which I think it isn't) would likely be only marginal.