in reply to Speeding a disk based hash

What kind of database are you originally loading from? It sounds like you are slurping some kind of flat file and hashing for searches.

It might be better to populate the BerkeleyDB once and keep it around as the primary database. Other databases like SQLite might be preferable.
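If that route appeals, a minimal sketch of tying a hash to a persistent Berkeley DB file with DB_File might look like the following; the file name and open flags are only placeholders for whatever fits your setup:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw( O_CREAT O_RDWR );

    # Tie the hash to a file once; later runs reuse ids.db instead
    # of re-slurping the flat file.
    tie my %id_for, 'DB_File', 'ids.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie ids.db: $!";

    $id_for{somekey} = 42;                       # persists across runs
    print "$id_for{somekey}\n" if exists $id_for{somekey};

    untie %id_for;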

A sharper picture of what you're doing would help.

After Compline,
Zaxo

Re^2: Speeding a disk based hash
by albert (Monk) on Oct 11, 2004 at 02:37 UTC
    I am trying to load data from flat files into Postgres. The script is a bulk loader which generates individual text files for COPY loading of each table, and indexes are only created once the tables are loaded into Postgres. In the script, I need to keep track of the IDs as they are generated. As long as the hash stays fast, this is a nice way to greatly speed the import compared with a straight load into a schema that already has indexes, etc.

    I'm only using BerkeleyDB to track the hash while building the input tables.
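    Schematically, the loop does something like this; the file and column names are invented just to show the shape of it:

    use strict;
    use warnings;

    my %id_of;          # the hash I'd like to keep fast
    my $next_id = 1;

    open my $in,  '<', 'source.txt'     or die $!;
    open my $out, '>', 'table_copy.txt' or die $!;   # later fed to COPY

    while ( my $line = <$in> ) {
        chomp $line;
        my ( $name, $rest ) = split /\t/, $line, 2;

        # assign an id the first time a name appears, reuse it after that
        $id_of{$name} = $next_id++ unless exists $id_of{$name};

        print {$out} join( "\t", $id_of{$name}, $name, $rest ), "\n";
    }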

      If speed is the issue, avoiding the disk is your best technique. And if the problem with that is that your hash uses too much memory, then there are ways to build a data structure that has some subset of the properties of a hash without the full memory cost of one.

      Whether these are useful in your situation will depend on which properties of a hash you need for your application, and which you do not.

      Your description says "I need to keep track of the IDs as they are generated...", which suggests you are using the hash for fast lookups. If that is the only property of a hash you need, then you might be able to use the methods I described in A (memory) poor man's <strike>hash</strike> lookup table. It's still slower than an in-memory hash, but much faster than a disk-based one, and depending upon the size and makeup of the keys it can use as little as 15% of the memory required by a real hash.
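      That node has the real details; purely to illustrate the general idea, one memory-frugal membership test packs fixed-width keys into a single scalar and probes it with index() rather than keeping a hash entry per key:

      use strict;
      use warnings;

      my $KEYLEN = 8;       # pad/truncate every key to this width
      my $table  = '';      # one long string instead of a hash

      sub add_key {
          my( $key ) = @_;
          $table .= pack "A$KEYLEN", $key;
      }

      sub has_key {
          my( $key ) = @_;
          my $needle = pack "A$KEYLEN", $key;
          my $pos = 0;
          # only accept matches that start on a slot boundary
          while( ( $pos = index $table, $needle, $pos ) >= 0 ) {
              return 1 if $pos % $KEYLEN == 0;
              $pos++;
          }
          return 0;
      }

      add_key( $_ ) for qw( AB123 CD456 EF789 );
      print has_key( 'CD456' ) ? "found\n" : "missing\n";

      If your ids are simply assigned in the order the keys are first seen, the slot number ($pos / $KEYLEN) can even stand in for the value, giving you the lookup side as well.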

      More information on how you're using the hash (how many keys, and their type and range) is needed to decide whether it is applicable to your requirements.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

      Another approach that may or may not work, depending on your exact requirements, is loading the keys into memory in blocks. If you're going through the indexes sequentially, then it may make sense to first load, say, 1000 from disk, then use those from memory. When you are done with that block, purge it and load the next 1000. This way, you spread your disk access out over time, with periods of pure memory access in between.
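      As a rough sketch (assuming the ids are dense integers and a tied Berkeley DB file already exists), the blocked access might look like this:

      use strict;
      use warnings;
      use DB_File;
      use Fcntl qw( O_RDONLY );

      tie my %on_disk, 'DB_File', 'ids.db', O_RDONLY, 0644, $DB_HASH
          or die "Cannot tie ids.db: $!";

      my $BLOCK = 1000;
      my $LAST  = 1_000_000;             # placeholder upper bound

      for( my $start = 1; $start <= $LAST; $start += $BLOCK ) {

          # one burst of disk access ...
          my %block;
          for my $id ( $start .. $start + $BLOCK - 1 ) {
              $block{$id} = $on_disk{$id} if exists $on_disk{$id};
          }

          # ... followed by pure in-memory lookups for this range
          for my $id ( sort { $a <=> $b } keys %block ) {
              # do_something( $id, $block{$id} );
          }
      }

      untie %on_disk;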

      I suspect this might not work in your situation, but, as everyone else keeps saying, you haven't really provided enough information for us to know for sure. Hopefully by throwing random ideas out there, we'll come up with one that works. :-)