in reply to Re^2: size on disk of tied hashes
in thread size on disk of tied hashes

If you find Berkeley too slow, or too memory expensive in practice, you might reconsider the md5 hashing suggestion. I've implemented this for my own purposes now, and the highlights are:

Indexing 100_000_000, 160 byte records in a data file.

  1. The index requires 2.23 GB for 100 Million records. Data record size does not affect the index size. It is a fixed 24-bytes/record.
  2. Building and sorting the index: 3 hours (worst case; under 1 hour best);
  3. Accessing 10_000 randomly chosen records: Under 3 minutes.

    That's locating in the index entry and reading the record combined.

    Worse timing: 1000 trials of binsearch ( 37.753s total), 37.753ms/trial

    Best timing: 10000 trials of binsearch ( 175.755s total), 17.576ms/trial

    Update: A 100,000 thrials just completed:

    100000 trials of binsearch ( 1,643s total), 16.437ms/trial

  4. Insert/delete* a record: Currently 1.2 seconds.

    This can be improved, I believe substantially.

    Insertion appends the new record to the end of the data file, and inserts the appropriate index entry.

    * Deletion consists of removing the index entry and adding it to a "deleted.idx" file.

    The actual record remains in-place until a compaction process is run. The latter is not part of the timing above.

The above is hitting the disk (or system caches) for every read. I have some ideas for adding my own buffering , but they need testing.

The test system was XP/NTFS on a 2.2 Ghz processor with a WDC WD800BB-75CAA0 75GB HD.

The datafile is compressed, the index not.

For contrast, I gave up waiting for Berkeley to build a BTree index from a pre-sorted file of index entries after 18+ hours and 57% complete. Hence, I don't have figures for access/insertions or deletions.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon