If you find Berkeley too slow, or too memory-expensive in practice, you might reconsider the md5 hashing suggestion. I've implemented this for my own purposes now, and the highlights are:
Indexing 100_000_000 160-byte records in a data file.
That's locating the index entry and reading the record, combined.
Worst timing: 1000 trials of binsearch (37.753s total), 37.753ms/trial
Best timing: 10000 trials of binsearch (175.755s total), 17.576ms/trial
Update: A run of 100,000 trials just completed:
100,000 trials of binsearch (1,643s total), 16.437ms/trial
This can be improved, I believe substantially.
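To make the mechanism concrete, the lookup being timed is essentially a binary search over a file of sorted, fixed-width index entries. The sketch below is a simplified illustration, not the code timed above: the 24-byte entry layout (16-byte MD5 of the key plus an 8-byte packed offset into the data file), the names, and the 'Q' template (which needs a 64-bit Perl) are all assumptions. Both handles need to be opened with binmode.

    use strict;
    use warnings;
    use Fcntl qw( :seek );
    use Digest::MD5 qw( md5 );

    ## Illustrative layout: 16-byte raw MD5 + 8-byte offset per entry.
    use constant ENTRY_SIZE => 24;

    sub lookup {
        my( $idxFH, $datFH, $key ) = @_;
        my $want = md5( $key );
        my( $lo, $hi ) = ( 0, ( -s $idxFH ) / ENTRY_SIZE - 1 );

        while( $lo <= $hi ) {
            my $mid = int( ( $lo + $hi ) / 2 );
            seek $idxFH, $mid * ENTRY_SIZE, SEEK_SET;
            read $idxFH, my $entry, ENTRY_SIZE;
            my( $hash, $offset ) = unpack 'a16 Q', $entry;

            my $cmp = $hash cmp $want;          ## Index is sorted by raw digest.
            if(    $cmp < 0 ) { $lo = $mid + 1 }
            elsif( $cmp > 0 ) { $hi = $mid - 1 }
            else {
                seek $datFH, $offset, SEEK_SET; ## Hit: fetch the record.
                read $datFH, my $record, 160;
                return $record;
            }
        }
        return;                                 ## Not found.
    }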
* Insertion appends the new record to the end of the data file and inserts the appropriate index entry.
* Deletion consists of removing the index entry and appending it to a "deleted.idx" file.
The actual record remains in place until a compaction process is run; the compaction is not part of the timing above. (Both operations are sketched below.)
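In outline, the two operations look something like this. Again a rough sketch: insert_index_entry and remove_index_entry are hypothetical stand-ins for the index maintenance, which amounts to the same binary search plus shuffling the tail of the index file.

    sub insert_record {
        my( $idxFH, $datFH, $key, $record ) = @_;
        seek $datFH, 0, SEEK_END;                ## Append the record...
        my $offset = tell $datFH;
        print $datFH $record;
        ## ...then splice a new entry into the sorted index at the
        ## position the binary search would find it. (Hypothetical helper.)
        insert_index_entry( $idxFH, md5( $key ), $offset );
    }

    sub delete_record {
        my( $idxFH, $delFH, $key ) = @_;
        ## Pull the entry out of the live index and log it to deleted.idx;
        ## the record stays put until compaction. (Hypothetical helper.)
        my $entry = remove_index_entry( $idxFH, md5( $key ) );
        print $delFH $entry if defined $entry;
    }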
The above is hitting the disk (or system caches) for every read. I have some ideas for adding my own buffering, but they need testing.
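One simple form that buffering might take (illustrative only and, as said, untested) is to cache whole slabs of the index, so the upper levels of each binary search stop costing a read. This reuses ENTRY_SIZE and SEEK_SET from the earlier sketch:

    ## Slab size is a multiple of ENTRY_SIZE so no entry straddles a slab.
    use constant SLAB_ENTRIES => 170;
    use constant SLAB_SIZE    => SLAB_ENTRIES * ENTRY_SIZE;   ## 4080 bytes

    my %slabCache;   ## A real version would cap this and evict old slabs.

    sub read_entry {
        my( $idxFH, $n ) = @_;
        my $slab = int( $n / SLAB_ENTRIES );
        unless( exists $slabCache{ $slab } ) {
            seek $idxFH, $slab * SLAB_SIZE, SEEK_SET;
            read $idxFH, $slabCache{ $slab }, SLAB_SIZE;
        }
        return substr $slabCache{ $slab },
            ( $n % SLAB_ENTRIES ) * ENTRY_SIZE, ENTRY_SIZE;
    }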
The test system was XP/NTFS on a 2.2 GHz processor with a WDC WD800BB-75CAA0 75GB HD.
The data file is compressed; the index is not.
For contrast, I gave up waiting for Berkeley to build a BTree index from a pre-sorted file of index entries after 18+ hours, at 57% complete. Hence, I don't have figures for its accesses, insertions, or deletions.