If you find Berkeley DB too slow, or too memory-expensive in practice, you might reconsider the md5 hashing suggestion. I've now implemented this for my own purposes, and the highlights are:
* Indexing 100_000_000 x 160-byte records in a data file.
* Lookups: locating the index entry and reading the record, combined (a rough sketch of the lookup follows this list):
  Worst timing: 1000 trials of binsearch ( 37.753s total), 37.753ms/trial
  Best timing: 10000 trials of binsearch ( 175.755s total), 17.576ms/trial
  Update: a run of 100,000 trials has just completed:
  100000 trials of binsearch ( 1,643s total), 16.437ms/trial
  I believe this can be improved substantially.
* Insertion appends the new record to the end of the data file and inserts the appropriate index entry (also sketched below).
* Deletion consists of removing the index entry and adding it to a "deleted.idx" file. The actual record remains in place until a compaction process is run; the compaction is not included in the timings above.
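Schematically, the lookup amounts to something like this. It is a simplified sketch only: I'm assuming packed ( 16-byte md5 digest, 8-byte offset ) index entries sorted by digest, fixed 160-byte records, and illustrative file names.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5 );

my $KEYLEN   = 16;                  ## raw MD5 digest
my $OFFLEN   =  8;                  ## packed offset ('Q' needs a 64-bit perl)
my $ENTRYLEN = $KEYLEN + $OFFLEN;   ## fixed-width index entry
my $RECLEN   = 160;                 ## fixed-width data record

open my $idx, '<:raw', 'index.idx' or die $!;
open my $dat, '<:raw', 'data.dat'  or die $!;
my $entries = ( -s $idx ) / $ENTRYLEN;

sub lookup {
    my( $key ) = @_;
    my $want = md5( $key );                         ## 16-byte binary digest
    my( $lo, $hi ) = ( 0, $entries - 1 );

    while( $lo <= $hi ) {                           ## binsearch over the index file
        my $mid = int( ( $lo + $hi ) / 2 );
        seek $idx, $mid * $ENTRYLEN, 0 or die $!;
        read $idx, my $entry, $ENTRYLEN;
        my $got = substr $entry, 0, $KEYLEN;

        if(    $got lt $want ) { $lo = $mid + 1 }
        elsif( $got gt $want ) { $hi = $mid - 1 }
        else {                                      ## hit: fetch the record itself
            my $offset = unpack 'Q', substr $entry, $KEYLEN;
            seek $dat, $offset, 0 or die $!;
            read $dat, my $record, $RECLEN;
            return $record;
        }
    }
    return;                                         ## not found
}
```

With 100 million entries that is at most 27 probes of the index per lookup plus one read of the record; at ~17ms/trial, most of those probes must be coming back from the system cache rather than the platters.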
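The insert/delete bookkeeping, equally simplified. The tail-rewrite used here to keep the index sorted is only there to make the sketch self-contained; at 100 million entries it would be hopelessly slow.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5 );

my $KEYLEN   = 16;
my $ENTRYLEN = 24;      ## as in the lookup sketch: 16-byte md5 + 'Q' offset

## Byte position where $want sits (or would be inserted) in the sorted index.
sub find_pos {
    my( $idx, $want ) = @_;
    my( $lo, $hi ) = ( 0, ( -s $idx ) / $ENTRYLEN );
    while( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        seek $idx, $mid * $ENTRYLEN, 0 or die $!;
        read $idx, my $entry, $ENTRYLEN;
        if( substr( $entry, 0, $KEYLEN ) lt $want ) { $lo = $mid + 1 }
        else                                        { $hi = $mid     }
    }
    return $lo * $ENTRYLEN;
}

## Insertion: append the record, then splice its entry into the index.
## (Compression of the data file is ignored in this sketch.)
sub insert_record {
    my( $key, $record ) = @_;                   ## $record assumed 160 bytes

    open my $dat, '>>:raw', 'data.dat' or die $!;
    my $offset = -s $dat;                       ## append position == old size
    print {$dat} $record;
    close $dat;

    my $entry = md5( $key ) . pack 'Q', $offset;
    open my $idx, '+<:raw', 'index.idx' or die $!;
    my $size = -s $idx;
    my $pos  = find_pos( $idx, md5( $key ) );
    my $tail = '';
    seek $idx, $pos, 0;
    read $idx, $tail, $size - $pos if $size > $pos;
    seek $idx, $pos, 0;
    print {$idx} $entry, $tail;                 ## naive tail rewrite
    close $idx;
}

## Deletion: log the entry to deleted.idx and close the gap in the live index;
## the data record stays where it is until compaction.
sub delete_record {
    my( $key ) = @_;
    my $want = md5( $key );

    open my $idx, '+<:raw', 'index.idx' or die $!;
    my $size = -s $idx;
    my $pos  = find_pos( $idx, $want );
    seek $idx, $pos, 0;
    read $idx, my $entry, $ENTRYLEN;
    return unless defined $entry and substr( $entry, 0, $KEYLEN ) eq $want;

    open my $del, '>>:raw', 'deleted.idx' or die $!;
    print {$del} $entry;
    close $del;

    my $tail = '';
    read $idx, $tail, $size - $pos - $ENTRYLEN if $size > $pos + $ENTRYLEN;
    seek $idx, $pos, 0;
    print {$idx} $tail;
    truncate $idx, $size - $ENTRYLEN;           ## naive tail rewrite again
    close $idx;
}
```

In reality the index maintenance wants batching, or a small side index that gets merged periodically; the tail rewrite above is purely illustrative.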
The timings above hit the disk (or the system caches) for every read. I have some ideas for adding my own buffering, but they need testing.
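One obvious shape that buffering could take -- untested, and only one possibility -- is to memoise the index entries the binary search touches, since every search starts by probing the same handful of slots near the middle of the index:

```perl
my $ENTRYLEN   = 24;
my $MAX_CACHED = 100_000;       ## ~2.4MB of cached index entries
my %entry_cache;

sub read_entry {
    my( $idx, $slot ) = @_;
    return $entry_cache{ $slot } if exists $entry_cache{ $slot };

    seek $idx, $slot * $ENTRYLEN, 0 or die $!;
    read $idx, my $entry, $ENTRYLEN;

    %entry_cache = () if keys( %entry_cache ) >= $MAX_CACHED;   ## crude size cap
    return $entry_cache{ $slot } = $entry;
}
```

The first dozen or so probes of every lookup land on the same small set of entries, so even a modest cache removes a good fraction of the seeks.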
The test system was XP/NTFS on a 2.2 GHz processor with a WDC WD800BB-75CAA0 75GB HD. The datafile is compressed; the index is not.
For contrast, I gave up waiting for Berkeley DB to build a BTree index from a pre-sorted file of index entries after 18+ hours with it only 57% complete. Hence, I have no figures for its accesses, insertions, or deletions.
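For anyone wanting to reproduce the comparison, the build amounts to something like the following -- shown with DB_File and default BTree settings purely as an illustration, not as a record of the exact interface or tuning used:

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw( O_CREAT O_RDWR );

tie my %btree, 'DB_File', 'index.bdb', O_CREAT | O_RDWR, 0666, $DB_BTREE
    or die "Cannot create index.bdb: $!";

open my $in, '<', 'presorted.idx' or die $!;
while( my $line = <$in> ) {
    chomp $line;
    my( $key, $offset ) = split ' ', $line, 2;  ## assumed: whitespace-separated entries
    $btree{ $key } = $offset;                   ## one BTree insert per pre-sorted entry
}
close $in;
untie %btree;
```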