comment on

If you find Berkeley too slow, or too memory expensive in practice, you might reconsider the md5 hashing suggestion. I've implemented this for my own purposes now, and the highlights are:

Indexing 100_000_000, 160 byte records in a data file.

The index requires 2.23 GB for 100 Million records. Data record size does not affect the index size. It is a fixed 24-bytes/record.
Building and sorting the index: 3 hours (worst case; under 1 hour best);
Accessing 10_000 randomly chosen records: Under 3 minutes.
That's locating in the index entry and reading the record combined.
Worse timing: 1000 trials of binsearch ( 37.753s total), 37.753ms/trial
Best timing: 10000 trials of binsearch ( 175.755s total), 17.576ms/trial
Update: A 100,000 thrials just completed:
100000 trials of binsearch ( 1,643s total), 16.437ms/trial
Insert/delete* a record: Currently 1.2 seconds.
This can be improved, I believe substantially.
Insertion appends the new record to the end of the data file, and inserts the appropriate index entry.
* Deletion consists of removing the index entry and adding it to a "deleted.idx" file.
The actual record remains in-place until a compaction process is run. The latter is not part of the timing above.

The above is hitting the disk (or system caches) for every read. I have some ideas for adding my own buffering , but they need testing.

The test system was XP/NTFS on a 2.2 Ghz processor with a WDC WD800BB-75CAA0 75GB HD.

The datafile is compressed, the index not.

For contrast, I gave up waiting for Berkeley to build a BTree index from a pre-sorted file of index entries after 18+ hours and 57% complete. Hence, I don't have figures for access/insertions or deletions.

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

In reply to Re^3: size on disk of tied hashes by BrowserUk
in thread size on disk of tied hashes by danderson

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.