in reply to size on disk of tied hashes

Build your own index to the data.

Use the MD5 of the key (128 bits, stored as binary) plus the file position (64 bits): 16 + 8 = 24 bytes per record, times roughly 500 million records.

That is about 12 billion bytes, or roughly 11 GB of index file.

Sort by the MD5.
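
As a rough illustration of that record layout, here is a minimal Python sketch. The name `build_index` and the big-endian offset encoding are my own assumptions, not anything from the original thread, and it sorts in memory purely for brevity. At ~500 million records you would need an external (on-disk) sort instead.

```python
import hashlib
import struct

RECORD_SIZE = 24  # 16-byte binary MD5 digest + 8-byte file offset


def build_index(entries):
    """Build sorted index bytes from an iterable of (key_bytes, offset) pairs.

    Hypothetical helper: each record is the key's MD5 digest followed by
    the offset packed as an unsigned 64-bit big-endian integer.
    """
    records = [
        hashlib.md5(key).digest() + struct.pack(">Q", offset)
        for key, offset in entries
    ]
    # The digest is the leading 16 bytes of every record, so a plain
    # bytewise sort orders the records by MD5.
    records.sort()
    return b"".join(records)


index = build_index([(b"apple", 0), (b"banana", 100), (b"cherry", 250)])
print(len(index) // RECORD_SIZE)  # 3
```

Because every record is exactly 24 bytes, record number i always starts at byte 24 * i, which is what makes the lookup step cheap.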

With fixed-length records, writing a binary chop to locate a record's offset is relatively easy and gives you O(log n) access time.
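
A binary chop over such an index might look like the following Python sketch. It assumes 24-byte records (16-byte MD5 digest followed by an 8-byte big-endian offset) sorted by digest. The function name `lookup` and the encoding are illustrative assumptions, and the tiny in-memory index stands in for the real file on disk.

```python
import hashlib
import io
import struct

RECORD_SIZE = 24  # 16-byte MD5 digest + 8-byte offset (assumed layout)
DIGEST_SIZE = 16


def lookup(index_file, key):
    """Binary-search a seekable, digest-sorted index; return offset or None."""
    target = hashlib.md5(key).digest()
    index_file.seek(0, io.SEEK_END)
    lo, hi = 0, index_file.tell() // RECORD_SIZE  # search window [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        index_file.seek(mid * RECORD_SIZE)  # fixed length: record i is at 24 * i
        record = index_file.read(RECORD_SIZE)
        digest = record[:DIGEST_SIZE]
        if digest == target:
            return struct.unpack(">Q", record[DIGEST_SIZE:])[0]
        if digest < target:
            lo = mid + 1
        else:
            hi = mid
    return None


# Small in-memory index for demonstration only.
sample = {b"apple": 0, b"banana": 100, b"cherry": 250}
records = sorted(
    hashlib.md5(k).digest() + struct.pack(">Q", off) for k, off in sample.items()
)
index = io.BytesIO(b"".join(records))
print(lookup(index, b"banana"))  # 100
print(lookup(index, b"durian"))  # None
```

For 500 million records that is about 29 seeks per lookup in the worst case, since 2^29 is roughly 537 million.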

That still pushes you beyond your 40GB disk, but 60GB disks aren't that much more expensive.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon