Hm. So you have 1e10 * 1e3 * 200 4-byte indices = 8 petabytes of indices, pointing to 1e3 * 1e6 = 1 billion small hashes (times the size of each "small hash" in bytes). And you want to use Storable to load this from disk? You won't even be able to hold the 'smaller' dataset in memory, let alone 5% of the larger one, unless you have some pretty special hardware available.
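To make the arithmetic explicit, here's a throwaway calculation using the figures quoted above (how the three factors map onto your positions/genomes/hits is your data model, not mine):

    use strict;
    use warnings;

    # Back-of-envelope check of the figures above.
    my $index_bytes  = 1e10 * 1e3 * 200 * 4;   # three factors, 4 bytes per index
    my $small_hashes = 1e3 * 1e6;              # count of small hashes

    printf "indices:      %.0f PB\n",      $index_bytes  / 1e15;   # 8 PB
    printf "small hashes: %.0f billion\n", $small_hashes / 1e9;    # 1 billion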
On x64 Linux you'll find you have a 256 GB maximum memory size, which means each of your 1 billion small hashes would have to occupy less than 274 bytes including overhead. Given that %a = 'a'..'z'; requires 1383 bytes on x64, they would have to be very small indeed to fit into a fully populated x64 machine, even if you exclude any memory used by the OS.
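If you want to check that overhead on your own box, Devel::Size will report the total footprint of a structure (the exact byte count varies with perl version and build):

    use strict;
    use warnings;
    use Devel::Size qw( total_size );

    my %a = 'a' .. 'z';                       # 13 key/value pairs
    print total_size( \%a ), " bytes\n";      # ~1383 on a 64-bit perl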
And the largest ext3 filesystem is 16 terabytes (single files are capped lower still), so you would have to split your large dataset across 500 files minimum--assuming you could find a NAS device capable of handling 8 petabytes.
Your only reasonable way forward (using commodity hardware) is to a) stop overestimating your growth potential; or b) partition your dataset into usable subsets (a rough sketch of one partitioning approach follows).
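For b), one minimal approach (a sketch only; the bucket count, file names and data layout here are my assumptions, not your code) is to hash each key into one of N bucket files, so a lookup only ever deserialises one bucket:

    use strict;
    use warnings;
    use Storable qw( nstore retrieve );
    use Digest::MD5 qw( md5 );

    my $BUCKETS = 1024;

    # Map a key to one of $BUCKETS partitions.
    sub bucket_of {
        my( $key ) = @_;
        return unpack( 'N', md5( $key ) ) % $BUCKETS;
    }

    # Split a big hash-of-hashes across one Storable file per bucket.
    sub store_partitioned {
        my( $data ) = @_;                          # hashref: key => small hashref
        my @parts = map { +{} } 1 .. $BUCKETS;
        while( my( $k, $v ) = each %$data ) {
            $parts[ bucket_of( $k ) ]{ $k } = $v;
        }
        nstore( $parts[ $_ ], "bucket_$_.sto" ) for 0 .. $BUCKETS - 1;
    }

    # Retrieve just the one bucket a key lives in.
    sub lookup {
        my( $key ) = @_;
        my $part = retrieve( 'bucket_' . bucket_of( $key ) . '.sto' );
        return $part->{ $key };
    }

With 1024 buckets each file holds roughly a thousandth of the data, so memory only ever has to hold the bucket you are working on.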
I was under the impression that the largest genome was the human at 3e9?
In reply to Re^14: Storing large data structures on disk by BrowserUk
in thread Storing large data structures on disk by roibrodo