in reply to Re^10: Storing large data structures on disk
in thread Storing large data structures on disk

And is 200 an upper limit to the number of indices for one nucleotide for one experiment?

Re^12: Storing large data structures on disk
by roibrodo (Sexton) on Jun 01, 2010 at 08:36 UTC
    That's correct, although it's not a strict limit: the number of results we have follows some distribution, so it could be a bit higher.

      So it's actually a 2D array (nucleotide, experiment set) (10^10 * 10^3) where each cell points to a list of varying length (of up to, let's say, ~200) of integers.

      Besides that, we have 10^3 arrays, each of size 10^5-10^6 (all of the same length, though). Each array represents a specific experiment set, and each cell in such an array holds a reference to a small hash (a specific result).

        Hm. So you have 1e10 * 1e3 * 200 * 4-byte indices = 8 petabytes of indices, pointing to 1e3 * 1e6 = 1 billion small hashes (each costing sizeof "small hash" bytes). And you wanted to use Storable to load this from disk? You won't even be able to hold the 'smaller' dataset in memory, let alone 5% of the larger one, unless you have some pretty special hardware available.
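
As a rough sanity check of those figures, here is a minimal Perl sketch of the arithmetic. The numbers come from the posts above; the variable names are illustrative, and 4-byte indices are assumed as stated.

```perl
use strict;
use warnings;

# Figures from the thread; the variable names are illustrative only.
my $nucleotides      = 1e10;   # positions
my $experiment_sets  = 1e3;
my $indices_per_cell = 200;    # rough upper bound per (nucleotide, set) cell
my $bytes_per_index  = 4;      # assumes 32-bit unsigned indices

# Worst-case size of the index data: every cell holds a full list.
my $index_bytes = $nucleotides * $experiment_sets
                * $indices_per_cell * $bytes_per_index;

# Number of small result hashes, taking the upper array size of 1e6.
my $results_per_set = 1e6;
my $small_hashes    = $experiment_sets * $results_per_set;

printf "index data  : %.0f PB (decimal)\n", $index_bytes / 1e15;
printf "small hashes: %.0e\n", $small_hashes;
```

This is a worst-case estimate: if most cells hold far fewer than 200 indices, the real total shrinks proportionally.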

        On x64 linux, you'll find that you have a 256 GB maximum memory size, which means each of your 1 billion small hashes would have to be less than 274 bytes including overhead. Given that %a = 'a'..'z'; requires 1383 bytes on x64, your hashes would have to be very small to fit into a fully populated x64 machine, even if you exclude any memory used by the OS.
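
The per-hash budget can be reproduced with a short sketch. It takes the 256 GB ceiling from the post at face value, and it only measures the example hash if the CPAN module Devel::Size happens to be installed; the measured size varies with the perl version and build.

```perl
use strict;
use warnings;

my $ram_bytes    = 256 * 2**30;   # the 256 GB ceiling assumed in the post
my $small_hashes = 1e9;           # 1e3 experiment sets * 1e6 results each

# Bytes available per hash if the whole machine held nothing else.
my $budget = int( $ram_bytes / $small_hashes );
print "per-hash budget: $budget bytes\n";

# The example hash from the post: 13 single-letter key/value pairs.
my %a = 'a' .. 'z';

# Devel::Size is optional (CPAN, not core), so load it defensively.
if ( eval { require Devel::Size; 1 } ) {
    printf "%%a occupies %d bytes on this perl\n",
        Devel::Size::total_size( \%a );
}
else {
    print "Devel::Size not installed; skipping the measurement\n";
}
```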

        And the largest single file on ext3 is at most 16 terabytes, so you would have to split your large dataset across at least 500 files--assuming you could find a NASD device capable of handling 8 petabytes.

        Your only reasonable way forward (using commodity hardware) is to a) stop overestimating your growth potential; b) partition your dataset into usable subsets.
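
One way to picture point b) is the sketch below. It is hypothetical, not a layout anyone in this thread proposed: positions are grouped into fixed-size blocks, each (experiment set, block) pair maps to its own file, and each cell's index list is stored as packed 32-bit integers rather than a Perl array. The block size, the path scheme and the helper names are all assumptions.

```perl
use strict;
use warnings;

# Assumed block size; tune it so one block fits comfortably in RAM.
my $positions_per_block = 1_000_000;

# Map an (experiment set, nucleotide position) pair to a partition file.
# The naming scheme is hypothetical.
sub block_path {
    my ( $set, $position ) = @_;
    my $block = int( $position / $positions_per_block );
    return sprintf 'set%04d/block%05d.bin', $set, $block;
}

# Store a cell's index list as big-endian 32-bit unsigned integers,
# 4 bytes per index, instead of a much larger Perl array.
sub pack_cell   { return pack 'N*', @_ }
sub unpack_cell { return unpack 'N*', $_[0] }

my $packed = pack_cell( 17, 4242, 999_999 );

print block_path( 42, 3_141_592_653 ), "\n";
print length($packed), " bytes for 3 indices\n";
print join( ',', unpack_cell($packed) ), "\n";
```

Only the blocks needed for a given query would then be read, so the working set is bounded by the block size rather than by the whole dataset.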

        I was under the impression that the largest genome was the human at 3e9?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.