in reply to Re^12: Storing large data structures on disk
in thread Storing large data structures on disk

So it's actually a 2D array (nucleotide, experiment set) of 10^10 * 10^3 cells, where each cell points to a list of integers of varying length (up to, let's say, ~200).

Besides that, we have 10^3 arrays, each of size 10^5-10^6 (all of the same length, though); each array represents a specific experiment set. Each cell in such an array holds a reference to a small hash (a specific result).
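Roughly, in Perl terms, the shape is something like this (a toy sketch only -- field names and values are placeholders, and at 10^10 positions nothing like this would actually fit in RAM as native Perl structures):

    use strict;
    use warnings;

    # (position, experiment set) -> list of up to ~200 integers
    my %per_position;
    push @{ $per_position{1_234_567}[42] }, 17, 23, 101;

    # one array per experiment set, 10^5-10^6 cells,
    # each cell a reference to a small result hash
    my @per_set;
    $per_set[42][99_000] = { score => 0.87, count => 12 };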


Re^14: Storing large data structures on disk
by BrowserUk (Patriarch) on Jun 01, 2010 at 09:23 UTC

    Hm. So you have 1e10 * 1e3 * 200 * 4-byte indices = 8 Petabytes of indices, pointing to 1e3 * 1e6 * (sizeof "small hash") = 1 billion small hashes. And you wanted to use Storable to load this from disk? You won't even be able to hold the 'smaller' dataset in memory, let alone 5% of the larger one, unless you have some pretty special hardware available.
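    Back-of-the-envelope, in Perl, for anyone who wants to check the arithmetic:

        use strict;
        use warnings;

        # assumes 4-byte integers and ~200 entries per cell, as stated above
        my $index_bytes  = 1e10 * 1e3 * 200 * 4;    # 8e15 bytes
        my $small_hashes = 1e3 * 1e6;               # 1e9

        printf "indices:        %.0f PB\n",      $index_bytes / 1e15;    # 8 PB
        printf "small hashes:   %.0f billion\n", $small_hashes / 1e9;    # 1 billion
        printf "16 TB files:    %.0f\n",         $index_bytes / 16e12;   # 500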

    On x64 Linux, you'll find that you have a 256 GB maximum memory size, which means each of your 1 billion small hashes would have to be less than 274 bytes including overhead. Given that %a = 'a'..'z'; requires 1383 bytes on x64, they would have to be very small indeed to fit into a fully populated x64 machine, even if you exclude any memory used by the OS.
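    If you'd rather measure one of your actual "small hashes" than guess, Devel::Size's total_size() (which follows references) gives the real cost; a quick sketch (the exact byte counts will vary with your perl build):

        use strict;
        use warnings;
        use Devel::Size qw(total_size);   # CPAN; total_size follows refs

        my %a = 'a' .. 'z';               # the 13-pair example hash above
        printf "one small hash: %d bytes\n", total_size( \%a );   # ~1383 on x64

        # the per-hash budget if 1e9 hashes must share 256 GB
        printf "budget per hash: %d bytes\n", 256 * 2**30 / 1e9;  # ~274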

    And the largest single file (ext3) is 16 Terabytes, so you would have to split your large dataset across 500 files minimum--assuming you could find a NAS device capable of handling 8 petabytes.

    Your only reasonable way forward (using commodity hardware) is to: a) stop overestimating your growth potential; b) partition your dataset into usable subsets.
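    For b), one possible shape (file names and data here are purely illustrative, not a recommendation) is one Storable file per experiment set, so that a given analysis only ever loads the subsets it needs:

        use strict;
        use warnings;
        use Storable qw(nstore retrieve);

        # toy data: one array of small result hashes per experiment set
        my %per_set = (
            setA => [ { score => 0.9, count => 3 }, { score => 0.1, count => 7 } ],
            setB => [ { score => 0.5, count => 2 } ],
        );

        # write each subset to its own file ...
        nstore( $per_set{$_}, "results_$_.stor" ) for keys %per_set;

        # ... and later load back only the subset a given analysis needs
        my $setA = retrieve('results_setA.stor');
        printf "setA holds %d results\n", scalar @$setA;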

    I was under the impression that the largest genome was the human at 3e9?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I dropped Storable long before I got to building the whole structure. I only tested it on something like ~4GB.

      Anyway, I will try to rethink my needs etc. and will probably come back for more help.

      p.s.

      actually, we're not the largest. Some plants have larger genomes, and some amoebozoans have genomes up to 200 times larger than the human genome! I guess size does not (always) matter...

        actually, we're not the largest. Some plants have larger genomes, and some amoebozoans have genomes up to 200 times larger than the human genome! I guess size does not (always) matter...

        Plants, they're all cargo-cult copy pasta, change one char, copy pasta :)

      And the largest single file (ext3) is 16 Terabytes,
      Switch to ZFS, and your largest file size increases a millionfold to 16 Exabytes. There are of course some practical considerations... (it won't fit on a memory stick!)
        ext4 allows 1 Exabyte volumes (though a single file there tops out at 16 Terabytes), and Btrfs also goes to 16 Exabytes, if one doesn't want to switch to Solaris or external storage.