in reply to Re^7: Storing large data structures on disk
in thread Storing large data structures on disk

Here is a short description of my application and its intended use.

The data structure holds the results of some biological experiments. The main array of the AoA represents a genome, and each subarray holds the results for a specific location in the genome (a "nucleotide"), so the main array might be very large (up to ~10^10 "nucleotides").

Each such subarray holds a list of measurements referring to this nucleotide. The results themselves are stored in a separate array @res (as small objects), and what I keep in each subarray of my AoA are the indices of those results in @res. This is because many nucleotides may point to the same results (they are highly dependent). This way, when I want to get the results for some nucleotide, I go to its location in the AoA and follow the list of indices stored there to pull the objects out of @res.
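A minimal in-memory sketch of this layout, with toy data. The names @genome and results_for and the fields id and score are made up for illustration; only @res comes from the description above.

```perl
use strict;
use warnings;

# Hypothetical result objects: small hashes with a few fields.
my @res = (
    { id => 'exp_a', score => 0.91 },
    { id => 'exp_b', score => 0.42 },
    { id => 'exp_c', score => 0.77 },
);

# One subarray per nucleotide; each holds indices into @res.
my @genome = (
    [ 0, 1 ],   # nucleotide 0
    [ 0, 1 ],   # nucleotide 1 shares its results with nucleotide 0
    [ 2 ],      # nucleotide 2
    [],         # nucleotide 3 has no measurements
);

# Return the result objects for one nucleotide position
# by following its index list into @res.
sub results_for {
    my ($pos) = @_;
    return @res[ @{ $genome[$pos] } ];
}

print join(', ', map { $_->{id} } results_for(1)), "\n";   # exp_a, exp_b
```

Because the subarrays store indices rather than copies, nucleotides 0 and 1 refer to the very same hashes in @res.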

This is not supposed to work as a web app. The normal usage will be to focus on some specific region (a range of nucleotides), then pull out its results and do something with them (this "something" can be many things). So, yes, the data will usually be taken out of the large AoA in arbitrary chunks. The size of the chunks may vary but will usually be less than 5% of the entire dataset. The dataset is written once and can then become read-only.
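The region access pattern could look roughly like this, again as an in-memory sketch on toy data. The sub name results_in_region and the id field are assumptions; since many nucleotides point to the same results, the sketch returns each result only once.

```perl
use strict;
use warnings;

# Toy data: five hypothetical result objects and five nucleotides.
my @res    = map { +{ id => "exp_$_" } } 0 .. 4;
my @genome = ( [ 0, 1 ], [ 0, 1 ], [ 1, 2 ], [ 3 ], [ 3, 4 ] );

# Collect the distinct result objects for nucleotides $from..$to (inclusive),
# in order of first appearance.
sub results_in_region {
    my ($from, $to) = @_;
    my %seen;
    my @idx = grep { !$seen{$_}++ } map { @$_ } @genome[ $from .. $to ];
    return @res[@idx];
}

my @region = results_in_region(1, 3);
print join(', ', map { $_->{id} } @region), "\n";   # exp_0, exp_1, exp_2, exp_3
```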

Re^9: Storing large data structures on disk
by planetscape (Chancellor) on Jun 04, 2010 at 23:14 UTC
    The data structure holds results of some biological experiments.

    Might Perl and Bioinformatics be of interest to you?

    HTH,

    planetscape

      Is the planetscape userid a bot? Cos matching on "bio" isn't useful.

        Is the planetscape userid a bot?

        No, I'm not that efficient. ;-)

        HTH,

        planetscape
Re^9: Storing large data structures on disk
by BrowserUk (Patriarch) on Jun 01, 2010 at 08:01 UTC

    How big is @res?

      @res is on the order of 10^5-10^6 elements. Each result "object" is just a small hash with a few fields in it.

      Now that I think about it again, I might want to add an extra level to my AoA - each nucleotide could have multiple (10^3) different lists of experiments, not only one set. The reason is that we also simulate "random experimental results", so we will actually have multiple @res arrays.
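      One way the extra level might be laid out, as a sketch only: the names @res_sets and results_for_set and the id field are hypothetical, and the toy data has two sets instead of 10^3.

```perl
use strict;
use warnings;

# One result array per set: set 0 for the real experiment,
# set 1 for one simulated ("random") run.
my @res_sets = (
    [ { id => 'real_a' }, { id => 'real_b' } ],
    [ { id => 'sim1_a' } ],
);

# $genome[$pos][$set] is the list of indices into $res_sets[$set].
my @genome = (
    [ [ 0, 1 ], [ 0 ] ],   # nucleotide 0
    [ [ 1 ],    [] ],      # nucleotide 1: nothing in the simulated set
);

# Return the result objects for one nucleotide within one set.
sub results_for_set {
    my ($pos, $set) = @_;
    return @{ $res_sets[$set] }[ @{ $genome[$pos][$set] } ];
}

print join(', ', map { $_->{id} } results_for_set(0, 1)), "\n";   # sim1_a
```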

        And is 200 an upper limit to the number of indices for one nucleotide for one experiment?