in reply to Re^4: Storing large data structures on disk
in thread Storing large data structures on disk

  1. When I run your code passing -O=6, for example, it also prints the structure to the screen.

    It should only dump the structure to the screen if -O=2 or less; see the lines that end in if $O <= 2;. If that is not happening, there is something wrong with your copy of the code.

    I added that so that I could quickly check that what got unpacked was the same as what was packed. For small examples only.

  2. What is the meaning of pp here?

    If you look at the third line of code you'll see: use Data::Dump qw[ pp ];. pp in this case stands for "pretty print" and is Data::Dump's equivalent of Data::Dumper's Dumper() function.

  3. Can you explain the heart of the packing: printf O "%s", pack 'V/A*', pack 'V*', @{ $AoA[ $_ ] };

    Okay. First off, update your copy of the code from the original node, where I've switched it from printf to print.

    The guts of the thing is two calls to pack.

    • pack 'V*', @{ $AoA[ $_ ] };

      It goes through the array @AoA (with $_ set to 0 .. $#AoA) one element at a time, getting the reference to each sub-array.

      The @{ ... } bit expands the array reference to the contents of that sub-array.

      The pack format "V*" says: pack all the values in the list (produced above) as 32-bit unsigned integers into a binary string, and return that string.

    • pack 'V/A*', ...

      The second pack template "V/A*" says: return the input binary string ('A*') prefixed ('/') with its length as a 32-bit unsigned integer ('V').

      And the print writes that out to the file.

    As your sub-arrays are variable sized, we need the prefix count so that we know how much of the file to read back into each sub-array when retrieving it (see the round-trip sketch below).

    Note: 'V' packs little-endian ("VAX") order and 'N' packs big-endian ("network") order; you might prefer to use 'N' rather than 'V' if that is more natural on your platform.
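
    To make the round trip concrete, here is a minimal, self-contained sketch. It is not the original script: the file name aoa.bin and the toy @AoA are made up for illustration, but the write side uses exactly the 'V/A*' packing described above, and pp is used at the end to check the copy.

      use strict;
      use warnings;
      use Data::Dump qw[ pp ];

      my @AoA = ( [ 1, 2, 3 ], [ 42 ], [ 7, 8 ] );   # toy data

      # write: each sub-array becomes a length-prefixed block of 32-bit uints
      open my $OUT, '>:raw', 'aoa.bin' or die $!;
      print { $OUT } pack 'V/A*', pack 'V*', @{ $AoA[ $_ ] } for 0 .. $#AoA;
      close $OUT;

      # read back: length prefix first, then that many bytes, then unpack
      open my $IN, '<:raw', 'aoa.bin' or die $!;
      my @copy;
      while( read( $IN, my $lenbuf, 4 ) == 4 ) {
          my $len = unpack 'V', $lenbuf;             # byte count of this record
          read( $IN, my $data, $len ) == $len or die 'short read';
          push @copy, [ unpack 'V*', $data ];
      }
      close $IN;

      pp \@copy;                                     # should match the original @AoA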


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Re^6: Storing large data structures on disk
by roibrodo (Sexton) on Jun 01, 2010 at 04:32 UTC

    Thanks again. One last question for now (I guess I added it to the previous post while you were replying): when the ds becomes too large to store it all in memory, is tying with MLDBM the preferred paradigm? What are the alternatives?

      when the ds becomes too large to store it all in memory, is tying with MLDBM the preferred paradigm? What are the alternatives?

      It really depends upon what is in the data structure and how you are using it. Some of the questions that might influence the best choice are:

      • How long do the application(s) that use it run for?

        A web app that runs hundreds or thousands of times an hour for a few milliseconds each time might require a different solution to a data processing app that loads once a day or week and runs for minutes or hours each time.

      • How much of the data structure is used in each run?

        If only one or two rows are used per run, it might make more sense to leave the data on disk and structure the file such that it can be randomly accessed (see the seek/read sketch after this list).

      • If a large proportion of the data is used for each run, is the structure traversed once or multiple times? Is it traversed serially or randomly?
      • Is the data structure read-only, occasionally written or frequently written?
      • Does only one application use the data concurrently? Or many?
      • Can the processing of the large data set be encoded in SQL, so that it is reduced to producing small result sets?

      The best solution may depend upon the answers to some or all of these questions, and more that arise from those answers. A clear description of the data set and how it is used would be the quickest way of eliciting good answers.
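
      By way of illustration of the "leave the data on disk and access it randomly" option above, here is a hypothetical sketch (the offset index, the file name and the toy data are mine, not from the code earlier in the thread): while writing the length-prefixed records, remember each record's byte offset, so that a single sub-array can later be fetched with seek and read without loading the whole file.

        use strict;
        use warnings;

        my @AoA = ( [ 1, 2, 3 ], [ 42 ], [ 7, 8 ] );   # toy data
        my @offset;                                    # byte offset of each record

        open my $OUT, '>:raw', 'aoa.bin' or die $!;
        for my $i ( 0 .. $#AoA ) {
            $offset[ $i ] = tell $OUT;                 # where this record starts
            print { $OUT } pack 'V/A*', pack 'V*', @{ $AoA[ $i ] };
        }
        close $OUT;

        # later, possibly in another run: fetch only record 1
        open my $IN, '<:raw', 'aoa.bin' or die $!;
        seek $IN, $offset[ 1 ], 0;
        read $IN, my $lenbuf, 4 or die 'short read';
        my $len = unpack 'V', $lenbuf;
        read $IN, my $data, $len;
        my @sub = unpack 'V*', $data;                  # just the requested sub-array
        close $IN;

      For a genome-sized array, the offset index itself could be packed to a companion file (or avoided entirely by using fixed-width records, where each offset is a simple multiplication), so that only the index, rather than the whole data set, need be held in memory.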


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Here is a short description of my application and its intended use.

        The data structure holds results of some biological experiments. The main array of the AoA represents a genome; each subarray holds the results for a specific location in the genome (a "nucleotide"), hence the size of the main array might be very large (up to ~10^10 "nucleotides").

        Each such subarray holds a list of measurements referring to that nucleotide. The results themselves are stored in a separate array, @res (as objects), and what I keep in each subarray of my AoA are the indices of those results in @res. This is because many nucleotides may point to the same results (they are highly dependent). This way, when I want to get the results for some nucleotide, I go to its location in the AoA and follow the list of indices specified there to pull the objects out of @res.

        This is not supposed to work as a web app. The normal usage will be to focus on some specific region (a range of nucleotides), then pull out their results and do something with them (this "something" can be many things). So, yes, the data will usually be taken out of the large AoA in arbitrary chunks. The size of the chunks may vary but will usually be less than 5% of the entire dataset. The dataset is written once and can then become read-only.
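
        In code terms, pulling the results for a region then boils down to two levels of indirection, roughly like the following sketch ($from and $to are just illustrative bounds; @AoA and @res are the arrays described above):

          # given: @res - the result objects
          #        @AoA - $AoA[$nucleotide] = [ indices into @res ]
          my @region_results;
          for my $pos ( $from .. $to ) {
              next unless $AoA[ $pos ];          # skip nucleotides with no measurements
              push @region_results, @res[ @{ $AoA[ $pos ] } ];
          }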