roibrodo has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

For the last few hours I have been trying different ways of storing (serializing) a large data structure to disk. The ds is a 2d array with a few million rows, each with somewhere between 0 and 200 columns containing integers.

I usually use store (or nstore) and retrieve, but I have never tried them on a large ds. Dumping the ds takes quite a long time (more than a few minutes) and, worse than that, consumes almost all the memory of the machine (>97% of >10GB!). BTW, simply printing the entire array to a file takes about half the time it takes to store it, and consumes almost no extra memory, but then I would have to parse it instead of retrieve it...

Anyway, it's quite frustrating. It might be worth mentioning that once on disk, the binary file can be highly compressed (~1:100), but I was not able to figure out if I can use this nice property - compressing before dumping (via freezing) seems to work even worse.

To put it short - I need some robust method (I have many such ds's) that will allow storing large data structures on disk (optionally also compressing them while doing so), will not consume all memory resources, and will hopefully also be fast.

Any ideas? This seems like quite a common task, but I could only find a few tips.

Thanks!

Roi

Replies are listed 'Best First'.
Re: Storing large data structures on disk
by moritz (Cardinal) on May 31, 2010 at 15:04 UTC
    The common technique for large datasets is to use a database.

    In the case you describe a simple CSV file would seem to do it too, and will probably use less memory in the process. Or maybe a compact, home-made binary format, to avoid excessive disc usage. pack and unpack for the win.

    Perl 6 - links to (nearly) everything that is Perl 6.
      Thanks for the reply.

      I followed your hint and read perlpacktut. Since my array is 2d and each row has a different number of values (columns), doesn't that mean it will require a lot of programming overhead to define the templates?

      Also, I would prefer a generic method that will work for different data structures (e.g. hashes of ...).

      I will continue reading on pack and unpack and will appreciate any pointers to examples of their use for the purposes I described.
        Since my array is 2d and each row has a different number of values (columns), doesn't that mean it will require a lot of programming overhead to define the templates?

        It can be as simple as representing each row by an integer which holds the number of numbers in that row, and then the row itself in fixed width (for example, always 4 bytes per number).

        But that all depends on how flexible you want the data retrieval to be, how robust it must be etc.
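
        Something along these lines might work as a starting point (untested sketch; the 'V' template and the encode_row/decode_row helper names are just my choices, not from any module):

        use strict;
        use warnings;

        # One record per row: a 4-byte little-endian count, then that many
        # 32-bit little-endian unsigned integers (pack template 'V').
        sub encode_row {
            my @row = @_;
            return pack( 'V', scalar @row ) . pack( 'V*', @row );
        }

        sub decode_row {
            my ($fh) = @_;
            read( $fh, my $hdr, 4 ) or return;    # 0 bytes read => end of file
            my $count = unpack 'V', $hdr;
            my $buf = '';
            if ( $count ) {
                read( $fh, $buf, 4 * $count ) == 4 * $count
                    or die "truncated row\n";
            }
            return [ unpack 'V*', $buf ];
        }

        Writing is then just a loop printing encode_row(@$_) for each row to a handle opened with ':raw', and reading calls decode_row in a loop until it returns nothing.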

Re: Storing large data structures on disk
by BrowserUk (Patriarch) on May 31, 2010 at 16:59 UTC

    This takes less than 3 minutes each to store and then retrieve 3.2 GB of AoA to disk (1e6 arrays of ave. 100 integers):

    Update: If I use -O=5.8, which translates to somewhat over 630,000 arrays of ave. 100 elements but avoids pushing my machine into swapping, the time is 10 seconds to write and 11 to read.

    #! perl -sw
    use strict;
    use Data::Dump qw[ pp ];
    use Time::HiRes qw[ time ];
    $|++;

    our $O //= 3;

    my @AoA; $#AoA = 10 ** $O;
    $AoA[ $_ ] = [ 1 .. 1+rand 200 ] for 0 .. $#AoA;
    pp \@AoA if $O <= 2;

    ## store: each row is a length-prefixed string of packed 32-bit integers ('V/A*')
    my $start = time;
    open O, '>:raw', 'junk26.bin' or die $!;
    for ( 0 .. $#AoA ) {
        print O pack 'V/A*', pack 'V*', @{ $AoA[ $_ ] }; ## switched printf to print
    }
    close O;
    printf "Store took %.6f secs\n", time() - $start;

    @AoA = ();

    ## retrieve: read the 4-byte length, then that many bytes, and unpack
    $start = time;
    open I, '<:raw', 'junk26.bin' or die $!;
    for ( 0 .. 10 ** $O ) {
        read( I, my $n, 4 );
        read( I, my $buf, unpack 'V', $n );
        $AoA[ $_ ] = [ unpack 'V*', $buf ];
    }
    close I;
    printf "Retrieve took %.6f secs\n", time() - $start;
    pp \@AoA if $O <= 2;

    __END__
    C:\test>junk26 -O=6
    Store took 169.778000 secs
    Retrieve took 170.926000 secs

    Resultant file is 400 MB on disk and gzips to 6 MB.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I have to admit this code is too complex for me - too many shortcuts I'm unfamiliar with, but I'm doing my best to understand it :) (I guess I should move to the newbies section...)

      Anyway, I can't get it to run - I get:

      Bareword found where operator expected at test2.pl line 18, near "printf O "%s", pack 'V/A"
        (Might be a runaway multi-line // string starting on line 8)
        (Do you need to predeclare printf?)
      Bareword found where operator expected at test2.pl line 18, near "', pack 'V"
        (Missing operator before V?)
      Global symbol "@AoA" requires explicit package name at test2.pl line 8.
      Global symbol "@AoA" requires explicit package name at test2.pl line 8.
      Global symbol "@AoA" requires explicit package name at test2.pl line 8.
      Global symbol "@AoA" requires explicit package name at test2.pl line 8.
      Global symbol "$start" requires explicit package name at test2.pl line 8.
      Global symbol "@AoA" requires explicit package name at test2.pl line 8.
      syntax error at test2.pl line 18, near "printf O "%s", pack 'V/A"
      Bad name after raw' at test2.pl line 26.
        Anyway, I can't get it to run

        As others have explained, switch our $O //= 2 to our $O ||= 2 for pre-5.10 perls.

        I have to admit this code is too complex for me - too many shortcuts

        If you have specific questions about particular lines of code, just ask. That is what this place is for.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        Did you use the download code link? You probably need perl 5.10 because of the defined-or operator (//=).
Re: Storing large data structures on disk
by jethro (Monsignor) on May 31, 2010 at 15:46 UTC

    That store/nstore is slower than printing is easy to explain: store has to cope with arbitrary data, while you, with intimate knowledge of the data structure, knew that there would be just columns of integers to process.

    That knowledge is also your biggest advantage. You know the data, you know the access you need.

    If for example all numbers are below 256, each number can be stored in 1 byte. If the array is sparse (i.e. has mostly 0s), you could store only the numbers other than 0 and their positions. Or if numbers are often repeated, compress them to a count and the number. A compression ratio of 1:100 suggests one of these, or repeated occurrences of sequences of numbers. In that case a compression module like Compress::Bzip2 should get good results.
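
    For example, if every value really does fit in a byte, each row could be packed with 'C*' and streamed straight into a compressor. A rough sketch (untested; it uses IO::Compress::Bzip2 rather than Compress::Bzip2, and the file name and 'v' count prefix are arbitrary choices):

    use strict;
    use warnings;
    use IO::Compress::Bzip2 qw($Bzip2Error);

    my @AoA = ( [ 1, 2, 3 ], [], [ 10 .. 20 ] );    # stand-in for the real data

    # Each row: a 16-bit count ('v'), then one unsigned byte ('C') per value.
    # This only works if every value is in 0..255.
    my $z = IO::Compress::Bzip2->new('rows.bz2')
        or die "bzip2 failed: $Bzip2Error\n";

    for my $row ( @AoA ) {
        $z->print( pack( 'v', scalar @$row ) . pack( 'C*', @$row ) );
    }
    $z->close;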

    What compression did you use with freeze? There seems to be no indication in the documentation of Storable that freeze offers any sophisticated compression.

      I also didn't find any documentation that freeze offers any compression, so I thought to first freeze and then compress in memory, but this did not prove to be a good idea.

      Yes, the numbers repeat themselves and Bzip2 works very well. The problem is the step before Bzip2, i.e. how to get the data to disk without using so much memory. I do not understand why nstore uses so much memory.

      Do you think using MLDBM would be a good option? I just read about it, but I'm not sure. The thing is, I will have to load many such structures and use them at once. I wonder what would be the best solution...

        Storable is probably not optimized for low memory consumption. Also, since you freeze and then use bzip2 on that (did you use a pipe for that, i.e. the command line utility bzip2?), you probably have the unfrozen, the frozen and the compressed data in memory at the same time.

        A more memory-conserving algorithm would, for example, freeze one column, push that into the compression pipe and clear that column. Then on to the next column.
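
        A rough sketch of that idea (untested; it freezes one row at a time and assumes the command-line bzip2 is available; the file name and 4-byte length prefix are arbitrary):

        use strict;
        use warnings;
        use Storable qw(freeze);

        my @AoA = map { [ 1 .. 1 + int rand 200 ] } 1 .. 1000;   # stand-in data

        # Stream each frozen row through an external bzip2 process, so only
        # one frozen row is held in memory at a time.
        open my $pipe, '|-', 'bzip2 -9 > rows.storable.bz2'
            or die "Cannot start bzip2: $!";
        binmode $pipe;

        for my $i ( 0 .. $#AoA ) {
            my $frozen = freeze( $AoA[$i] );
            print {$pipe} pack( 'V', length $frozen ), $frozen;
            $AoA[$i] = undef;                    # release this row's memory
        }
        close $pipe or die "bzip2 pipe failed (status $?)";

        Reading back is the mirror image: open a pipe from 'bzip2 -dc rows.storable.bz2', read a 4-byte length, read that many bytes, and thaw.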

        Whether MLDBM is a good option depends very much on the data and what you need to do with it. For example, if you need random access to some but not all of the array elements (or some columns but not all), a database like MLDBM would be an excellent solution. If instead, as you seem to indicate, your processing always needs all of the data (for a Fourier transform or a matrix multiplication or ...), a database is a waste of time. If instead you are searching for patterns, it might be possible to do some preprocessing and store the data as a hash with all possible subpatterns as keys and the locations as data. You see, there are many possibilities.
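
        For reference, a minimal MLDBM sketch (untested; the file name is arbitrary, and note the documented MLDBM caveat that nested values must be stored back as a whole):

        use strict;
        use warnings;
        use MLDBM qw(DB_File Storable);          # DB_File back end, Storable serializer
        use Fcntl qw(O_CREAT O_RDWR);

        tie my %rows, 'MLDBM', 'rows.dbm', O_CREAT | O_RDWR, 0644
            or die "Cannot tie rows.dbm: $!";

        $rows{42} = [ 3, 1, 4, 1, 5 ];           # the row is frozen and written to disk
        my @row = @{ $rows{42} };                # ...and thawed on demand

        # Nested values cannot be changed in place; store the whole row back:
        my $tmp = $rows{42};
        push @$tmp, 9;
        $rows{42} = $tmp;

        untie %rows;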

        PS: You are aware that a few million rows of on average 100 numbers already use 2 GB of memory? Perl stores a lot of internal information about each variable.

Re: Storing large data structures on disk
by JavaFan (Canon) on May 31, 2010 at 15:12 UTC
    It might be worth mentioning that once on disk, the binary file can be highly compressed (~1:100), but I was not able to figure out if I can use this nice property
    Print to a pipe that does the compression? For instance:
    open my $fh, "| gzip -9 > file.gz" or die $!;
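    And to read it back, go through the pipe in the other direction, for example:
    open my $fh, "gzip -dc file.gz |" or die $!;
    # then read and parse line by line from $fh as usual
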
Re: Storing large data structures on disk [database]
by erix (Prior) on Jun 01, 2010 at 11:58 UTC

    The first answer you got suggested using a database. Databases are designed to cope with larger-than-memory datasets. Why not use a database? (e.g. http://www.postgresql.org/).
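
    A minimal sketch of what that could look like with DBI and DBD::Pg (untested; the table layout, names and connection details are made up for illustration):

    use strict;
    use warnings;
    use DBI;

    my @AoA = map { [ 1 .. 1 + int rand 200 ] } 0 .. 999;   # stand-in data

    # Connection details are placeholders; adjust dbname/user/password.
    my $dbh = DBI->connect( 'dbi:Pg:dbname=mydata', 'user', 'password',
                            { RaiseError => 1, AutoCommit => 0 } );

    # One (row, column, value) triple per database row; ragged rows are no problem.
    $dbh->do('CREATE TABLE cells (row_id integer, col_idx integer, value integer)');

    my $ins = $dbh->prepare('INSERT INTO cells (row_id, col_idx, value) VALUES (?, ?, ?)');
    for my $r ( 0 .. $#AoA ) {
        my $row = $AoA[$r];
        $ins->execute( $r, $_, $row->[$_] ) for 0 .. $#$row;
    }
    $dbh->commit;

    # Later: fetch a single row without loading the rest into memory.
    my $vals = $dbh->selectcol_arrayref(
        'SELECT value FROM cells WHERE row_id = ? ORDER BY col_idx', undef, 123 );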

    (I'm also a little curious what data you are dealing with: sequencing, microarrays, SNPs? .cab files? Does that mean Windows? Does that mean a desktop environment (as opposed to $BigServer)? Affymetrix? Anyway, it may not be essential to know, but it may make it more interesting..)

Re: Storing large data structures on disk
by trwww (Priest) on Jun 02, 2010 at 02:26 UTC

    For the last few hours I have been trying different ways of storing (serializing) a large data structure to disk.

    Just to make sure this is perfectly clear, because you seem to be ignoring the advice: the tool that a competent engineer reaches for when presented with this task is a database.

    Otherwise, all you are doing is (poorly) reinventing a database. Even if you pull off this particular task, the very next thing you are going to be asked to do with the data is something that is going to be very simple with a database, but extremely difficult or impossible to implement in your custom data format.

      Yes, I have started digging into it. Any recommended pointers for database newbies?
        I think you are headed towards a commercial relational database. re: This is because many nucleotides may point to the same results (they are highly dependent).

        Your dataset size is huge, but databases are good at saying that stuff "X" belongs to both "A" and "B".

        As just a simple primer and a "learn by doing with MySQL" book, try Learning SQL by Alan Beaulieu. This is just basic introductory stuff, but it will give you the bare basics of how relational tables interact. A huge commercial database will use SQL for the queries.

        From what I've read above, your representation of the dataset just isn't going to work, because no realistic machine, or even network of machines, can implement it. You are going to need multiple "serious machines" and a lot more thought about the organization of the data, what you need to process, and the algorithms to process it.

        From your original post, I see 10GB of actual data. Other estimates I saw above are vastly greater. One approach would be to design for 10x what you have now and get that working (100GB). Jumping 2, 3 or more orders of magnitude past the data you have now is unlikely to be successful (in my experience, 100x is often too large a technology leap for "one shot"). Do 10x, learn stuff, then do another 10x.