roibrodo has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

For the last few hours I have been trying different ways of storing (serializing) a large data structure to disk. The ds is a 2d array with a few million rows, each with somewhere between 0 and 200 columns containing integers.

I usually use store (or nstore) and retrieve, but I have never tried them on a large ds. Dumping the ds takes quite a long time (more than a few minutes) and, worse than that, consumes almost all the memory of the machine (>97% of >10GB!). BTW, simply printing the entire array to a file takes about half the time it takes to store it, and consumes almost no extra memory, but then I would have to parse it instead of retrieve it...

Anyway, it's quite frustrating. It might be worth mentioning that once on disk, the binary file can be highly compressed (~1:100), but I was not able to figure out if I can use this nice property - compressing before dumping (via freezing) seems to work even worse.

To put it short - I need some robust method (I have many such ds's) that will allow storing large data structures on disk (optionally also compressing them while doing so), will not consume all memory resources, and will hopefully also be fast.

Any ideas? This seems like quite a common task, but I could only find a few tips.

Thanks!

Roi

Replies are listed 'Best First'.
Re: Storing large data structures on disk
by moritz (Cardinal) on May 31, 2010 at 15:04 UTC
    The common technique for large datasets is to use a database.

    In the case you describe a simple CSV file would seem to do it too, and will probably use less memory in the process. Or maybe a compact, home-made binary format, to avoid excessive disc usage. pack and unpack for the win.

    Perl 6 - links to (nearly) everything that is Perl 6.
      Thanks for the reply.

      I followed your hint and read perlpacktut. Since my array is 2d and each row has a different number of values (columns), doesn't that mean it will require a lot of programming overhead to define the templates?

      Also, I would prefer a generic method that will work for different data structures (e.g. hashes of ...).

      I will continue reading on pack and unpack and will appreciate any pointers to examples of their use for the purposes I described.
        Since my array is 2d and each row has a different number of values (columns), doesn't that mean it will require a lot of programming overhead to define the templates?

        It can be as simple as representing each row by an integer which holds the number of numbers in that row, and then the row itself in fixed width (for example, always 4 bytes per number).

        But that all depends on how flexible you want the data retrieval to be, how robust it must be etc.
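
        Something along these lines might work as a starting point (untested sketch; the 'V' template and the encode_row/decode_row helper names are just my choices, not from any module):

        use strict;
        use warnings;

        # One record per row: a 4-byte little-endian count, then that many
        # 32-bit little-endian unsigned integers (pack template 'V').
        sub encode_row {
            my @row = @_;
            return pack( 'V', scalar @row ) . pack( 'V*', @row );
        }

        sub decode_row {
            my ($fh) = @_;
            read( $fh, my $hdr, 4 ) or return;    # 0 bytes read => end of file
            my $count = unpack 'V', $hdr;
            my $buf = '';
            if ( $count ) {
                read( $fh, $buf, 4 * $count ) == 4 * $count
                    or die "truncated row\n";
            }
            return [ unpack 'V*', $buf ];
        }

        Writing is then just a loop printing encode_row(@$_) for each row to a handle opened with ':raw', and reading calls decode_row in a loop until it returns nothing.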

Re: Storing large data structures on disk
by BrowserUk (Patriarch) on May 31, 2010 at 16:59 UTC

    This takes less than 3 minutes each to store and then retrieve 3.2 GB of AoA to disk (1e6 arrays of ave. 100 integers):

    Update: If I use -O=5.8, which translates to somewhat over 630,000 arrays of ave. 100 elements but avoids pushing my machine into swapping, the time is 10 seconds to write and 11 to read.

    #! perl -sw
    use strict;
    use Data::Dump qw[ pp ];
    use Time::HiRes qw[ time ];
    $|++;

    our $O //= 3;

    my @AoA; $#AoA = 10 ** $O;
    $AoA[ $_ ] = [ 1 .. 1+rand 200 ] for 0 .. $#AoA;
    pp \@AoA if $O <= 2;

    ## store: each row is a length-prefixed string of packed 32-bit integers ('V/A*')
    my $start = time;
    open O, '>:raw', 'junk26.bin' or die $!;
    for ( 0 .. $#AoA ) {
        print O pack 'V/A*', pack 'V*', @{ $AoA[ $_ ] }; ## switched printf to print
    }
    close O;
    printf "Store took %.6f secs\n", time() - $start;

    @AoA = ();

    ## retrieve: read the 4-byte length, then that many bytes, and unpack
    $start = time;
    open I, '<:raw', 'junk26.bin' or die $!;
    for ( 0 .. 10 ** $O ) {
        read( I, my $n, 4 );
        read( I, my $buf, unpack 'V', $n );
        $AoA[ $_ ] = [ unpack 'V*', $buf ];
    }
    close I;
    printf "Retrieve took %.6f secs\n", time() - $start;
    pp \@AoA if $O <= 2;

    __END__
    C:\test>junk26 -O=6
    Store took 169.778000 secs
    Retrieve took 170.926000 secs

    Resultant file is 400 MB on disk and gzips to 6 MB.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I have to admit this code is too complex for me - too many shortcuts I'm unfamiliar with, but I'm doing my best to understand it :) (I guess I should move to the newbies section...)

      Anyway, I can't get it to run - I get:

      Bareword found where operator expected at test2.pl line 18, near "printf O "%s", pack 'V/A"
        (Might be a runaway multi-line // string starting on line 8)
        (Do you need to predeclare printf?)
      Bareword found where operator expected at test2.pl line 18, near "', pack 'V"
        (Missing operator before V?)
      Global symbol "@AoA" requires explicit package name at test2.pl line 8.
      Global symbol "@AoA" requires explicit package name at test2.pl line 8.
      Global symbol "@AoA" requires explicit package name at test2.pl line 8.
      Global symbol "@AoA" requires explicit package name at test2.pl line 8.
      Global symbol "$start" requires explicit package name at test2.pl line 8.
      Global symbol "@AoA" requires explicit package name at test2.pl line 8.
      syntax error at test2.pl line 18, near "printf O "%s", pack 'V/A"
      Bad name after raw' at test2.pl line 26.
        Anyway, I can't get it to run

        As others have explained, switch our $O //= 2 to our $O ||= 2 for pre-5.10 perls.

        I have to admit this code is too complex for me - too many shortcuts

        If you have specific questions about particular lines of code, just ask. That is what this place is for.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        Did you use the download code link? You probably need perl 5.10 because of the defined-or operator (//=).
Re: Storing large data structures on disk
by jethro (Monsignor) on May 31, 2010 at 15:46 UTC

    That store/nstore is slower than printing is easy to explain: store has to cope with arbitrary data, while you, with intimate knowledge of the data structure, knew that there would be just columns of integers to process.

    That knowledge is also your biggest advantage. You know the data, you know the access you need.

    If for example all numbers are below 256, each number can be stored in 1 byte. If the array is sparse (i.e. has mostly 0s), you could store only the numbers other than 0 and their positions. Or if numbers are often repeated, compress them to a count and the number. A compression ratio of 1:100 suggests one of these, or repeated occurrences of sequences of numbers. In that case a compression module like Compress::Bzip2 should get good results.
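
    For example, if every value really does fit in a byte, each row could be packed with 'C*' and streamed straight into a compressor. A rough sketch (untested; it uses IO::Compress::Bzip2 rather than Compress::Bzip2, and the file name and 'v' count prefix are arbitrary choices):

    use strict;
    use warnings;
    use IO::Compress::Bzip2 qw($Bzip2Error);

    my @AoA = ( [ 1, 2, 3 ], [], [ 10 .. 20 ] );    # stand-in for the real data

    # Each row: a 16-bit count ('v'), then one unsigned byte ('C') per value.
    # This only works if every value is in 0..255.
    my $z = IO::Compress::Bzip2->new('rows.bz2')
        or die "bzip2 failed: $Bzip2Error\n";

    for my $row ( @AoA ) {
        $z->print( pack( 'v', scalar @$row ) . pack( 'C*', @$row ) );
    }
    $z->close;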

    What compression did you use with freeze? There seems to be no indication in the documentation of Storable that freeze offers any sophisticated compression.

      I also didn't find any documentation that freeze offers any compression, so I thought to first freeze and then compress in memory, but this did not prove to be a good idea.

      Yes, the numbers repeat themselves and Bzip2 works very well. The problem is the step before Bzip2, i.e. how to get the data to disk without using so much memory. I do not understand why nstore uses so much memory.

      Do you think using MLDBM would be a good option? I just read about it, but I'm not sure. The thing is, I will have to load many such structures and use them at once. I wonder what would be the best solution...

        Storable is probably not optimized for low memory consumption. Also, since you freeze and then use bzip2 on that (did you use a pipe for that, i.e. the command line utility bzip2?), you probably have the unfrozen, the frozen and the compressed data in memory at the same time.

        A more memory-conserving algorithm would, for example, freeze one column, push that into the compression pipe and clear that column. Then on to the next column.
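
        A rough sketch of that idea (untested; it freezes one row at a time and assumes the command-line bzip2 is available; the file name and 4-byte length prefix are arbitrary):

        use strict;
        use warnings;
        use Storable qw(freeze);

        my @AoA = map { [ 1 .. 1 + int rand 200 ] } 1 .. 1000;   # stand-in data

        # Stream each frozen row through an external bzip2 process, so only
        # one frozen row is held in memory at a time.
        open my $pipe, '|-', 'bzip2 -9 > rows.storable.bz2'
            or die "Cannot start bzip2: $!";
        binmode $pipe;

        for my $i ( 0 .. $#AoA ) {
            my $frozen = freeze( $AoA[$i] );
            print {$pipe} pack( 'V', length $frozen ), $frozen;
            $AoA[$i] = undef;                    # release this row's memory
        }
        close $pipe or die "bzip2 pipe failed (status $?)";

        Reading back is the mirror image: open a pipe from 'bzip2 -dc rows.storable.bz2', read a 4-byte length, read that many bytes, and thaw.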

        Whether MLDBM is a good option depends very much on the data and what you need to do with it. For example, if you need random access to some but not all of the array elements (or some columns but not all), a database like MLDBM would be an excellent solution. If instead, as you seem to indicate, your processing always needs all of the data (for a Fourier transform or a matrix multiplication or ...), a database is a waste of time. If instead you are searching for patterns, it might be possible to do some preprocessing and store the data as a hash with all possible subpatterns as keys and the locations as data. You see, there are many possibilities.
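
        For reference, a minimal MLDBM sketch (untested; the file name is arbitrary, and note the documented MLDBM caveat that nested values must be stored back as a whole):

        use strict;
        use warnings;
        use MLDBM qw(DB_File Storable);          # DB_File back end, Storable serializer
        use Fcntl qw(O_CREAT O_RDWR);

        tie my %rows, 'MLDBM', 'rows.dbm', O_CREAT | O_RDWR, 0644
            or die "Cannot tie rows.dbm: $!";

        $rows{42} = [ 3, 1, 4, 1, 5 ];           # the row is frozen and written to disk
        my @row = @{ $rows{42} };                # ...and thawed on demand

        # Nested values cannot be changed in place; store the whole row back:
        my $tmp = $rows{42};
        push @$tmp, 9;
        $rows{42} = $tmp;

        untie %rows;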

        PS: You are aware that a few million rows of on average 100 numbers already use 2 GB of memory? Perl stores a lot of internal information about each variable.

Re: Storing large data structures on disk
by JavaFan (Canon) on May 31, 2010 at 15:12 UTC
    It might be worth mentioning that once on disk, the binary file can be highly compressed (~1:100), but I was not able to figure out if I can use this nice property
    Print to a pipe that does the compression? For instance:
    open my $fh, "| gzip -9 > file.gz" or die $!;
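    And to read it back, go through the pipe in the other direction, for example:
    open my $fh, "gzip -dc file.gz |" or die $!;
    # then read and parse line by line from $fh as usual
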
Re: Storing large data structures on disk [database]
by erix (Prior) on Jun 01, 2010 at 11:58 UTC

    The first answer you got suggested using a database. Databases are designed to cope with larger-than-memory datasets. Why not use a database? (e.g. http://www.postgresql.org/).
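
    A minimal sketch of what that could look like with DBI and DBD::Pg (untested; the table layout, names and connection details are made up for illustration):

    use strict;
    use warnings;
    use DBI;

    my @AoA = map { [ 1 .. 1 + int rand 200 ] } 0 .. 999;   # stand-in data

    # Connection details are placeholders; adjust dbname/user/password.
    my $dbh = DBI->connect( 'dbi:Pg:dbname=mydata', 'user', 'password',
                            { RaiseError => 1, AutoCommit => 0 } );

    # One (row, column, value) triple per database row; ragged rows are no problem.
    $dbh->do('CREATE TABLE cells (row_id integer, col_idx integer, value integer)');

    my $ins = $dbh->prepare('INSERT INTO cells (row_id, col_idx, value) VALUES (?, ?, ?)');
    for my $r ( 0 .. $#AoA ) {
        my $row = $AoA[$r];
        $ins->execute( $r, $_, $row->[$_] ) for 0 .. $#$row;
    }
    $dbh->commit;

    # Later: fetch a single row without loading the rest into memory.
    my $vals = $dbh->selectcol_arrayref(
        'SELECT value FROM cells WHERE row_id = ? ORDER BY col_idx', undef, 123 );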

    (I'm also a little curious what data you are dealing with: sequencing, microarrays, SNPs? .cab files? Does that mean Windows? Does that mean a desktop environment (as opposed to $BigServer)? Affymetrix? Anyway, it may not be essential to know, but it may make it more interesting..)

Re: Storing large data structures on disk
by trwww (Priest) on Jun 02, 2010 at 02:26 UTC

    For the last few hours I have been trying different ways of storing (serializing) a large data structure to disk.

    Just to make sure this is perfectly clear, because you seem to be ignoring the advice: the tool that a competent engineer reaches for when presented with this task is a database.

    Otherwise, all you are doing is (poorly) reinventing a database. Even if you pull off this particular task, the very next thing you are going to be asked to do with the data is something that is going to be very simple with a database, but extremely difficult or impossible to implement in your custom data format.

      Yes, I have started digging into it. Any recommended pointers for database newbies?
        I think you are headed towards a commercial relational database. re: This is because many nucleotides may point to the same results (they are highly dependent).

        Your dataset size is huge, but databases are good at saying that stuff "X" belongs to both "A" and "B".

        As just a simple primer and a "learn by doing with MySQL" book, try Learning SQL by Alan Beaulieu. This is just basic introductory stuff, but it will give you the bare basics of how relational tables interact. A huge commercial database will use SQL for the queries.

        From what I've read above, your representation of the dataset just isn't going to work, because no realistic machine, or even network of machines, can implement it. You are going to need multiple "serious machines" and a lot more thought about the organization of the data, what you need to process, and the algorithms to process it.

        From your original post, I see 10GB of actual data. Other estimates I saw above are vastly greater. One approach would be to design for 10x what you have now and get that working (100GB). Jumping 2, 3 or more orders of magnitude past the data you have now is unlikely to be successful (in my experience, 100x is often too large a technology leap for "one shot"). Do 10x, learn stuff, then do another 10x.