in reply to Storing large data structures on disk

That store/nstore is slower than printing is easy to explain: store has to cope with arbitrary data, while you, with intimate knowledge of the data structure, knew there would be just columns of integers to process.

That knowledge is also your biggest advantage. You know the data, you know the access you need.

If, for example, all numbers are below 256, each number can be stored in one byte. If the array is sparse (i.e. mostly zeros), you could store only the non-zero numbers and their positions. Or, if numbers are often repeated, compress them to a count and the number. A compression rate of 100 suggests either one of these cases or repeated occurrences of sequences of numbers. In that case a compression module like Compress::Bzip2 should get good results.
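
A minimal sketch of the run-length idea, assuming one column of non-negative integers below 2**32 (the subroutine names and pack formats are mine, just for illustration):

    use strict;
    use warnings;

    # Run-length encode one column: each run of equal numbers
    # becomes a (count, value) pair, packed as 32-bit unsigned ints.
    sub rle_encode {
        my @col = @_;
        my @pairs;
        for my $n (@col) {
            if (@pairs && $pairs[-1][1] == $n) {
                $pairs[-1][0]++;          # extend the current run
            }
            else {
                push @pairs, [1, $n];     # start a new run
            }
        }
        # 8 bytes per run instead of a full Perl scalar per number
        return pack 'N*', map { @$_ } @pairs;
    }

    sub rle_decode {
        my ($packed) = @_;
        my @flat = unpack 'N*', $packed;
        my @col;
        while (@flat) {
            my ($count, $value) = splice @flat, 0, 2;
            push @col, ($value) x $count;
        }
        return @col;
    }

    my @column = (0, 0, 0, 7, 7, 7, 7, 2);
    my $bytes  = rle_encode(@column);
    my @again  = rle_decode($bytes);      # (0, 0, 0, 7, 7, 7, 7, 2)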

What compression did you use with freeze? There seems to be no indication in the documentation of Storable that freeze offers any sophisticated compression.


Re^2: Storing large data structures on disk
by roibrodo (Sexton) on May 31, 2010 at 17:11 UTC

    I also didn't find any documentation that freeze offers any compression, so I thought to first freeze, then compress in memory, but this did not prove to be a good idea.

    Yes, the numbers repeat themselves and Bzip2 works very well. The problem is the step before Bzip2, i.e. how to get the data to disk without using so much memory. I do not understand why nstore uses so much memory.

    Do you think MLDBM would be a good option? I just read about it, but I'm not sure. The thing is, I will have to load many such structures and use them all at once. I wonder what the best solution would be...

      Storable is probably not optimized for low memory consumption. Also, since you freeze and then use bzip on the result (did you use a pipe for that, i.e. the command line utility bzip2?), you probably have the unfrozen, the frozen, and the compressed data in memory at the same time.

      A more memory-conserving approach would be, for example, to freeze one column, push it into the compression pipe, clear that column, and then move on to the next one; see the sketch below.
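
      A rough sketch of that idea, assuming the data sits in a 2-D array @data of integer columns and using the IO::Compress::Bzip2 module (the file name and the dummy data are made up for illustration):

          use strict;
          use warnings;
          use Storable qw(freeze);
          use IO::Compress::Bzip2 qw($Bzip2Error);

          # dummy columns, just for the sketch
          my @data = map { [ map { int rand 256 } 1 .. 100 ] } 1 .. 1000;

          my $z = IO::Compress::Bzip2->new('columns.bz2')
              or die "bzip2 failed: $Bzip2Error\n";

          for my $i (0 .. $#data) {
              my $frozen = freeze($data[$i]);        # serialize one column only
              $z->print(pack 'N', length $frozen);   # length prefix for reading back
              $z->print($frozen);
              $data[$i] = undef;                     # release the column right away
          }
          $z->close;

      Reading the file back would reverse this with IO::Uncompress::Bunzip2 and Storable's thaw, using the length prefixes to split the stream back into columns.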

      Whether MLDBM is a good option depends very much on the data and what you need to do with it. For example, if you need random access to some but not all of the array elements (or to some columns but not all), a DBM-backed store like MLDBM would be an excellent solution. If instead, as you seem to indicate, your processing always needs all of the data (for a Fourier transform, a matrix multiplication, ...), a database is a waste of time. If, on the other hand, you are searching for patterns, it might be possible to do some preprocessing and store the data as a hash with all possible subpatterns as keys and their locations as values. You see, there are many possibilities.
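
      For the random-access case, a minimal sketch following MLDBM's documented tie interface (DB_File as the backend and the file name are my assumptions):

          use strict;
          use warnings;
          use Fcntl;                          # for O_CREAT, O_RDWR
          use MLDBM qw(DB_File Storable);     # DB_File backend, Storable serializer

          tie my %store, 'MLDBM', 'columns.db', O_CREAT | O_RDWR, 0640
              or die "Cannot tie: $!";

          # write one column per key; only that column lives in memory
          $store{"col_$_"} = [ map { int rand 256 } 1 .. 100 ] for 0 .. 9;

          # later: random access to a single column without loading the rest
          my $col3 = $store{col_3};
          print "first value of column 3: $col3->[0]\n";

          untie %store;

      One MLDBM caveat: each value is serialized as a whole, so you cannot update a single element of a stored column in place; fetch the column, change it, and store it back.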

      PS: Are you aware that a few million rows of, on average, 100 numbers already use about 2 GB of memory? Perl stores a lot of internal bookkeeping with every variable, so each integer costs far more than the 4 or 8 bytes of its raw value.
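
      A quick way to see that overhead for yourself, using the CPAN module Devel::Size (the row count below is just an example):

          use strict;
          use warnings;
          use Devel::Size qw(total_size);

          # one row of 100 small integers
          my @row = (0) x 100;
          printf "one row of 100 integers: %d bytes\n", total_size(\@row);

          # extrapolate to a few million rows
          my $rows = 2_000_000;
          printf "estimated for %d rows: %.1f GB\n",
              $rows, $rows * total_size(\@row) / 2**30;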