davidnicol has asked for the wisdom of the Perl Monks concerning the following question:

Occasionally I want to work with a big data set, too big for even the humongous memory on my late-model COTS supercomputer.

Now is such a time, and I am considering writing an Inline::C module that will mmap my massive fixed-length-record data set and provide an @-overloaded or otherwise magical object that looks like an array to Perl, but pulls fixed-length records out of my file. Surely this has been done already; Sys::Mmap and substr would take me pretty much there. Something like

use Sys::Mmap;
use constant BDfname       => "bigdatafile";
use constant BD_RECORDSIZE => 9876;

( -s BDfname() ) % BD_RECORDSIZE and die "size not a multiple of recsize";
open BIGDATA, '<', BDfname() or die $!;
mmap( $BD, -s BDfname(), PROT_READ, MAP_SHARED, BIGDATA ) or die "mmap: $!";

sub BDrec($) { substr( $BD, BD_RECORDSIZE * $_[0], BD_RECORDSIZE ) }
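and, to make BDrec actually look like an array, perhaps a minimal read-only tie wrapper on top of it (the package name here is my own invention, and STORE is deliberately left out):

# hypothetical sketch: present the mmapped file as a read-only tied array
package BigDataArray;

sub TIEARRAY  { bless {}, shift }
sub FETCH     { main::BDrec( $_[1] ) }                           # record number $_[1]
sub FETCHSIZE { ( -s main::BDfname() ) / main::BD_RECORDSIZE() } # record count

package main;
tie my @records, 'BigDataArray';
print $records[3];    # fourth record, pulled straight out of the mmap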
might work. But still,

My question is: what would happen if perl_mymalloc were replaced with something that created and memory-mapped files instead of doing mallocs?

Would I get a fully checkpointable perl, able to gracefully deal with larger-than-memory data sets without eating up all my swap (of course it might still thrash, but not on the designated swap space)?

How severe would the speed hit be (before hitting the memory limit)?

Has this been done already?

Replies are listed 'Best First'.
Re: implications of mmapping the whole enchilada
by rhesa (Vicar) on Oct 20, 2006 at 21:37 UTC
    Have you considered using Tie::File instead? You can treat your file as an array, and Tie::File will read the appropriate lines for you behind the scenes.

    I have never tried using Tie::File with fixed-length records, so I don't know if recsep => \$record_length would Do The Right Thing. But since you can use Tie::File on an already-opened filehandle, you can work around that if necessary. Open the file like you normally would, and set local $/ = \1234; to change the handle from line-oriented to record-oriented.
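    A minimal sketch of the basic usage (file name is a placeholder); whether the fixed-length tricks above Do The Right Thing is something you would have to test:

        use Tie::File;

        # each array element corresponds to one record in the file;
        # Tie::File only reads the records it actually needs
        tie my @records, 'Tie::File', 'bigdatafile' or die "tie: $!";

        print $records[42], "\n";    # fetches just that record
        untie @records;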

Re: implications of mmapping the whole enchilada
by perrin (Chancellor) on Oct 21, 2006 at 04:15 UTC
    I think you should consider using BerkeleyDB's array-like interface, or some similar dbm solution. It handles the memory management for you, and can be hidden behind a tied array if you like.
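
    One way to get that array-like interface is the RECNO access method, sketched here with the DB_File binding (the file name is a placeholder):

        use DB_File;
        use Fcntl;

        # RECNO exposes a Berkeley DB file as a tied Perl array;
        # BDB handles caching and on-disk layout behind the scenes
        tie my @records, 'DB_File', 'bigdata.recno', O_RDWR|O_CREAT, 0644, $DB_RECNO
            or die "cannot tie: $!";

        $records[0] = "first record";
        print "have ", scalar(@records), " records\n";
        untie @records;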
Re: implications of mmapping the whole enchilada
by duckyd (Hermit) on Oct 20, 2006 at 21:27 UTC