Mic has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have a question about loading large data sets into BerkeleyDB. I am trying to load a hash table with a data set of over 22 million unique tokens, and I have tied the hash using the BerkeleyDB::Hash module. The problem is that it has taken three days to load only 15% of the data. Has anyone else tried to load that much data using BerkeleyDB (or any other method) and succeeded in a reasonable amount of time? Thanks!

Replies are listed 'Best First'.
Re: BerkeleyDB and large datasets
by diotalevi (Canon) on Jun 25, 2003 at 16:41 UTC

    You should go to Sleepycat's website and read the documentation, paying special attention to the C API (since that's what Paul's BerkeleyDB module wraps). I suspect you should consider altering some of the default caching settings or one of the other tuning options. Be sure not to miss goodies like access method tuning.
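
    Those caching and access-method settings can be passed straight to the tie. A minimal sketch, assuming the BerkeleyDB module is installed; the file name tokens.db, the 256 MB cache, and the element count are illustrative guesses, not values from the thread:

```perl
use strict;
use warnings;
use BerkeleyDB;

# Hypothetical file name; cache size and key count are tuning guesses.
my %tokens;
tie %tokens, 'BerkeleyDB::Hash',
    -Filename  => 'tokens.db',
    -Flags     => DB_CREATE,
    # Raise the cache well above the small default so the working set
    # of hash pages stays in memory instead of being re-read from disk.
    -Cachesize => 256 * 1024 * 1024,
    # Pre-size the hash table for the expected number of keys so it is
    # not repeatedly split and re-organised while loading.
    -Nelem     => 22_000_000
  or die "Cannot open tokens.db: $BerkeleyDB::Error";

$tokens{example} = 1;            # ordinary hash ops go through the cache
my $stored = $tokens{example};   # read back through the tie
untie %tokens;
unlink 'tokens.db';              # clean up the sketch's scratch file
```

    The same options are documented in the BerkeleyDB module's POD; -Nelem corresponds to the C API's set_h_nelem, which is exactly the kind of access-method tuning mentioned above.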

Re: BerkeleyDB and large datasets
by fglock (Vicar) on Jun 25, 2003 at 16:21 UTC

    I had a similar problem some time ago. My bottleneck was not having enough RAM. You can check this with the top utility while your program is running: if your process is swapping, you will see a "D" instead of an "R" in the process status column, and CPU use will be very low.

Re: BerkeleyDB and large datasets
by shotgunefx (Parson) on Jun 25, 2003 at 20:28 UTC
    I had a similar problem with around 10 million entries for a search engine I wrote. Are you accessing some of the keys more than once while loading the data? If so, you can save a ton of time by presorting the data. I went from 8 hours to 35 minutes.
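
    A minimal sketch of the presorting idea in plain Perl (the tie to BerkeleyDB::Hash is elided and the token counts are made up): sort the records by key so duplicates are adjacent, then accumulate and write each distinct key exactly once instead of re-reading it from the database.

```perl
use strict;
use warnings;

# Made-up input: (token, count) pairs, with one token repeated.
my @records = ( [ cat => 1 ], [ dog => 2 ], [ cat => 3 ], [ ape => 1 ] );

# Sort by token so duplicate keys are adjacent; with a tied BerkeleyDB
# hash this also keeps successive writes on nearby pages.
my @sorted = sort { $a->[0] cmp $b->[0] } @records;

# Accumulate in a running sum and write one record per distinct key.
my %db;                       # stand-in for the tied hash
my ( $cur, $sum ) = ( undef, 0 );
for my $rec (@sorted) {
    my ( $token, $count ) = @$rec;
    if ( defined $cur && $token ne $cur ) {
        $db{$cur} = $sum;     # single write for the finished key
        $sum = 0;
    }
    $cur = $token;
    $sum += $count;
}
$db{$cur} = $sum if defined $cur;   # flush the final key
```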

    -Lee

    UPDATE
    Another note: if you're using keys() on the hash, it will actually build a list of all the keys in memory, which will certainly thrash your box. That is, of course, the correct behaviour.
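
    One way around that, sketched here with a plain hash (the behaviour is the same for a hash tied to BerkeleyDB::Hash): iterate with each(), which fetches one key/value pair at a time instead of materialising the full key list first.

```perl
use strict;
use warnings;

my %h = ( a => 1, b => 2, c => 3 );   # stand-in for the tied hash

# keys(%h) would build the whole key list in memory up front;
# each() walks the hash one pair at a time.
my $sum = 0;
while ( my ( $k, $v ) = each %h ) {
    $sum += $v;
}
```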
Re: BerkeleyDB and large datasets
by The Mad Hatter (Priest) on Jun 25, 2003 at 16:13 UTC
    Is there a good reason why you can't use a real database, such as Postgres or Oracle (which is commercial)? It would doubtless make your job much easier...

    Update Please excuse my ignorance... ; )