Questions about BerkeleyDB

lihao has asked for the wisdom of the Perl Monks concerning the following question:

Hello, monks:

I am quite new to BerkeleyDB and hope to get some guidelines from experienced developers before digging into the details of tuning it for better performance.

So far I've built several BerkeleyDB files (sizes from 20MB to 1.5GB, in both BerkeleyDB::Hash and BerkeleyDB::Btree) and as of this writing, they are all working well.

Most of my BerkeleyDB files are imported like the following sample code:

#!/usr/bin/perl
use strict;
use warnings;

use BerkeleyDB;

my $db_file = '/path/to/lib/myapp.db';
unlink $db_file if -f $db_file;

my $bdb = tie my %tree, 'BerkeleyDB::Btree',
    -Filename => $db_file,
    -Flags    => DB_CREATE
  or die "can not create $db_file: $! $BerkeleyDB::Error";

my $raw_file = '/path/to/raw.dat';
open my $fin, '<', $raw_file
  or die "can not open $raw_file for reading: $!";

while (<$fin>) {
    chomp;    # keep the trailing newline out of the stored value
    my ($key, $value) = split /\t/;
    next if not fit_condition($key);
    $bdb->db_put($key, $value);
}
close $fin;

sub fit_condition
{
    #.skip.#
}

For an input with about 50M key-value pairs, the above code took about 10 hours to finish building the DB file (about 17M key-value pairs, 1.1GB in file-size). My questions are:

1. Are there any BerkeleyDB::Env parameters that I can use to speed up importing raw data into BerkeleyDB?
2. How should I set the Cachesize in BerkeleyDB::Env, given that the machine has about 16GB of RAM? Does the cache size influence reads and writes differently in BerkeleyDB? Is there a sensible percentage of RAM to devote to caching (in BerkeleyDB, or in MySQL on a DB server)?
3. How should I stringify the value if I want to save a complex data structure instead of a plain string? JSON, or is there a better way?
4. What is the most important factor in choosing between BerkeleyDB::Btree and BerkeleyDB::Hash? Are there considerations at the application level, beyond the algorithmic ones?
5. Is it possible to open a BerkeleyDB file read-only in some applications, so that I don't need to worry about accidental modification of the data? And if such an accident does happen, how can I recover the DB file, other than by backing it up regularly?
6. Can I use the same BerkeleyDB database file generated by Perl from PHP code?

Other information:

$BerkeleyDB::db_version => 4.3
$BerkeleyDB::VERSION => 0.34

Thank you for any helpful suggestions or links.

lihao


Replies are listed 'Best First'.
Re: Questions about BerkeleyDB by perrin (Chancellor) on Jun 13, 2008 at 20:37 UTC
You can answer most of these questions yourself with some simple Google searches or a quick scan of the documentation. Here are a few answers that are perl-specific or judgment calls:

1. You can try playing with the cachesize. I think there's no way around the fact that a bunch of I/O is required.
2. You'll have to experiment to find the right cachesize. Start by giving it a lot, and then cut back. Obviously cache doesn't make a difference for writes or for non-repeating workloads.
3. Use Storable.
4. BTree has always been much faster in my experience.
5. I believe there's a read-only flag.
6. Sure, as long as you use the same version of BerkeleyDB.
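As a rough illustration of points 1, 2 and 5, here is a minimal sketch of opening a Btree file with an explicit cache for the build and then read-only for lookups. It assumes the -Cachesize option (in bytes) and the DB_RDONLY flag of the BerkeleyDB module; the sub names open_for_build and open_read_only are made up for this example, and the right cache size is something to find by experiment rather than a recommendation.

```perl
use strict;
use warnings;
use BerkeleyDB;

# Open (or create) a Btree file for bulk loading with an explicit cache.
# $cache_bytes is an illustrative parameter, not a tuned value.
sub open_for_build {
    my ($file, $cache_bytes) = @_;
    my $db = BerkeleyDB::Btree->new(
        -Filename  => $file,
        -Flags     => DB_CREATE,
        -Cachesize => $cache_bytes,
    ) or die "cannot create $file: $! $BerkeleyDB::Error";
    return $db;
}

# Open an existing file read-only; writes through this handle
# should be rejected by the library.
sub open_read_only {
    my ($file) = @_;
    my $db = BerkeleyDB::Btree->new(
        -Filename => $file,
        -Flags    => DB_RDONLY,
    ) or die "cannot open $file read-only: $! $BerkeleyDB::Error";
    return $db;
}
```

A build script might call open_for_build('/path/to/lib/myapp.db', 512 * 1024 * 1024), while the applications that only look things up would use open_read_only on the same path. Checking the status returned by db_put (0 means success) is how an attempted write on the read-only handle shows up.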
Re: Questions about BerkeleyDB by TGI (Parson) on Jun 13, 2008 at 23:12 UTC
3. I'd use YAML or XML or Storable or something else of that nature. You might like to read Burned by Storable. Whether to use Storable really depends on your particular needs. Have you looked at MLDBM?

TGI says moo
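For the Storable option, a minimal sketch of turning a nested structure into a string value and back could look like this. The helper names encode_value and decode_value are made up for the example; nfreeze is used rather than freeze so the bytes are in network order, which helps when the file moves between machines.

```perl
use strict;
use warnings;
use Storable qw(nfreeze thaw);

# Serialize a reference (hash, array, nested) into a byte string
# that can be stored as a BerkeleyDB value.
sub encode_value {
    my ($data) = @_;
    return nfreeze($data);
}

# Rebuild the original structure from the stored byte string.
sub decode_value {
    my ($bytes) = @_;
    return thaw($bytes);
}
```

With the database handle from the original post, that would be used as $bdb->db_put($key, encode_value(\%record)) on the way in and decode_value($value) after a db_get. Storable output is Perl-specific, so if the PHP side from question 6 has to read the values, a language-neutral format such as JSON is probably the easier choice.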
Re: Questions about BerkeleyDB by starbolin (Hermit) on Jun 14, 2008 at 08:20 UTC
Poor performance is due to the keys not being sorted. One million partially sorted keys on my 500MHz P4 take 2 minutes. For random keys I only get one tenth as many keys stored in about the same time: 100k keys in 2 minutes 18 seconds. Extrapolating from my numbers, it would take a minimum of 6 hours to store 17 million unsorted keys on my system; that's ignoring that larger BTrees take longer to access. Since most of the time is taken up sorting the keys, I don't see where changes in parameters will have much effect. This is the price we pay for being able to do random access recall. The best you could do is buy a faster processor.

Nevertheless, things to try are:

- Verify that your page size is equal to your disk page size.
- Set your cache size as high as possible within limits of performance for the rest of the system.
- Are there other users on your system? If not, turn off transactions.
- Are you running out of memory? Look at the output from top.

s//----->\t/;$~="JAPH";s//\r<$~~/;{s|~$~-|-~$~|||s |-$~~|$~~-|||s,<$~~,<~$~,,s,~$~>,$~~>,, $|=1,select$,,$,,$,,1e-1;print;redo}
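If unsorted keys are the bottleneck, one thing to try is sorting the filtered pairs before inserting them. The sketch below is only an illustration: load_sorted and its $keep and $put callbacks are made-up names, it assumes the filtered pairs fit in memory (otherwise the raw file could be pre-sorted with an external sort tool), and it relies on Perl's default string sort lining up with the Btree's default byte-wise key order.

```perl
use strict;
use warnings;

# Read tab-separated key/value lines from $fh, keep the ones accepted
# by $keep->($key), then hand them to $put->($key, $value) in sorted
# key order. Returns the number of pairs passed to $put.
sub load_sorted {
    my ($fh, $keep, $put) = @_;
    my %pairs;
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line;
        # Skip lines without a tab-separated value or rejected keys.
        next unless defined $value && $keep->($key);
        $pairs{$key} = $value;    # a repeated key keeps its last value
    }
    my $count = 0;
    for my $key (sort keys %pairs) {
        $put->($key, $pairs{$key});
        $count++;
    }
    return $count;
}
```

With the handle from the original post this would be called as load_sorted($fin, \&fit_condition, sub { $bdb->db_put(@_) }). Whether it actually beats the 10 hour build depends on the data, so it is worth timing on a small sample first.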