in reply to BerkleyDB versions 2.x , 3.x

While I have not used it extensively, I did do some research into it over a year ago.

If you give some details of what you need to do with it, I might have something concrete to say. As it stands, all I can tell you is that it is written by the same person who wrote DB_File, that it is fast, that it can handle a lot of data, that the data is only directly accessible on one machine, and that there are some locking gotchas I know about...

Beyond that, if you want to learn more I strongly recommend starting here and reading their documentation. Even though I eventually found that it was not appropriate for what I wanted to do, I learned a lot of useful stuff (both practical and theoretical) from that documentation...


Re: Re (tilly) 1: BerkleyDB versions 2.x , 3.x
by chorg (Monk) on Apr 30, 2001 at 17:39 UTC
    Thanks - I'm reading the docs today...

    Basically we've got a clustered web server setup, but only one database server. I would like to distribute some of the data load, such as user authentication, individual site data, etc., over more than one data management system. I'm not a great fan of DBMs, but when I saw what was possible with Berkeley 3.x, I was enthusiastic.

    What did you mean when you said that the data is only directly accessible from one machine?

    The locking gotchas that you know about - what are they?
    _______________________________________________
    "Intelligence is a tool used achieve goals, however goals are not always chosen wisely..."

      The locking gotchas first.

      Do not follow the Cookbook and the old DB_File docs. Never flock the handle to the dbm. If you must use flock, flock an external lock-file.
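
      A minimal sketch of the external lock-file pattern, using DB_File (the same idea applies to the BerkeleyDB module). The paths here are hypothetical:

```perl
use strict;
use warnings;
use Fcntl qw(:flock O_CREAT O_RDWR);
use DB_File;

# Lock a separate file -- never flock the dbm handle itself.
my $lockfile = '/tmp/mydb.lock';
open my $lock, '>', $lockfile or die "open $lockfile: $!";
flock($lock, LOCK_EX) or die "flock: $!";

# Only touch the dbm while holding the external lock.
tie my %db, 'DB_File', '/tmp/mydb.db', O_CREAT | O_RDWR, 0644, $DB_HASH
    or die "tie: $!";
$db{answer} = 42;
untie %db;    # flush and close the dbm *before* releasing the lock

flock($lock, LOCK_UN);
close $lock;
```

      The point of the ordering is that the dbm's internal buffers are flushed by untie before any other process is allowed in.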

      Next, the one-machine issue. Take a look at this list of what Berkeley DB is not. It is an access library. Furthermore, it is an access library that requires shared memory. As they point out, that makes it important to put Berkeley DB on a local filesystem.

      That means that a given dbm is only directly accessible from one machine. You can have a client-server relationship (e.g. LDAP) so that the data can be indirectly accessed from multiple machines, though I have never tried that. Also, since the library is mapping things into the process that is using it, recovering from unexpected application failure in a CGI environment is a non-trivial affair. (You *never* know when someone else is coming along, and race issues are a far bigger problem in a web environment than in traditional applications. When I last checked, admittedly a while ago, Berkeley DB was still catching up.)

      In your situation this means that you likely will want to limit your dbm usage to lifting read-only load. Unless your clustering solution allows you to reliably send a client back to the same machine until you have synchronized data, you really don't want to use it for read/write. You could, of course, stage read/write information to a local dbm and then transfer to a permanent record later.
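
      To make the stage-and-transfer idea concrete, here is a rough sketch. The spool path is made up, and the transfer routine is passed in as a coderef because in real code it would be a DBI insert against your central server:

```perl
use strict;
use warnings;
use Fcntl qw(O_CREAT O_RDWR);
use DB_File;

# Per-machine spool dbm (hypothetical path).
my $spool_file = '/tmp/spool.db';
tie my %spool, 'DB_File', $spool_file, O_CREAT | O_RDWR, 0644, $DB_HASH
    or die "tie: $!";

# Called by the web application: a cheap local write.
sub record_event {
    my ($id, $data) = @_;
    $spool{$id} = $data;
}

# Called later (from a cron job, say): push spooled records to the
# permanent store, then clear the local copies that were sent.
sub flush_spool {
    my ($send) = @_;    # coderef that does the real transfer
    for my $id (keys %spool) {
        $send->($id, $spool{$id});
        delete $spool{$id};
    }
}
```

      Note that this only makes sense for data you can afford to see a little late centrally; anything that must be immediately consistent still belongs on the database server.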

      Another significant detail that I discovered at the same time. All of the transactional guarantees that people give you with databases? To get them to work with Linux you must be using Linux 2.4, and you must have your data on a raw IO partition. Otherwise there is a layer of buffering at the Linux filesystem level which means that the database does not really know what has and has not hit disk. For most purposes this does not matter, but if you have a hard reliability limit to hit, you should be aware of this. (I do not think that Linux is alone in having obscure limits like this, I just happen to know for that OS what they are.)