in reply to exploring XP DBI DBD DBM SQL RDMS MySQL Postgres

A couple of notes supporting your choice. First of all, AgentM is definitely in the minority in disliking DBI. In fact, in the same thread you have DBI is NOT dying by merlyn, who is (*ahem*) somewhat better known in the Perl community.

But as for DB_File, its size limit is whichever comes first of the maximum file size on your filesystem, the amount of space you have, or (depending on configuration parameters) 100 terabytes. Size is simply not an issue, though with large data sets you are strongly advised to go with a BTree rather than a hash.
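
(By way of illustration, here is a minimal sketch of tying a hash to an on-disk BTree with DB_File; the file name and data are made up, and it assumes DB_File is built against a reasonably recent Berkeley DB.)

    use strict;
    use Fcntl;
    use DB_File;

    # Tie a hash to an on-disk BTree rather than the default hash format;
    # with large data sets the BTree behaves much better.
    tie my %db, 'DB_File', 'big_data.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
        or die "Cannot open big_data.db: $!";

    $db{'some key'} = 'some value';    # store
    print $db{'some key'}, "\n";       # fetch

    untie %db;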

However, there are two large drawbacks to using an interface to Berkeley DB. The first is that (as you found) it has a very simple data model. It will handle a lot of data, but it does not (yet) offer SQL or other complex data structures. (I heard a rumor that they were considering adding this, but I don't know whether they have.) The second is that it maps the library directly into your process. This makes recovery and restoration of an active system hard. It also makes it hard to have a system with multiple webservers whose data needs to be kept in sync. (You cannot mount the database over NFS.)

And finally, about Postgres. If you want a full-featured open-source database now, I would recommend it. And while I have not played with it, I understand that not only does it have stored procedures, but you can even choose to write them in Perl. It will not offer the raw performance of MySQL or Berkeley DB. (That is the price you pay for features and safety.) Heavy database folks tell me that its optimizer is not as good as Oracle's. But then again, the person who mentioned that to me also tends to load 50 GB of data into his database every week. I doubt that your needs are quite so extreme...
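
(If you are curious what a Perl stored procedure looks like, here is a hedged sketch done through DBI; it assumes DBD::Pg is installed and the plperl language has been enabled in the database, and the connection parameters and function are made up.)

    use strict;
    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=test', 'user', 'secret',
                           { RaiseError => 1 });

    # The body between the quotes is ordinary Perl, executed inside the server.
    $dbh->do(q{
        CREATE FUNCTION add_two(integer, integer) RETURNS integer AS '
            my ($a, $b) = @_;
            return $a + $b;
        ' LANGUAGE 'plperl';
    });

    my ($sum) = $dbh->selectrow_array('SELECT add_two(2, 3)');
    print "$sum\n";    # prints 5

    $dbh->disconnect;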

Re: Re (tilly) 1: exploring XP DBI DBD DBM SQL RDMS MySQL Postgres
by mandog (Curate) on Sep 04, 2001 at 03:42 UTC
    Thanks for the reply.

    Could you offer more detail on what you mean by "maps the library directly into your process"? I think I get the gist of it and it doesn't seem good, but I'm having trouble visualizing the details...

    btw, there is some evidence that Postgres does relatively well under real-world loads... I read somewhere that SourceForge switched from MySQL to Postgres because performance in the current Postgres is acceptable. I can't find the link though.



    --mandog

      Here is what I mean.

      A traditional relational database works on a client-server model. There is a server process running, and clients talk to the server. By contrast, Berkeley DB does not run this way (though they likely have that as an option by now). Instead, each process that wants to use the data loads up the access routines, attaches to a shared memory segment, and proceeds to fetch data. The shared memory segment is how simultaneously connected processes cooperate and make sure that if one is writing while another is reading, you don't run into problems.
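
      (To make that concrete, here is a hedged sketch of the two access styles side by side; the file name, table, and host are placeholders, and it assumes DB_File and DBD::Pg are installed.)

          use strict;
          use Fcntl;
          use DB_File;
          use DBI;

          # Berkeley DB: this very process opens the data file and the shared
          # memory segment; there is no separate database server to talk to.
          tie my %hash, 'DB_File', '/var/data/app.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
              or die "Cannot tie: $!";
          $hash{counter} = ($hash{counter} || 0) + 1;
          untie %hash;

          # Postgres: a server process (possibly on another machine) does the
          # real work; this process only sends queries over a socket.
          my $dbh = DBI->connect('dbi:Pg:dbname=app;host=db.example.com',
                                 'user', 'secret', { RaiseError => 1 });
          my ($count) = $dbh->selectrow_array('SELECT count(*) FROM items');
          $dbh->disconnect;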

      This imposes three big limitations.

      The first is that all processes connecting to the database have to be on the same machine, connecting to the same shared memory segment.

      The second is that you cannot have much of a security model for your data. Each client has direct access to all of the data if it wants it.

      The third is that you need to externally manage when you allow processes to connect. For instance, if someone kills a process that is interacting with the dbm, the dbm is likely left in an inconsistent state. There is no way for anyone to detect this automatically. To recover from it you need to make sure that nobody is going to connect to the database, and then have a single process repair it. While the facilities for that exist in the library, in a web environment it is up to you to make sure that all web processes coordinate on a single lock-file so they know not to attempt to access the database.
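
      (Just to make the lock-file idea concrete, here is a hedged sketch of one way that coordination could look; the path is a placeholder and the repair step is only indicated in comments.)

          use strict;
          use Fcntl qw(:flock);

          # Every web process takes a shared lock on an agreed-upon lock file
          # before touching the dbm; a repair job takes an exclusive lock so
          # nothing else can connect while recovery runs.
          open my $lock, '>', '/var/run/myapp-dbm.lock' or die "Cannot open lock: $!";

          flock($lock, LOCK_SH) or die "Cannot lock: $!";
          # ... tie the DB_File database and do the real work here ...
          flock($lock, LOCK_UN);

          # The repair process would instead do:
          #   flock($lock, LOCK_EX) or die "Cannot lock: $!";
          #   ... run recovery on the database file, then release the lock ...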

      The client-server model involves a lot more up-front overhead, but suffers from none of the above deficiencies. The third in particular is why, when I investigated dbms a couple of years ago, I decided that a dbm was not appropriate for any critical data in a web environment.