To paraphrase Philip Greenspun, "A few databases would make our website more useful." Not having noticed this excellent thread, I set out to figure out whether to go the RDBMS route or the file-system route.

The mod_perl people seem to think a simple file-based system like DB_File can handle 5,000 to 10,000 rows, which is larger than I can imagine any of our tables getting.

As a nice plus, DB_File comes with Perl on Linux, where I would use it. Given this information and these two tenets of XP...

...I thought I was set. Thinking a little more, though, I realize I'm going to want to use SQL almost right away, and DB_File doesn't do SQL. (Right?)

So, back to the drawing board... This level 10 person has a strong negative opinion of DBI. However, the DBI clan looks like a happy and helpful group, and this node by trantor, which touches on avoiding the problems of switching between different databases, comforts me. (Even though I am not very clear on how s/// and the database-independent quote method are different.)
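
For what it's worth, here is a minimal sketch of the difference as I understand it (the DSN, table, and column are made up): a hand-rolled s/// escapes only the characters you remember to escape, in the way one particular database expects, while $dbh->quote lets the driver escape and delimit the value for whatever database you are actually connected to.

    use DBI;

    my $dbh = DBI->connect('dbi:Pg:dbname=test', 'user', 'password',
                           { RaiseError => 1 });

    my $name = "O'Brien";

    # Hand-rolled s///: handles only the one character I thought of.
    (my $escaped = $name) =~ s/'/''/g;
    $dbh->do("INSERT INTO users (name) VALUES ('$escaped')");

    # $dbh->quote(): the driver escapes AND delimits the value
    # however the underlying database requires.
    my $quoted = $dbh->quote($name);   # 'O''Brien', quotes included
    $dbh->do("INSERT INTO users (name) VALUES ($quoted)");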

For those not yet part of the clan, DBI stands for "Database Interface" and is a generic way to access a large number of file- and server-based databases from Perl. DBD stands roughly for "Database Driver"; for each database (Oracle, Sybase, Postgres, MySQL, etc.) there is a separate DBD module. DBM refers to Berkeley database files; see DB_File.
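
To make that concrete, the basic shape of a DBI program looks like this (the DSN, table, and columns are invented for illustration); only the connect string names a particular DBD:

    use DBI;

    # The DSN picks the driver: dbi:mysql:..., dbi:Pg:..., dbi:Oracle:...
    my $dbh = DBI->connect('dbi:mysql:database=test;host=localhost',
                           'user', 'password', { RaiseError => 1 });

    my $sth = $dbh->prepare('SELECT name, score FROM tests WHERE name = ?');
    $sth->execute('mandog');

    while (my ($name, $score) = $sth->fetchrow_array) {
        print "$name: $score\n";
    }

    $dbh->disconnect;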

I'm still hoping "SQL" ne "database server", so I look at the database category on CPAN and search PerlMonks. A bunch of people say nice things about DBD::RAM, and I come across this helpful node by an anonymous (but wise) monk pointing out that DBD::AnyData replaces DBD::RAM.
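
If I'm reading the DBD::AnyData docs right, usage looks roughly like this (the file name and columns are invented); the ad_catalog call maps a flat file to a table name:

    use DBI;

    my $dbh = DBI->connect('dbi:AnyData(RaiseError=>1):');

    # Treat an existing CSV file as a table named 'users'.
    $dbh->func('users', 'CSV', '/tmp/users.csv', 'ad_catalog');

    my $rows = $dbh->selectall_arrayref('SELECT name FROM users');
    print "$_->[0]\n" for @$rows;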

But wait: DBD::AnyData doesn't support joins! And my first query is likely to be something like:

    SELECT users.name, AVG(tests.score)
    FROM users, tests
    WHERE users.name = tests.name
    GROUP BY users.name
    ORDER BY users.name;

I guess I need a database server after all. The two contenders in my price range (free) seem to be MySQL and Postgres. There seem to be more people using MySQL, but a year ago the OpenACS people had some mean things to say about MySQL. It does look like there have been improvements; however, useful stuff like stored procedures, triggers, and nested queries doesn't look like it's there yet.

While the feature list for DBD::Pg doesn't include access to stored procedures (????), it looks like Postgres is the database for me...

(Barring counter arguments from my wiser peers...)

--mandog

Re (tilly) 1: exploring XP DBI DBD DBM SQL RDMS MySQL Postgres
by tilly (Archbishop) on Sep 04, 2001 at 00:38 UTC
    A couple of notes supporting your choice. First of all, AgentM is definitely in the minority in disliking DBI. In fact, in the same thread you have DBI is NOT dying by merlyn, who is (*ahem*) somewhat better known in the Perl community.

    But as for DB_File, its size limit is the maximum size of a file on your filesystem, the amount of disk space you have, or (depending on configuration parameters) 100 terabytes. Size is simply not an issue, though with large data sets you are strongly advised to go with a BTree rather than a hash.
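
    For the record, switching to a BTree is just a change to the tie arguments (the file name here is made up):

        use DB_File;
        use Fcntl;    # for O_CREAT and O_RDWR

        # $DB_BTREE instead of the default $DB_HASH; BTrees keep keys
        # sorted and hold up better as the file grows large.
        tie my %scores, 'DB_File', '/tmp/scores.db',
            O_CREAT|O_RDWR, 0644, $DB_BTREE
            or die "Cannot tie: $!";

        $scores{mandog} = 42;
        untie %scores;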

    However, there are two large drawbacks to using an interface to Berkeley DB. The first is that (as you found) it has a very simple data model. It will handle a lot of data, but it does not (yet) offer SQL or other complex data structures. (I heard a rumor that they were considering adding this, but I don't know whether they have.) The second is that it maps the library directly into your process. This makes recovery and restoration of an active system hard. It also makes it hard to run a system with multiple webservers whose data needs to be kept in sync. (You cannot mount the database over NFS.)

    And finally, about Postgres: if you want a full-featured open-source database now, I would recommend it. While I have not played with it, I understand that not only does it have stored procedures, but you can even choose to write them in Perl. It will not offer the raw performance of MySQL or Berkeley DB (the price you pay for features and safety), and heavy database folks tell me that its optimizer is not as good as Oracle's. But then again, the person who mentioned that to me also tends to load 50 GB of data into his database every week. I doubt that your needs are quite so extreme...
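
    I have not tried this either, but from the Postgres documentation a stored procedure written in Perl looks roughly like this (the function and its logic are invented, and the plperl language must already be installed in the database, e.g. with createlang):

        $dbh->do(q{
            CREATE FUNCTION grade(integer) RETURNS text AS '
                my ($score) = @_;
                return $score >= 60 ? "pass" : "fail";
            ' LANGUAGE 'plperl'
        });

        my ($grade) = $dbh->selectrow_array('SELECT grade(75)');   # "pass"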

      Thanks for the reply.

      Could you offer more detail on what you mean by "maps the library directly into your process"? I think I get the gist of it, and it doesn't seem good, but I'm having trouble visualizing the details...

      BTW, there is some evidence that Postgres does relatively well under real-world loads... I read somewhere that SourceForge switched from MySQL to Postgres because performance in the current Postgres is acceptable. I can't find the link, though.



      --mandog

        Here is what I mean.

        A traditional relational database works on a client-server model: there is a server process running, and clients talk to it. By contrast, Berkeley DB does not run this way (though they likely have that as an option by now). Instead, each process that wants to use the data loads up the access routines, attaches to a shared memory segment, and proceeds to fetch data. The shared memory segment is how simultaneously connecting processes cooperate and make sure that if one is writing while another is reading, you don't have problems.

        This imposes three big limitations.

        The first is that all processes connecting to the database have to be on the same machine, connecting to the same shared memory segment.

        The second is that you cannot have much of a security model for your data. Each client has direct access to all of the data if it wants it.

        The third is that you need to manage externally when processes are allowed to connect. For instance, if someone kills a process that is interacting with the dbm, the dbm is likely left in an inconsistent state. There is no way for anyone to detect this automatically. To recover from it, you need to make sure that nobody is going to connect to the database, then have a single process repair it. While the facilities for that exist in the library, in a web environment it is up to you to make sure that all web processes coordinate on a single lock-file so they know not to attempt to access the database.
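
        A minimal sketch of that coordination, assuming every process that touches the dbm honors the same lock file (the paths are invented):

            use Fcntl qw(:flock);

            # Readers and writers take a shared lock; a repair pass takes
            # an exclusive lock, so it starts only once nobody else is
            # connected.
            open(my $lock, '>', '/var/run/board.lock') or die "open: $!";
            flock($lock, LOCK_SH) or die "flock: $!";

            # ... tie the DB_File and do the real work here ...

            close($lock);    # releases the lock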

        The client-server model involves a lot more up-front overhead, but suffers from none of the above deficiencies. The third in particular is why, when I investigated dbms a couple of years ago, I decided that a dbm was not appropriate for any critical data in a web environment.

(ichimunki) Re: exploring XP DBI DBD DBM SQL RDMS MySQL Postgres
by ichimunki (Priest) on Sep 04, 2001 at 16:29 UTC
    Having used both a little, I see no serious differences for what I gather you are planning. However, if you intend to get into serious data analysis, or would eventually like to move past the warmed-over SQL interface to the filesystem, I would go with PostgreSQL. Despite the excellent MySQL documentation that points out alternative techniques, I think supporting basic SQL concepts like sub-selects is important.

    If you are going to be getting 100,000 hits a day, you might want to seriously investigate the speed differences with realistic tests of both systems on the same hardware. Other than that, the only reasons I can think of to go with MySQL are these: it is more widely used and supported; you have an application built on it that you want to reuse (and even with DBI, I can't imagine switching is as easy as changing a couple of lines of Perl to point to a new db server in any but the most trivial applications); or you don't control the selection process (e.g., your ISP provides MySQL but not PostgreSQL). Otherwise I can't imagine why one wouldn't go with PostgreSQL.
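
    To show what "a couple of lines" means in the trivial case: only the connect string names the server, though any vendor-specific SQL elsewhere still has to be ported by hand (the DSNs are made up):

        # MySQL:
        my $dbh = DBI->connect('dbi:mysql:database=app', $user, $pass);

        # Switching to PostgreSQL changes only this line, provided every
        # query elsewhere sticks to SQL both servers understand:
        # my $dbh = DBI->connect('dbi:Pg:dbname=app', $user, $pass);
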
Re: exploring XP DBI DBD DBM SQL RDMS MySQL Postgres
by cadfael (Friar) on Sep 04, 2001 at 19:07 UTC
    My experience and training have hammered home the point that Decision Support (DSS) and Online Transaction Processing (OLTP) present different challenges and loads for database servers.

    All DSS, which is read-only, really needs is good indices and sound logical and physical design. A lot of RAM allocated to the data cache helps, since reading from cache is a whole lot faster than reading from disk.

    OLTP, on the other hand, needs to assure data integrity while rows are inserted, deleted, and modified. Triggers, declarative constraints, procedures, rules, defaults, and transaction processing with commits and rollbacks all come into play here, and this takes CPU cycles as well as physical reads and writes to the database.
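
    In DBI terms, that transaction processing looks like this (the table and values are invented): turn off AutoCommit, then commit or roll back the unit of work as a whole:

        my $dbh = DBI->connect('dbi:Pg:dbname=app', $user, $pass,
                               { RaiseError => 1, AutoCommit => 0 });

        eval {
            $dbh->do('UPDATE accounts SET balance = balance - 100 WHERE id = 1');
            $dbh->do('UPDATE accounts SET balance = balance + 100 WHERE id = 2');
            $dbh->commit;      # both changes take effect...
        };
        if ($@) {
            $dbh->rollback;    # ...or neither does
        }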

    Concurrency and data integrity are thus often at loggerheads when it comes to database design. We have here the "2 out of 3" principle in operation:

    You can have it quickly;
    you can have it cheaply;
    or you can have it right.

    But you can only have two of the three.

    MySQL excels at DSS applications, but a production database where data integrity is crucial needs more than MySQL is able to offer. We use a full-featured RDBMS for our production databases, but several smaller applications are using MySQL. It's a question of using the appropriate tool for the job.

    -----
    "Computeri non cogitant, ergo non sunt"

Re: exploring XP DBI DBD DBM SQL RDMS MySQL Postgres
by mpeppler (Vicar) on Oct 05, 2001 at 22:53 UTC
    If I may offer some comments...

    At a previous job I implemented a message-board-type system using DB_File (although the rest of the site used Sybase). This was for a large site with several million users, but the number of messages per forum was expected to be small (i.e., several million forums, each one quite small).

    This worked really well until the system was bought by another company that wanted to migrate their message boards to our code base. They had 40,000 boards, some with several thousand messages, and requirements that differed enough from the original design to cause serious performance problems: we now had to scan the entire DB file to get the information we needed.

    Ouch.

    So I ended up re-implementing this code to use a relational database back-end (which was actually quite easy for this particular task).

    So, what I mean to say is that using a database engine is a good idea: it makes things easier to extend, and it allows you to make performance decisions at the database level.

    I'd also like to add that you can get Sybase 11.0.3.3 completely free for Linux. It's not the latest version of Sybase, but it is stable and offers stored procedures, etc.

    Just my opinion, of course :-)

    Michael

Re: exploring XP DBI DBD DBM SQL RDMS MySQL Postgres
by shotgunefx (Parson) on Sep 04, 2001 at 18:12 UTC
    As an aside, I am currently working on a search program for a client. I opted for a tied hash and DB_File instead of DBI+MySQL for application reasons. Right now it is handling over a million records with no problems.

    -Lee

    "To be civilized is to deny one's nature."
Re: exploring XP DBI DBD DBM SQL RDMS MySQL Postgres
by cadfael (Friar) on Sep 04, 2001 at 19:12 UTC
    > While the feature list for DBD::Pg doesn't
    > include access to stored procedures (????),
    > it looks like Postgres is the database for me...

    Hmmm. Does the Postgres SQL dialect support an "exec" command?

    e.g. "exec stored_procedure"

    -----
    "Computeri non cogitant, ergo non sunt"