I have spent the last two years building and maintaining a web application. As it gets more use, some weak spots are beginning to emerge. Clearly the largest one is where the Perl scripts (both CGI and administration) prod the database, asking "Got anything for me to do?" or "Hey! Give me a list of the documents according to these parameters!"

The most obvious symptom is that things slow down as more documents (the system has to do with document processing) are added. I have one server with about ten systems running, most with about 100 documents -- it's absolutely fine. I have another server with 6000 documents -- it's better now than it was with 9000 documents, but it's still too slow.

Apart from wholesale changes like using POE, I'd like to be able to speed things up, or even prioritize things so that the CGI gets first dibs on the database, the "Hey!" jobs are next, and the administrative grunt scripts are at the bottom of the heap, grabbing spare cycles to do garbage collection whenever no one else is busy.

This morning as I sit, drink my coffee and ponder possible approaches, I'm thinking that some sort of cascading lock might work .. if the CGI sets a lock, then the "Hey!" task and the Janitor task will see that and stay away. Likewise, if the "Hey!" task sets a lock, the Janitor will stay away. If a higher-ranked task runs, it doesn't care whether a lower-ranked task has set a lock and is running.
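
As a strawman, here's roughly how that might look with one lock file per rank and flock. The file names, rank numbers and task names are all invented for illustration, and the gap between probing the higher ranks and taking our own lock means this is advisory only -- a sketch of the idea, not a finished design:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Fcntl qw(:flock);

    # Rank 0 = CGI, 1 = the "Hey!" poller, 2 = the Janitor; lower number wins.
    my @lock_files = ('/tmp/rank0.lock', '/tmp/rank1.lock', '/tmp/rank2.lock');

    sub higher_rank_busy {
        my ($my_rank) = @_;
        for my $rank (0 .. $my_rank - 1) {
            open my $fh, '>', $lock_files[$rank] or die "open: $!";
            # If we can't grab the lock, a higher-priority task is holding it.
            unless (flock $fh, LOCK_EX | LOCK_NB) {
                close $fh;
                return 1;
            }
            flock $fh, LOCK_UN;    # release the probe lock right away
            close $fh;
        }
        return 0;
    }

    sub run_as_rank {
        my ($my_rank, $work) = @_;
        return if higher_rank_busy($my_rank);    # defer to anyone above us
        open my $fh, '>', $lock_files[$my_rank] or die "open: $!";
        flock $fh, LOCK_EX or die "flock: $!";   # announce that this rank is busy
        $work->();
        flock $fh, LOCK_UN;
        close $fh;
    }

    # The CGI runs as rank 0 and never defers; the Janitor only runs
    # when neither the CGI nor the poller is busy:
    run_as_rank(2, sub { print "janitor doing garbage collection\n" });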

So, this isn't a fully cooked plan yet, but I thought I would throw it out there for feedback. Thanks monks!

Alex / talexb / Toronto

"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Musings about a database ecology
by Mutant (Priest) on Dec 20, 2004 at 18:05 UTC

    Your first step could be (and possibly *should* be - although I can't quite tell from your description where the bottleneck lies) optimising the database and your queries. DB & query optimisation is a fine art, and I suggest consulting the documentation for your specific database. Most have fairly lengthy sections on optimisation (and how to view / interpret query plans, etc).

    9000 'things' isn't really a hell of a lot for most half-decent DB systems (even if the 'things' span multiple tables, requiring joins). I've run plenty of systems with many more, and (after optimising) had few issues with performance.

    Besides that, I'd also look at whether your hardware is up to the mark - that's a fairly easy thing to gauge: just install some good performance statistics software.

    Application changes would be my last port of call. But even then, I'd first look at individual modules, think about profiling them, and optimise code where possible (although I wouldn't spend a whole lot of time on this unless there's something really obvious). Also, I'd look at CPAN and see if there's something already there that's been properly optimised for what I'm trying to do (maybe even as a replacement for another CPAN module I'm already using).

    Finally, I'd begin to consider the type of architectural changes you're talking about. Although these types of changes may be the most sure-fire way of improving performance (because they're completely customised to your specific problems), they're also the most likely to introduce other problems.

Re: Musings about a database ecology
by dragonchild (Archbishop) on Dec 20, 2004 at 16:38 UTC
    Clearly the largest one is where the Perl scripts (both CGI and administration) prod the database, asking "Got anything for me to do?"

    To me, this implies that the database contains events that need processing. Design considerations aside, I get very scared when a database is used to queue up events. IMHO, much better would be a small long-running process that accepted XML-RPC/SOAP/HTTP/protocol-of-choice connections along the lines of:

    • Add event (w/params)
    • Check for event(s) of type(s) X (and Y and Z)
    • Accept event N
    • Event N complete

    Then the database (which is very expensive to talk to) is only used when you need to talk to the data (i.e., the documents). Plus, your prioritization can be done by this very lightweight daemon.

    The neat thing is that you can put this on the same server as your database / application and barely feel it. A daemon like this has a very small footprint.
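
    For what it's worth, a bare-bones version of such a daemon -- using a plain line-based TCP protocol rather than XML-RPC or SOAP, and with the port number, command names and replies all invented for illustration -- might start out something like this:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use IO::Socket::INET;

        # The whole queue lives in memory: id => { type, params, state }.
        my %event;
        my $next_id = 1;

        my $server = IO::Socket::INET->new(
            LocalPort => 7777,
            Listen    => 5,
            ReuseAddr => 1,
        ) or die "listen: $!";

        while (my $client = $server->accept) {
            while (my $line = <$client>) {
                chomp $line;
                my ($cmd, @args) = split ' ', $line;
                next unless defined $cmd;
                if ($cmd eq 'ADD') {                # ADD <type> <params...>
                    my $id = $next_id++;
                    $event{$id} = { type => shift(@args), params => "@args", state => 'pending' };
                    print {$client} "OK $id\n";
                }
                elsif ($cmd eq 'CHECK') {           # CHECK <type> [<type> ...]
                    my %want = map { $_ => 1 } @args;
                    my @ids = grep { $event{$_}{state} eq 'pending' && $want{ $event{$_}{type} } }
                              sort { $a <=> $b } keys %event;
                    print {$client} "OK @ids\n";
                }
                elsif ($cmd eq 'ACCEPT') {          # ACCEPT <id>
                    my $id = $args[0] || 0;
                    if ($event{$id} && $event{$id}{state} eq 'pending') {
                        $event{$id}{state} = 'taken';
                        print {$client} "OK $id $event{$id}{params}\n";
                    }
                    else { print {$client} "ERR no such pending event\n" }
                }
                elsif ($cmd eq 'DONE') {            # DONE <id>
                    delete $event{ $args[0] || 0 };
                    print {$client} "OK\n";
                }
                else { print {$client} "ERR unknown command\n" }
            }
        }

    It keeps everything in memory and serves one client at a time, which is plenty for a handful of pollers; persistence is a separate question (see below).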

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      The trouble is, if you want these events reliably queued, you need to store them in a database of some kind. It doesn't have to be the same one as your other data, and it doesn't even have to be relational, but it needs to be a system that can guarantee atomic operations, and survive a restart. You could write a whole separate daemon and store things yourself, but you'll probably end up writing your own version of a database daemon. I'd suggest just setting up a separate lightweight database, if the main one is already overloaded.
        but it needs to be a system that can guarantee atomic operations, and survive a restart

        I would want such a system, but talexb hasn't indicated whether either of those is needed. Plus, you don't need atomicity if you only have one consumer. If the daemon is single-threaded, atomicity is undesired overhead.

        A database of some kind, even if it's just Cache, is definitely needed at some layer. I would definitely avoid the relational overhead if it's just a queue of discrete atoms.

        But, I think that saying "It needs a database" and "It must use a database as the primary queue implementation" are two separate statements. Flushing to a database is definitely important, to make sure that restarts are survivable [1]. However, I would use a Perl AoH as the queue.

        [1] Of course, survivable is a spectrum. I would initially propose flushing to the datastore every minute or so. So, you would have up to 60 seconds of data that could be lost. Depending on the system, this may be ok or it may not be ok. (In some systems, it might even be desirable, depending on your other components.)
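
        As a rough sketch of what I mean -- flushing to a Storable snapshot file here rather than a real database, with the file path and the 60-second interval picked purely for the example:

            use strict;
            use warnings;
            use Storable qw(nstore retrieve);

            # The live queue is just an array of hashes; the datastore only ever
            # sees snapshots of it.
            my @queue;    # each element: { id => ..., type => ..., params => ... }

            my $snapshot_file = '/var/tmp/event-queue.snapshot';
            @queue = @{ retrieve($snapshot_file) } if -e $snapshot_file;

            sub add_event  { push @queue, $_[0] }
            sub next_event { shift @queue }

            # Anything queued since the last flush is lost if the daemon dies
            # in between -- that's the trade-off described in [1].
            sub flush_queue { nstore \@queue, $snapshot_file }

            # In the daemon's main loop:
            my $last_flush = time;
            while (1) {
                # ... accept and serve requests here ...
                if (time - $last_flush >= 60) {
                    flush_queue();
                    $last_flush = time;
                }
                sleep 1;
            }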

        Being right, does not endow the right to be rude; politeness costs nothing.
        Being unknowing, is not the same as being stupid.
        Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
        Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

        To me, this implies that the database contains events that need processing. Design considerations aside, I get very scared when a database is used to queue up events.

      Quite. One idea to speed things up was to put 'events' into a much smaller table and use triggers on the bigger tables to add things to the small table. That way I no longer scan the big table.

      It's an idea that I hope to explore in the near future.
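
      Roughly -- with the table, column and function names invented purely for illustration, and assuming PL/pgSQL is installed -- the trigger idea might look something like this, driven from DBI:

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect('dbi:Pg:dbname=docs', 'user', 'pass',
                                 { RaiseError => 1, AutoCommit => 1 });

          # A small table of pending work, plus a trigger that fills it whenever
          # a row lands in the big documents table.
          $dbh->do(q{
              CREATE TABLE pending_jobs (
                  doc_id    integer   NOT NULL,
                  queued_at timestamp NOT NULL DEFAULT now()
              )
          });

          $dbh->do(q{
              CREATE FUNCTION queue_new_document() RETURNS trigger AS '
                  BEGIN
                      INSERT INTO pending_jobs (doc_id) VALUES (NEW.id);
                      RETURN NEW;
                  END;
              ' LANGUAGE plpgsql
          });

          $dbh->do(q{
              CREATE TRIGGER document_queued
                  AFTER INSERT ON documents
                  FOR EACH ROW EXECUTE PROCEDURE queue_new_document()
          });

          # The "Hey!" poller now scans the tiny table instead of the big one:
          my $jobs = $dbh->selectcol_arrayref(
              'SELECT doc_id FROM pending_jobs ORDER BY queued_at');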

      Alex / talexb / Toronto

      "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

      Shouldn't this event handler use some form of persistent storage?

      Michael

        The key is to reduce the overhead of processing a query. Whether or not the final persistent storage is an RDBMS is irrelevant - the primary storage should be a RAM-store of some sort. It could flush any changes to the RDBMS on a regular basis, say between requests. Doesn't matter much.

        The point is that talexb's problem arises out of two issues:

        1. The overhead of RDBMS requests
        2. The need to have a traffic cop

        I feel that my design offers a lightweight solution that addresses both issues.

        Being right, does not endow the right to be rude; politeness costs nothing.
        Being unknowing, is not the same as being stupid.
        Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
        Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Musings about a database ecology
by osunderdog (Deacon) on Dec 20, 2004 at 17:29 UTC

    The performance catch-22s:

    • #1: A server must perform well enough for it to get used, but as soon as it gets used too much, the performance begins to degrade. For new systems, you may have an idea what the bottlenecks are, but you won't know until the server exists (in some sense).
    • #2: You can throw a server together quickly, but it probably won't perform well over time. If you write a server that will perform well over time, it will be larger, more complicated and harder to maintain. (And critics will say that it should have been written in a faster language that has more design constructs. Not my words, their words...)
    • #3: It's very difficult to take something that works (albeit with lower performance) and make incremental enhancements that improve performance dramatically. Sure, you can optimize loops and such, but in my experience an order-of-magnitude performance improvement requires a design change.

    I'm in a similar position to the OP, except that I haven't written my server yet. I know for a fact that it's going to have to process more than 6 transactions per second, that each transaction involves 3-6 individual database executions, and that some of those executions take more than 2 seconds.

      A friend of mine uses the "Field of Dreams" movie as an example. The last scene shows this wonderful baseball diamond in a corn field & lots of people arriving in cars to watch baseball. What you don't see is the scene after that, where all those cars are parked willy-nilly in the corn, small mobs of people are looking for something to eat (other than corn), and, of course, more than a few are looking for a quiet spot (in the corn) to go to the bathroom.

    Sorry, no answers, just some observations.


    "Look, Shiny Things!" is not a better business strategy than compatibility and reuse.


    OSUnderdog
Re: Musings about a database ecology
by tilly (Archbishop) on Dec 20, 2004 at 18:23 UTC
    I'll second what Mutant said: if you are running into data scalability issues at 9000 items then something is wrong in your database design. Perhaps some queries are slow. Perhaps you're manually fetching and refetching in an unfortunate way. Either way the odds are very good that an appropriate index or 3 will solve your problem for a bit with no application changes.
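
    For example, a quick session with EXPLAIN ANALYZE will usually point at the offender; the connection details, table and column names below are invented, but the shape is the same for any slow lookup:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:Pg:dbname=docs', 'user', 'pass',
                               { RaiseError => 1 });

        # See what the planner is actually doing with the slow query.
        my $plan = $dbh->selectcol_arrayref(
            q{EXPLAIN ANALYZE SELECT * FROM documents WHERE status = 'pending'});
        print "$_\n" for @$plan;

        # If the plan shows "Seq Scan on documents", an index on the filtered
        # column is often all it takes:
        $dbh->do('CREATE INDEX documents_status_idx ON documents (status)');
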
Re: Musings about a database ecology
by iburrell (Chaplain) on Dec 20, 2004 at 21:30 UTC
    First, as other people have said, if your database is slowing down with 9000 documents, something is broken in your schema, your database server, or your code. Look at your database access code for queries that are too complicated, or for too many of them. Look at your schema for places to add indexes.

    Second, you shouldn't need to worry about locking when accessing the database. And if you do need locking, then use the database locks. The whole point of using a transactional database is to hide much of the locking and still get transactions and atomic updates. Which database server are you using?
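
    To make that concrete, here is one way the database's own locking can stand in for a home-grown scheme: claim a pending job inside a transaction with SELECT ... FOR UPDATE. The jobs table and its columns are made up for the example; the point is that two workers can never process the same row -- the second one blocks until the first commits, and may then come back empty-handed and simply try again.

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:Pg:dbname=docs', 'user', 'pass',
                               { RaiseError => 1, AutoCommit => 0 });

        # Lock one pending job; nobody else can claim it until we commit.
        my ($doc_id) = $dbh->selectrow_array(q{
            SELECT doc_id FROM pending_jobs
            ORDER BY queued_at
            LIMIT 1
            FOR UPDATE
        });

        if (defined $doc_id) {
            # ... process the document ...
            $dbh->do('DELETE FROM pending_jobs WHERE doc_id = ?', undef, $doc_id);
        }
        $dbh->commit;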

      I'm using PostgreSQL, and it was only after I took all the locks out six months ago that we started getting reasonable throughput .. now it seems to have bogged down again in a similar way.

      I've installed Devel::Profile, but it's having some strange interaction with Log::Log4perl that I don't quite understand, and it's dumping out of the processing loop instead of hanging around and sending off its "Hey!" messages at regular intervals.

      Alex / talexb / Toronto

      "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

        You *are* using the VACUUM and ANALYZE commands, right? Just wondering, because I felt I knew PostgreSQL after using it for years, and then realized I wasn't using these things correctly. In either case, it sounds like you are not 100% sure yet about where the bottleneck is, so you should probably check out the SQL queries that you are using, etc. Did EXPLAIN give you any info? Are your most-often-used queries using the correct indices? Would you benefit from creating a PL/pgSQL function?

        As others have hinted at, I'd suggest spending some time in a PostgreSQL forum and optimizing the database.

        Note that this may include changes to your code so it presents queries in a manner PostgreSQL likes. I don't know anything about PostgreSQL, so I can't tell you whether bind_param and the other things like that (I assume you are using DBI?) have any effect. In Oracle land, you want to make sure you don't cause the database to re-parse the query with every hit, so you use placeholders in the query text and use bind_param (I think that is the method call) to set the proper variables. Each database has different parts of the query that can be set this way.

        Other issues will be database-specific, such as indexing, the type of indexes, and computing statistics for cost-based optimization vs. rule-based optimization that would require you to rearrange your where clause, etc. Also look to PostgreSQL-specific features that may help. Oracle has a built-in queue that may be more what you want for dealing with events. Others mentioned triggers, which may or may not be able to call out to the operating system. You may also be able to (with the proper tools) profile your SQL directly in PostgreSQL to make sure your queries are not particularly heinous.

        Some Perl-ish things to consider (but I'd do all of the above first) would be to make sure you are not wasting time disconnecting and reconnecting to the database if these scripts are hitting the database every second or so (Oracle takes up to three seconds to connect and disconnect). If you have a lot of these, some sort of connection pooling may be in order, but again, that depends on how they are being initiated and on what PostgreSQL likes and doesn't.
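
        For instance, with DBI the idea is to connect once at startup, prepare once with placeholders, and then just execute with new values; the table and column names here are made up:

            use strict;
            use warnings;
            use DBI;

            # Connect once and keep the handle around, rather than
            # reconnecting on every poll.
            my $dbh = DBI->connect('dbi:Pg:dbname=docs', 'user', 'pass',
                                   { RaiseError => 1, AutoCommit => 1 });

            # Prepare once; execute many times with different values.
            my $sth = $dbh->prepare(
                'SELECT id, title FROM documents WHERE status = ?');

            for my $status ('new', 'pending') {
                $sth->execute($status);
                while (my ($id, $title) = $sth->fetchrow_array) {
                    print "$id: $title\n";
                }
            }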

        Generally speaking, home grown locking mechanisms will just hurt you in ways you won't enjoy...

        I had a closer look at the queries that I've been doing, and it turns out that the query is searching through a 180,000-row table every twenty seconds or so. It's taking 12 seconds to do that .. so now I have a much better idea of where to optimize, and a temporary table holding 'jobs to be done' is looking a lot better.

        Thanks for your feedback.

        Alex / talexb / Toronto

        "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Musings about a database ecology
by BUU (Prior) on Dec 20, 2004 at 17:18 UTC
    Sorry, but after reading all of that I'm a tad confused. You say
    I have another server with 6000 documents -- it's better now than it was with 9000 documents, but it's still too slow.
    Then you go on to talk about "process locks" and so forth. What I don't understand is, exactly *which* part is too slow? The database server? How you access the database? The CGI scripts?

      I can't be too specific about how the whole thing works, unfortunately, but basically the problem is that the system works with documents, and one system was very slow with 9000 documents; I deleted some of the documents (and their related objects) to get it down to about 6000, and the performance improved, but only marginally.

      I am considering locks so that the various processes tread on each other as little as possible. And all of the parts of the big system are too slow, considering that a smaller system with hundreds of documents is acceptably fast.

      Alex / talexb / Toronto

      "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

        Perhaps it's just a function of the vagueness required by whatever job you happen to be in, but everything you say sounds like a severe case of premature optimization. You *sound* like you have no idea what is slowing down the system, so you'll just implement some kind of locking system and hope it goes faster?! But maybe you do have a very good idea of what is going wrong and you just can't tell us, but in that case, how are we supposed to help you?

        From here I just see "Well I've got this system, and it works fine with 100 documents but it sucks with 6000 documents, so I'm going to do X", but I, at least, have no idea if X will help or not because you've told me nothing about the system.

        Or possibly I'm just confused because I don't understand enough of what you've said about what the system actually is or does.