Re: Musings about a database ecology
by Mutant (Priest) on Dec 20, 2004 at 18:05 UTC
Your first step could be (and possibly *should* be - although I can't quite tell from your description where the bottleneck lies) optimising the database and your queries. DB & query optimisation is a fine art, and I suggest consulting the documentation for your specific database. Most have fairly lengthy sections on optimisation (and on how to view / interpret query plans, etc.).
9000 'things' isn't really a hell of a lot for most half-decent DB systems (even if the 'things' span multiple tables, requiring joins). I've run plenty of systems with many more, and (after optimising) had few issues with performance.
Besides that, I'd also look at whether your hardware is up to the mark - that's a fairly easy thing to gauge: just install some good performance-statistics software.
Application changes would be my last port of call. Even then, I'd first look at individual modules, think about profiling them, and optimise code where possible (although I wouldn't spend a whole lot of time on this unless there's something really obvious). I'd also look at CPAN to see if there's something already there that's been properly optimised for what I'm trying to do (maybe even as a replacement for another CPAN module I'm already using).
Finally, I'd begin to consider the type of architectural changes you're talking about. Although these types of changes may be the most sure-fire way of improving performance (because they're completely customised to your specific problems), they're also the most likely to introduce other problems.
| [reply] |
Re: Musings about a database ecology
by dragonchild (Archbishop) on Dec 20, 2004 at 16:38 UTC
Clearly the largest one is where the Perl scripts (both CGI and administration) prod the database, asking "Got anything for me to do?"
To me, this implies that the database contains events that need processing. Design considerations aside, I get very scared when a database is used to queue up events. IMHO, much better would be a small long-running process that accepted XML-RPC/SOAP/HTTP/protocol-of-choice connections along the lines of:
- Add event (w/params)
- Check for event(s) of type(s) X (and Y and Z)
- Accept event N
- Event N complete
Then, the database (which is very expensive to talk to) is only used when you need to talk to the data (i.e., the documents). Plus, your prioritization can be done by this very lightweight daemon.
The neat thing is that you can put this on the same server as your database / application and barely feel it. A daemon like this has a very small footprint.
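A minimal sketch of the sort of daemon I mean, assuming a plain line-based text protocol rather than XML-RPC or SOAP; the commands, event fields and port number are all invented, and it handles one client at a time for brevity:

#!/usr/bin/perl
# Sketch only: a tiny in-memory event-queue daemon speaking an invented
# line-based protocol (ADD / CHECK / ACCEPT / DONE).  Blocking, one
# client at a time.
use strict;
use warnings;
use IO::Socket::INET;

my $server = IO::Socket::INET->new(
    LocalPort => 9999,              # invented port
    Listen    => 5,
    Reuse     => 1,
) or die "Can't listen: $!";

my @queue;                          # in-memory event queue
my $next_id = 1;

while ( my $client = $server->accept ) {
    while ( defined( my $line = <$client> ) ) {
        chomp $line;
        if ( $line =~ /^ADD (\S+) (.*)/ ) {               # Add event (w/params)
            push @queue, { id => $next_id, type => $1, params => $2, taken => 0 };
            print $client "OK ", $next_id++, "\n";
        }
        elsif ( $line =~ /^CHECK (\S+)/ ) {               # Check for events of type X
            my $type = $1;
            my ($ev) = grep { !$_->{taken} && $_->{type} eq $type } @queue;
            print $client $ev ? "EVENT $ev->{id} $ev->{params}\n" : "NONE\n";
        }
        elsif ( $line =~ /^ACCEPT (\d+)/ ) {              # Accept event N
            my $id = $1;
            $_->{taken} = 1 for grep { $_->{id} == $id } @queue;
            print $client "OK\n";
        }
        elsif ( $line =~ /^DONE (\d+)/ ) {                # Event N complete
            my $id = $1;
            @queue = grep { $_->{id} != $id } @queue;
            print $client "OK\n";
        }
        else {
            print $client "ERR unknown command\n";
        }
    }
    close $client;
}

A real version would want non-blocking I/O (select(), forking, or something like POE), but even this much keeps the pollers away from the database.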
Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
| [reply] |
The trouble is, if you want these events reliably queued, you need to store them in a database of some kind. It doesn't have to be the same one as your other data, and it doesn't even have to be relational, but it needs to be a system that can guarantee atomic operations, and survive a restart. You could write a whole separate daemon and store things yourself, but you'll probably end up writing your own version of a database daemon. I'd suggest just setting up a separate lightweight database, if the main one is already overloaded.
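For example, a sketch of such a separate lightweight queue, assuming SQLite via DBD::SQLite and a single consumer; the table and column names are invented:

#!/usr/bin/perl
# Sketch of a separate lightweight queue database using SQLite.  Each
# INSERT is atomic and survives a restart; the claim step assumes a
# single consumer.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=queue.db', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS events (
    id     INTEGER PRIMARY KEY AUTOINCREMENT,
    type   TEXT NOT NULL,
    params TEXT,
    status TEXT NOT NULL DEFAULT 'pending'
)
SQL

# Enqueue an event (invented type and params, for illustration).
$dbh->do( 'INSERT INTO events (type, params) VALUES (?, ?)',
    undef, 'index_document', 'doc_id=42' );

# Claim the oldest pending event inside a transaction.
$dbh->begin_work;
my ( $id, $type, $params ) = $dbh->selectrow_array(
    q{SELECT id, type, params FROM events
       WHERE status = 'pending' ORDER BY id LIMIT 1} );
$dbh->do( q{UPDATE events SET status = 'taken' WHERE id = ?}, undef, $id )
    if defined $id;
$dbh->commit;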
| [reply] |
but it needs to be a system that can guarantee atomic operations, and survive a restart
I would want such a system, but talexb hasn't indicated whether either of those is needed. Plus, you don't need atomicity if you only have one consumer. If the daemon is single-threaded, atomicity is undesired overhead.
A database of some kind, even if it's just Cache, is definitely needed at some layer. I would definitely avoid the relational overhead if it's just a queue of discrete atoms.
But, I think that saying "It needs a database" and "It must use a database as the primary queue implementation" are two separate statements. Flushing to a database is definitely important, to make sure that restarts are survivable [1]. However, I would use a Perl AoH as the queue.
[1] Of course, survivable is a spectrum. I would initially propose flushing to the datastore every minute or so, so you could have up to 60 seconds of data that could be lost. Depending on the system, that may or may not be acceptable. (In some systems, it might even be desirable, depending on your other components.)
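A sketch of what I mean, assuming Storable for the periodic flush and the 60-second interval from the footnote; the file name and event fields are invented:

#!/usr/bin/perl
# Sketch: a Perl AoH (array of hashrefs) as the queue, flushed to disk
# with Storable every 60 seconds and reloaded on restart.
use strict;
use warnings;
use Storable qw(nstore retrieve);

my $file  = 'queue.sto';                             # invented flush file
my @queue = -e $file ? @{ retrieve($file) } : ();    # reload after a restart

my $last_flush = time;

sub add_event {
    my (%event) = @_;
    push @queue, \%event;
    flush() if time - $last_flush >= 60;             # up to 60s of events at risk
}

sub next_event { shift @queue }

sub flush {
    nstore( \@queue, $file );
    $last_flush = time;
}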
Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
| [reply] |
To me, this implies that the database contains events that need processing. Design considerations aside, I get very scared when a database is used to queue up events.
Quite. One idea to speed things up was to put 'events' into a much smaller table and use triggers on the bigger tables to add things to the small table. That way I no longer scan the big table.
It's an idea that I hope to explore in the near future.
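A rough sketch of the idea, assuming a reasonably recent PostgreSQL with PL/pgSQL enabled; "documents", "pending_jobs" and the column names are placeholders for the real schema:

#!/usr/bin/perl
# Sketch: a small pending-jobs table fed by a trigger on the big table,
# so the pollers scan the small table instead of the big one.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', 'user', 'pass',
    { RaiseError => 1 } );

$dbh->do(<<'SQL');
CREATE TABLE pending_jobs (
    id     SERIAL PRIMARY KEY,
    doc_id INTEGER NOT NULL,
    added  TIMESTAMP NOT NULL DEFAULT now()
)
SQL

$dbh->do(<<'SQL');
CREATE FUNCTION queue_job() RETURNS trigger AS $$
BEGIN
    INSERT INTO pending_jobs (doc_id) VALUES (NEW.id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql
SQL

$dbh->do(<<'SQL');
CREATE TRIGGER documents_queue_job
    AFTER INSERT ON documents
    FOR EACH ROW EXECUTE PROCEDURE queue_job()
SQL

# The pollers then scan the small table instead of the big one:
my $jobs = $dbh->selectall_arrayref(
    'SELECT id, doc_id FROM pending_jobs ORDER BY id' );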
Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
| [reply] |
Shouldn't this event handler use some form of persistent storage?
Michael
| [reply] |
Re: Musings about a database ecology
by osunderdog (Deacon) on Dec 20, 2004 at 17:29 UTC
The performance catch-22s:
#1: A server must perform well enough for it to get used, but as soon as it gets used too much, the performance begins to degrade. For new systems you may have an idea what the bottlenecks are, but you won't know until the server exists (in some sense).
#2: You can throw a server together quickly, but it probably won't perform well over time. If you write a server that will perform well over time, it will be larger, more complicated and more difficult to maintain. (And critics will say that it should have been written in a faster language that has more design constructs. Not my words, their words...)
#3: It's very difficult to take something that works (albeit with lower performance) and make incremental enhancements that improve performance dramatically. Sure, you can optimize loops and such, but in my experience an order-of-magnitude performance improvement requires a design change.
I'm in a similar position to the OP, except that I haven't written my server yet. But I know for a fact that it's going to process more than 6 transactions per second, that each transaction involves 3-6 individual database executions, and that some of those executions take more than 2 seconds.
A friend of mine uses the "Field of Dreams" movie as an example. The last scene shows this wonderful baseball diamond in a corn field and lots of people arriving in cars to watch baseball. What you don't see is the scene after that, where all those cars are parked willy-nilly in the corn, small mobs of people are looking for something to eat (other than corn), and, of course, more than a few are looking for a quiet spot (in the corn) to go to the bathroom.
Sorry, no answers, just some observations.
"Look, Shiny Things!" is not a better business strategy than compatibility and reuse.
OSUnderdog
| [reply] |
Re: Musings about a database ecology
by tilly (Archbishop) on Dec 20, 2004 at 18:23 UTC
I'll second what Mutant said: if you are running into data scalability issues at 9000 items then something is wrong in your database design. Perhaps some queries are slow. Perhaps you're manually fetching and refetching in an unfortunate way. Either way the odds are very good that an appropriate index or 3 will solve your problem for a bit with no application changes.
| [reply] |
Re: Musings about a database ecology
by iburrell (Chaplain) on Dec 20, 2004 at 21:30 UTC
First, as other people have said, if your database is slowing down with 9000 documents, something is broken in your schema, your database server, or your code. Look at your database access code for queries that are overly complicated or too numerous. Look at your schema for indexes worth adding.
Second, you shouldn't need to worry about locking when accessing the database. And if you do need locking, then use the database locks. The whole point of using a transactional database is to hide much of the locking and still get transactions and atomic updates. Which database server are you using?
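For example, instead of an application-level lock, an atomic UPDATE can claim a row on any transactional database (a sketch; the table and column names are invented):

#!/usr/bin/perl
# Sketch: claiming a row without an application-level lock.  The UPDATE
# is atomic, so only one worker can flip a given row from 'pending' to
# 'taken'.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', 'user', 'pass',
    { RaiseError => 1 } );

my $rows = $dbh->do(
    q{UPDATE documents SET status = 'taken'
       WHERE id = ? AND status = 'pending'},
    undef, 42 );                                  # 42 is a made-up document id

if ( $rows && $rows > 0 ) {
    # we won the race; process document 42
}
else {
    # someone else got there first
}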
| [reply] |
I'm using PostgreSQL, and it was only after I took all the locks out six months ago that we started getting reasonable throughput .. now it seems to have bogged down again in a similar way.
I've installed Devel::Profile, but it's having some strange interaction with Log::Log4perl that I don't quite understand, and it's dumping out of the processing loop instead of hanging around and sending off its "Hey!" messages at regular intervals.
Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
| [reply] |
You *are* using the VACUUM and ANALYZE commands, right? Just wondering, because I felt I knew PostgreSQL after using it for years, and then realized I wasn't using these things correctly. In any case, it sounds like you are not 100% sure yet about where the bottleneck is, so you should probably check out the SQL queries that you are using, etc. Did EXPLAIN give you any info? Are your most-often-used queries using the correct indices? Would you benefit from creating a PL/pgSQL function?
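Both are easy to check straight from DBI (a sketch; the table and query are invented, and note that VACUUM can't run inside a transaction, so AutoCommit has to be on):

#!/usr/bin/perl
# Sketch: run VACUUM ANALYZE and ask the planner for a query plan.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', 'user', 'pass',
    { RaiseError => 1, AutoCommit => 1 } );

$dbh->do('VACUUM ANALYZE documents');     # reclaim space, refresh statistics

# Ask the planner what it intends to do with a frequently used query.
my $plan = $dbh->selectall_arrayref(
    q{EXPLAIN SELECT * FROM documents WHERE status = 'pending'} );
print "$_->[0]\n" for @$plan;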
| [reply] |
As others have hinted at, I'd suggest spending some time in a PostgreSQL forum and optimizing the database.
Note, this may include changes to your code so it presents the queries in a manner PostgreSQL likes. I don't know anything about PostgreSQL, so I can't tell you whether bind_param and the other things like that (I assume you are using DBI?) have any effect. In Oracle land, you want to make sure you don't cause the database to re-parse the query with every hit, so you use placeholders in the query text and use bind_param (I think that is the method call) to set the proper variables. Each database has different parts of the query that can be set in this way.
Other issues would also be database-specific, such as indexing, the type of indexes, and computing statistics for cost-based optimization (vs. rule-based optimization, which would require you to rearrange your where clause), etc. Also look at PostgreSQL-specific features that may help. Oracle has a built-in queue that may be more what you want for dealing with events. Others mentioned triggers, which may or may not be able to call out to the operating system. You may also be able to (with the proper tools) profile your SQL directly in PostgreSQL to make sure your queries are not particularly heinous.
Some Perlish things to consider (but I'd do all of the above first) would be to make sure you are not wasting time disconnecting and reconnecting to the database if these things are hitting it every second or so (Oracle takes up to three seconds to connect and disconnect). If you have a lot of these, some sort of connection pooling may be in order, but again, that depends on how they are being initiated and what PostgreSQL likes and doesn't like.
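A sketch of the DBI side of that (assuming DBD::Pg; the table and column names are invented) - prepare once with a placeholder, then reuse both the connection and the statement handle across polls:

#!/usr/bin/perl
# Sketch: connect once, prepare once, execute many times.
use strict;
use warnings;
use DBI;

# Connect once, outside the polling loop, instead of on every pass.
my $dbh = DBI->connect( 'dbi:Pg:dbname=mydb', 'user', 'pass',
    { RaiseError => 1 } );

# Prepare once with a placeholder; execute many times with bind values,
# so the database doesn't re-parse the query on every poll.
my $sth = $dbh->prepare(
    'SELECT id, doc_id FROM pending_jobs WHERE status = ?' );

while (1) {
    $sth->execute('pending');
    while ( my ( $id, $doc_id ) = $sth->fetchrow_array ) {
        # ... process the job ...
    }
    sleep 20;
}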
Generally speaking, home grown locking mechanisms will just hurt you in ways you won't enjoy...
| [reply] |
I had a closer look at the queries that I've been doing, and it turns out that one query is searching through a 180,000-row table every twenty seconds or so, and taking 12 seconds to do it .. so now I have a much better idea of where to optimize, and a temporary table holding 'jobs to be done' is looking a lot better.
Thanks for your feedback.
Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
| [reply] |
Re: Musings about a database ecology
by BUU (Prior) on Dec 20, 2004 at 17:18 UTC
Sorry, but after reading all of that I'm a tad confused. You say "I have another server with 6000 documents -- it's better now than it was with 9000 documents, but it's still too slow."
Then you go on to talk about "process locks" and so forth. What I don't understand is, exactly *which* part is too slow? The database server? How you access the database? The CGI scripts?
| [reply] |
I can't be too specific about how the whole thing works, unfortunately, but basically the problem is that the system works with documents, and one system was very slow with 9000 documents; I deleted some of the documents (and their related objects) to get it down to about 6000, and the performance improved, but only marginally.
I am considering locks so that the various processes tread on each other as little as possible. And all of the parts of the big system are too slow, considering that a smaller system with hundreds of documents is acceptably fast.
Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
| [reply] |
Perhaps it's just a function of the vagueness required by whatever job you happen to be in, but everything you say sounds like a severe case of premature optimization. You *sound* like you have no idea what is slowing down the system, so you'll just implement some kind of locking system to hopefully make it go faster?! But maybe you do have a very good idea of what is going wrong and you just can't tell us, in which case, how are we supposed to help you?
From here I just see "Well I've got this system, and it works fine with 100 documents but it sucks with 6000 documents, so I'm going to do X", but I, at least, have no idea if X will help or not because you've told me nothing about the system.
Or possibly I'm just confused because I don't understand enough of what you've said about what the system actually is or does.
| [reply] |