in reply to Benchmarking A DB-Intensive Script
So you expect to run for 100 million iterations and want to finish in a few weeks. OK, there are 60*60*24*7*4 = 2,419,200 seconds in 4 weeks. That means that 10,000 iterations should take about 240 seconds, or about 4 minutes. So add a message that prints every 1,000 iterations, saying how many iterations it has done and how many minutes it has taken. For your program, you'll also want it to say how many pairs have a min(property-pair score) above your threshold. Start running it. If it is going fast enough, leave it be. If it isn't going fast enough, you know you have a performance problem.

(Algorithm check: look for quadratic inefficiencies. 70,000 elements means 70_000*69_999/2 pairs of elements, or about 2.45 billion. Since you're only looking for 100 million or so pairs, your grab-a-pair/throw-it-away algorithm won't slow you down by more than a few percent by the end. Unless, of course, something else slows down over time - like your check for whether you've already done a pair. That's one good reason for having regular progress reports. Often programs start fine and then progressively get slower due to a subtle algorithmic problem, and a benchmark of 10,000 iterations won't catch that. It really sucks to write a program that you think will take a week to run, and then be faced a month later with the decision of whether to let it keep running, or to kill it and try to analyze why it took a month, without any idea of how close it is to finishing...)
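To make that concrete, here is a rough sketch of what I mean by a progress message. The helper subs and the threshold are made up stand-ins - substitute whatever your script actually does:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $THRESHOLD = 0.9;    # placeholder - use your real cutoff
    my $start     = time();
    my $above     = 0;

    for my $i (1 .. 100_000_000) {
        # fetch_random_pair() and score_pair() stand in for your real code
        my ($elem_a, $elem_b) = fetch_random_pair();
        my $score = score_pair($elem_a, $elem_b);
        $above++ if $score > $THRESHOLD;

        unless ($i % 1_000) {
            my $minutes = (time() - $start) / 60;
            printf "%d iterations in %.1f minutes, %d above threshold\n",
                   $i, $minutes, $above;
        }
    }

    # stubs so the sketch compiles - replace with your database logic
    sub fetch_random_pair { return (int rand 70_000, int rand 70_000) }
    sub score_pair        { return rand }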
Now suppose you run that, and it is not fast enough. (Random note: you're almost certainly going to be fast enough. 10,000 iterations in 4 minutes is only about 40 per second. From my workstation I can fetch from my production database at about 40 times that speed - and that production database is in a nearby city.) Well, first do a sanity check - glance at top. If your program is taking up a lot of CPU, then the bottleneck is in how the program is coded. If it isn't, then the bottleneck is elsewhere - probably database access in your case.
Let's assume that database access is the issue and do some quick back-of-the-envelope estimates. With 70,000 elements and 100,000,000 pairs to look at, you're going to fetch each element's properties an average of about 1,430 times. A preliminary pass over all the elements costs about 70,000 fetches, and each useless element you drop saves about 1,430 of them, so if there are more than 70,000/1,430, i.e. about 49, elements that you already know you're uninterested in (e.g. they have no properties), that pass pays for itself. Furthermore, with so many accesses per element, speeding up that fetch certainly makes sense - even if it requires substantial preprocessing.
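If your schema is anything like a properties table keyed by element id (that's a guess on my part - adjust the table and column names, and $dbh is your already-connected DBI handle), the weeding-out pass can be a single query:

    # Grab only the element ids that actually have properties, and work
    # from that list for the rest of the run.
    my $interesting = $dbh->selectcol_arrayref(q{
        SELECT e.id
        FROM   elements e
        WHERE  EXISTS (SELECT 1 FROM properties p WHERE p.element_id = e.id)
    });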
Some options for speeding up those fetches are as follows. The simplest is to move the program to the database machine or, failing that, close to it on the network. That reduces latency substantially. Even better is to fit the data in memory if you can. (Advice from experience - do NOT try to fit the "pairs done" list in memory. It won't fit.) Otherwise you need to store the data in your local filesystem. My inclination, if you're writing from scratch, is to use Berkeley DB (through either DB_File or BerkeleyDB) for simplicity. However if you have the program already written, then DBD::SQLite is likely to be a good speedup and an easy port. (You still have to write the code to take your existing database and copy it locally.) Or create a local MySQL mirror - the same reasoning applies.
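As a sketch of the Berkeley DB route (again, the table and column names are guesses, and $dbh is your remote DBI handle - the point is the shape of the preprocessing pass):

    use DB_File;
    use Fcntl;
    use Storable qw(freeze thaw);

    # One-time preprocessing: copy every element's properties out of the
    # remote database into a local Berkeley DB file keyed by element id.
    tie my %local_props, 'DB_File', 'properties.db', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "Cannot open properties.db: $!";

    my $sth = $dbh->prepare('SELECT element_id, name, value FROM properties');
    $sth->execute;
    while (my ($id, $name, $value) = $sth->fetchrow_array) {
        my $props = $local_props{$id} ? thaw($local_props{$id}) : {};
        $props->{$name} = $value;
        $local_props{$id} = freeze($props);
    }

    # Later, in the main loop, a lookup is just:
    #   my $props = thaw($local_props{$id});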
After you do that, benchmark it again. If you're still not getting 10,000 iterations to happen within your desired time frame, then I'd suggest breaking out a profiler. But not until then.
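For the record, the stock profiler that ships with Perl is Devel::DProf; a minimal run looks like this (yourscript.pl being a stand-in for your program):

    perl -d:DProf yourscript.pl
    dprofpp tmon.out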
It is a reward/risk issue. If there are easy changes that are very likely to give substantial performance improvements, it makes little sense to spend energy verifying that in advance - just make the changes and run it again. You probably helped, and if you didn't, you didn't waste much energy. However, if the changes that you think will help are hard to make, then profile first to be sure they're worth doing. Also, if it isn't obvious what needs to be improved, profile your program rather than guessing.
Also note that a profiler is no substitute for a brain. Suppose that your profiler tells you that you're spending all of your time doing step X. It takes a brain to figure out whether you're better off trying to optimize step X, or better off trying to avoid doing step X so often. (Hint: it is usually simpler to speed up step X, but you typically see far bigger wins from finding ways to avoid doing step X.)
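A trivial example of the "avoid doing it" approach, if step X turned out to be the property fetch: cache each element's properties in memory the first time you see it (70,000 elements' worth of properties should fit fine - it's only the pairs-done list that won't). fetch_properties_from_db() is a made-up name for whatever your fetch code is:

    my %prop_cache;

    sub properties_for {
        my ($id) = @_;
        # Only hit the database the first time we see this element.
        $prop_cache{$id} = fetch_properties_from_db($id)
            unless exists $prop_cache{$id};
        return $prop_cache{$id};
    }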
Furthermore, note that with the amount of data you're interested in processing and the amount of time you're willing to spend, you're very unlikely to have a performance problem - unless you have a quadratic slowdown in processing, which is unlikely to show up until you've gone through hundreds of thousands, or millions, of iterations. In fact I'd frankly be more worried about your "done that pair" check slowing down than about your database being too slow.
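The way to keep that check from becoming the quadratic slowdown is to make each lookup a constant-time operation rather than a scan of an ever-growing list - for instance, a Berkeley DB hash on disk keyed by the pair (on disk because, as I said above, the pairs-done list won't fit in memory):

    use DB_File;
    use Fcntl;

    tie my %pair_done, 'DB_File', 'pairs_done.db', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "Cannot open pairs_done.db: $!";

    # Returns true if we've seen this pair before, and records it if not.
    sub already_done {
        my ($x, $y) = sort { $a <=> $b } @_;   # normalize the pair order
        my $key = "$x,$y";
        return 1 if exists $pair_done{$key};
        $pair_done{$key} = 1;
        return 0;
    }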
UPDATE: D'oh. I wrote this whole thing and forgot to tell you to check the obvious stuff. If you've run the code, found it too slow despite your fairly modest performance needs, and your CPU usage is low, then you are probably missing indexes on some fields. Fetching everything at once and processing it locally with Berkeley DB would keep you from needing to learn about that. Otherwise talk to someone who understands databases, figure out what indexes you need, add them, and things will get a lot faster.
Basically an index allows the database to find a particular record in a table quickly, rather than having to scan the whole table each time. It is the algorithmic difference between using a hash and scanning an array.
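Concretely, if the slow part is looking up properties by element id, the fix is a one-liner (again assuming my guessed table and column names):

    # Run once against the database; after this, lookups by element_id
    # become index probes instead of full table scans.
    $dbh->do('CREATE INDEX properties_element_id ON properties (element_id)');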