Re: Benchmarking A DB-Intensive Script
by rnahi (Curate) on Mar 14, 2006 at 07:37 UTC
|
Using Devel::DProf and dprofpp, which are part of recent Perl distributions, you can find out where your code is slow. Regarding the database issues, DBI::Profile can show you which DBI methods are taking too long (you can look at Speeding up the DBI for suitable examples and tips).
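For example, DBI::Profile can be switched on without touching much code; a minimal sketch, where the script name and connection details are placeholders for your own:
# enable profiling for an entire run from the shell:
#   DBI_PROFILE=2 perl your_script.pl
# or turn it on for one handle inside the script:
use DBI;
my $dbh = DBI->connect($dsn, $user, $password, { RaiseError => 1 });  # your usual connection
$dbh->{Profile} = 2;   # per-statement timings, reported automatically at program exit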
| [reply] |
|
|
And when you use Devel::DProf, tell it to report wall time, so you can see the time spent waiting for the database. Otherwise, it just measures CPU, and you get a strange picture in a database-heavy script.
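With the stock tools that usually looks something like this (the script name is a placeholder):
perl -d:DProf your_script.pl   # writes tmon.out
dprofpp -r tmon.out            # -r reports elapsed (wall-clock) time rather than CPU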
| [reply] |
Re: Benchmarking A DB-Intensive Script
by tilly (Archbishop) on Mar 14, 2006 at 18:00 UTC
|
My general strategy for problems like this goes as follows. First, add some stupid tracing to get an idea whether it is going fast enough. If it is not, then do a quick back of the envelope calculation about what it is doing, and look for stupidly obvious improvements that are easy to do. Only if that doesn't find enough will I try to trace down performance problems.
So you expect to run for 100 million iterations and want to finish in a few weeks. OK, there are 60*60*24*7*4 = 2,419,200 seconds in 4 weeks. That means that 10,000 iterations should take about 240 seconds, or about 4 minutes. So add a message that prints every 1,000 iterations and says how many iterations it has done in how many minutes. For your program, you'll also want it to say how many have a min(property-pair score) above your threshold. Start running it. If it is going fast enough, leave it be. If it isn't going fast enough, you know you have a performance problem. (Algorithm check: check for quadratic inefficiencies. 70,000 elements means 70_000*69_999/2 pairs of elements, or about 2.45 billion. Since you're only looking for 100 million or so pairs, your grab-a-pair/throw-it-away algorithm won't slow you down more than a few percent by the end. Unless, of course, something else slows down over time. Like your check for whether you've done a pair. That's one good reason for having regular progress reports. Often programs start fine and then progressively get slower due to a subtle algorithm problem. A benchmark of 10,000 elements won't find that out. It really sucks to write a program that you think will take a week to run, and then be faced a month later with the decision about whether it is better to let it continue running, or to kill it and try to analyze why it is taking so long, without having any idea how close it is...)
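A progress line like that is only a few lines of Perl; this is just a sketch, and next_random_pair(), pair_score() and $threshold stand in for whatever the real loop uses:
use Time::HiRes qw(time);
my $t0 = time;
my ($done, $above) = (0, 0);
while (my ($a, $b) = next_random_pair()) {          # hypothetical pair generator
    $above++ if pair_score($a, $b) > $threshold;    # hypothetical scoring step
    $done++;
    printf "%d pairs done, %d above threshold, %.1f minutes elapsed\n",
        $done, $above, (time - $t0) / 60
        unless $done % 1000;
    last if $done >= 100_000_000;
}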
Now suppose you run that, and it is not fast enough. (Random note, you're almost definitely going to be fast enough. 10,000 iterations in 4 minutes is about 40/second. From my workstation I can fetch from my production database at about 40 times that speed - and that production database is in a nearby city.) Well first do a sanity check - glance at top. If your program is taking up a lot of CPU, then there is a bottleneck in how the program is coded. If it isn't, then the bottleneck is elsewhere - probably database access in your case.
Let's assume that database access is the issue and do some quick back of the envelope estimates. With 70,000 elements looking for 100,000,000 pairs you're going to fetch each element's properties an average of 1430 times or so. If you have more than 49 elements that you know you're uninterested in (e.g. no properties), then an initial pass to get rid of them pays for itself. Furthermore, with so many accesses per element, speeding that up certainly makes sense - even if it does require substantial preprocessing.
Some options for speeding that up are as follows. The simplest is to move the program to the database machine, or failing that then close to it on the network. That reduces latency substantially. Even better is to fit it in memory if you can. (Advice from experience - do NOT try to fit the "pairs done" list in memory. It won't fit.) Otherwise you need to store stuff in your local filesystem. My inclination if you're writing from scratch is to use Berkeley DB (either through DB_File or BerkeleyDB) for simplicity. However if you have the program already written, then DBD::SQLite is likely to be a good speedup and will be an easy port. (You still have to write the code to take your existing database and port it locally.) Or create a local MySQL mirror. Same reasoning.
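As a rough sketch of the Berkeley DB route - the table, column and connection details below are all invented, so substitute the real ones - you do one pass over the remote database and then read every property lookup from a local tied hash:
use DB_File;
use Fcntl;
use DBI;

tie my %props, 'DB_File', 'properties.db', O_CREAT|O_RDWR, 0644, $DB_HASH
    or die "Cannot open properties.db: $!";

my ($dsn, $user, $password) = ('dbi:...', '...', '...');   # your usual connection details
my $dbh = DBI->connect($dsn, $user, $password, { RaiseError => 1 });

my $sth = $dbh->prepare('SELECT element_id, property FROM element_properties');
$sth->execute;
while (my ($id, $property) = $sth->fetchrow_array) {
    $props{$id} = defined $props{$id} ? "$props{$id},$property" : $property;
}

# later, in the main loop:  my @properties = split /,/, ($props{$element_id} || '');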
After you do that, benchmark it again. If you're still not getting 10,000 iterations to happen within your desired time frame, then I'd suggest breaking out a profiler. But not until then.
It is a reward/risk issue. If there are easy changes to make that are very likely to make substantial performance improvements, it makes little sense to spend energy verifying that - just make the changes and run it again. You probably helped, and if you didn't, you didn't waste much energy. However if the changes that you think will help are hard to do, then profile it to be sure that it is worth doing them. Also if it isn't obvious what needs to be improved, then profile your program rather than guessing.
Also note that a profiler is no substitute for a brain. Suppose that your profiler tells you that you're spending all of your time doing step X. It takes a brain to figure out whether you're better off trying to optimize step X, or are better trying to avoid doing step X so often. (Hint, it is usually simpler to try to speed up step X, but you typically see far bigger wins from finding ways to avoid doing step X.)
Furthermore note that with the amount of data you're interested in processing and the amount of time you're willing to spend, you're very unlikely to have a performance problem - unless you have a quadratic slowdown in processing, which is unlikely to show up until you've gone through hundreds of thousands, or millions, of elements. In fact I'd frankly be more worried about your "done that pair" check slowing down than I would about your database being too slow.
UPDATE: D'oh. I wrote this whole thing and forgot to tell you to check the obvious stuff. If you've run the code and found it is too slow despite your fairly modest performance needs, and your CPU usage is low, then you probably are missing some indexes on fields. Fetching everything at once and processing it locally with Berkeley DB would keep you from needing to learn about that. Otherwise talk to someone who understands databases, figure out what indexes you need, add them, and things will get a lot faster.
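If indexes do turn out to be the problem, adding one is a single statement through DBI; the table and column names here are made up, so use whatever the properties actually live in:
# hypothetical: index the column you keep looking elements up by
$dbh->do('CREATE INDEX element_props_idx ON element_properties (element_id)');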
Basically an index allows the database to find a particular record in a table quickly, rather than having to scan the whole table each time. It is the algorithmic difference between using a hash and scanning an array. | [reply] |
|
|
Hi tilly,
Thank you: this is really sound advice. In a way this is common-sense, but it was common-sense I hadn't been applying.
I first quantified just "how much too slow" the overall program is. It's averaging about 75 seconds per 1000 records, but I'm aiming for 24s per 1000, so that gave me a target.
Next, I looked at CPU usage, and indeed it *is* climbing quite slowly -- indicating a DB or network problem. I checked for indices, and all seemed okay, so I simply copied the relevant DB tables to the machine running this analysis. BOOM: averaging 49 seconds per 1000.
So big-picture thinking got me halfway there. I'm now profiling to see just where the code is slow, and whether it's worth trying to load the property-tables into memory.
Good solid advice, thanks. ++tilly.
| [reply] |
|
|
You're close enough that understanding the big picture may get you the rest of the way there as well. :-)
Try running two copies of the script at the same time. See how fast they go. Then three. Then four. Find the point where you don't run faster by running more copies.
Each one will write the examples it finds to its own file. It is trivial to go back afterwards and remove duplicates from those files. (Sort each pair in each line alphabetically, so that a given pair always appears as an identical line. Then do a sort -u to find and remove the duplicates.)
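Assuming one whitespace-separated pair per line, that normalize-then-dedupe step can be a one-liner (the file names are placeholders):
perl -lane 'print join "\t", sort @F' pairs.*.txt | sort -u > unique_pairs.txt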
| [reply] |
Re: Benchmarking A DB-Intensive Script
by idle (Friar) on Mar 14, 2006 at 07:32 UTC
|
use Time::HiRes qw(gettimeofday);
my $start = gettimeofday;
# ... first part of code ...
my $next = gettimeofday;
printf "time spent %.4f\n", $next - $start;
# ... next part of code ...
my $next_1 = gettimeofday;
printf "time spent %.4f\n", $next_1 - $next;
to see exactly where it is slow. | [reply] [d/l] |
|
|
Hi, I was thinking of something like this using Benchmark as well, but then how do I add the values across many (say 10k) iterations of the loop? I didn't see a way to do that with either Benchmark or Time::HiRes -- did I miss it?
| [reply] |
|
|
Take a look at Benchmark::Timer. You can wrap a start/stop pairing around a piece of code and it accumulates the iterations and averages the times taken for you. You can also run several different and overlapping pairs concurrently, which makes it easy to start with a coarse granularity and then home in on those parts where you need finer granularity.
It allows you to get finer granularity than one of the per-subroutine profilers without resorting to the time-expensive per-line profilers.
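A minimal sketch of that accumulation, with the tag name and the loop body as placeholders (this follows the module's synopsis):
use Benchmark::Timer;
my $t = Benchmark::Timer->new;
for (1 .. 10_000) {
    $t->start('db lookup');
    # ... fetch the properties for one pair here ...
    $t->stop('db lookup');
}
print $t->report;   # count, total and mean time for each tag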
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
Re: Benchmarking A DB-Intensive Script
by thundergnat (Deacon) on Mar 14, 2006 at 12:12 UTC
|
70k items is not that large for an array. Depending on how many properties are attached, you probably could hold the whole thing in memory at once. It would be worth trying, since that will likely speed up your processing by orders of magnitude. Filter out any element with no properties as you load the array.
Don't just select pairs at random and then check to see if they've already been done. That gets very inefficient after you've checked half of the array. Iterate over it instead and only check each pair once. Select the elements with a set of nested loops.
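A sketch of the nested-loop iteration, assuming the properties have already been loaded into a hypothetical %props hash (element id mapped to an array ref of properties, empties already filtered out):
my @ids = sort keys %props;
for my $i (0 .. $#ids - 1) {
    for my $j ($i + 1 .. $#ids) {
        my ($a, $b) = @ids[$i, $j];
        # ... score the pair ($a, $b) using $props{$a} and $props{$b} ...
    }
}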
| [reply] |
|
|
Hi thundergnat,
Right, but I am looking for pairs from the 70k array, so that leaves me 70k x 70k / 2 = 2.4 billion pairs, of which I probably only need about 100 million (about 4%) so I really do need to get them randomly.
But you're definitely right that holding the property-lists in memory is the obvious place to start tuning. Just wanted to figure out first if that's truly where the script is spending all its time!
| [reply] |
|
|
I don't think thundergnat is suggesting you shouldn't select random items, but that you not re-use the same random pool each time.
If you select random keys from a hash, then delete the keys you pick, you'll progressively reduce your options and never hit a repeat. It doesn't seem significant to your use that you allow for re-selection of previously examined items, so remove them from the pool of future options. Otherwise it will become harder and take longer to find good pairs, instead of quicker.
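One minimal way to draw without replacement, as a sketch - @pool and the examination step are stand-ins for the real data and work:
my @pool = keys %props;                          # the candidate elements
while (@pool >= 2) {
    my $a = splice @pool, int rand @pool, 1;     # picked, and removed from future draws
    my $b = splice @pool, int rand @pool, 1;
    # ... examine the pair ($a, $b) ...
}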
| [reply] |
Out on a limb..
by vhold (Beadle) on Mar 14, 2006 at 21:50 UTC
|
While it would definitely speed things up to cut out the DB lookups and cache those properties.. overall it sounds like the real problem is the algorithm from a high level perspective. (which.. isn't what you are asking about at all, but maybe you aren't seeing the forest for the trees)
If the only reason you select the elements at random is so that you can get a subset of the whole, why can't you sort the 70k item array by these property things somehow and check pairs in a sequence that is more likely to create pairs above your threshold?
It's difficult because your description of the problem doesn't go into specifics of what these properties are or how many of them there are, etc. But a real cheapo but potentially effective approach might be something along these lines:
Iterate your property-pair score hash and create a hash 'property score', keyed by property only. If the property-pair score is above your threshold, add that to each affiliated property score. If it's below, subtract from them.
This is very arbitrary and only vaguely sensical, but the idea is that doing almost -anything- is likely going to be better than random. You might want to apply tunable coefficients that affect the weight of the add and subtract operations independently.
Then iterate your 70k elements, assigning them a score which is the sum of all its properties' scores. Sort by that score, and iterate highest scores first.
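A rough Perl sketch of that heuristic - %pair_score (a "propA|propB" key mapped to its score), %element_props (element mapped to an array ref of its properties) and $threshold are all invented names here:
my %prop_score;
while (my ($pair, $score) = each %pair_score) {
    my ($p1, $p2) = split /\|/, $pair;
    my $delta = $score >= $threshold ? 1 : -1;   # tunable weights could replace the +/-1
    $prop_score{$p1} += $delta;
    $prop_score{$p2} += $delta;
}

my %elem_score;
for my $elem (keys %element_props) {
    $elem_score{$elem} += $prop_score{$_} || 0 for @{ $element_props{$elem} };
}

# try the highest-scoring elements first
my @ordered = sort { $elem_score{$b} <=> $elem_score{$a} } keys %elem_score;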
Since the whole point isn't to provide the absolute best scores, you can tweak and weight the various scores in whatever way you choose depending on what this data is exactly.
This is all based on the potentially wrong assumption that elements whose individual properties show up more in above-threshold pairs, and less in below-threshold pairs, are more likely to form pairs that pass your criteria. For all I know, all the property-pairs perfectly balance each other somehow and your property score ends up being 0 across the board, in which case this approach would have zero effect. This is the sort of thing where knowledge of the data is key. | [reply] |
Re: Benchmarking A DB-Intensive Script
by salva (Canon) on Mar 20, 2006 at 10:50 UTC
|
Can't you just write an SQL query to get the list of pairs meeting your criteria from the database? It doesn't seem impossible from your specification.
Alternatively, you could write a query to get a subset of promising pairs, and then check whether those meet your criteria with a Perl script. | [reply] |
Re: Benchmarking A DB-Intensive Script
by TedPride (Priest) on Mar 20, 2006 at 06:45 UTC
|
I must be missing something here, but why aren't you just storing all the properties for each element in memory, rather than querying the DB each time? Assuming you can allow 100MB of memory for this, and you have 70K elements, that's 1462 bytes per element, which should cover all your property needs. If it doesn't, allocate and/or buy more memory. This is by far the biggest time sinkhole right here.
Secondly, you should be able to generate the lowest "property pair" values by comparing your largest values with your smallest values, rather than doing a random selection. If you still want random output, exclude all pairs that will fall below your threshold (let's say you want 7, and item 1 has a value of 4, so anything with a value under 3 can be ignored), then choose from the resulting pool of pairs.
Anything else will just be polishing the finish. | [reply] |