in reply to How best to optimize shared data

We've got an existing system which loads data by doing individual SQL queries for each record read, and caching results. We have multiple processors on our HP/UX system, so we take advantage of this by running multiple copies of our program, but the results are still far too slow.

Your problem has nothing to do with configuration data or shared memory or any crap like that. Have you benchmarked where your bottlenecks are? It doesn't sound like you have.

The good news is that I don't have to - your bottlenecks are in the SQL reads and writes. I will bet that you don't have good indices, that your loads are updating indices every time, and that you can increase your throughput 100-fold if you had a DBA consultant with 10 years' experience come in for 2 weeks and audit your system.

A few items for you to look at:

The overarching theme is Know your tools. It doesn't sound like you really understand them.

Being right, does not endow the right to be rude; politeness costs nothing.
Being unknowing, is not the same as being stupid.
Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Replies are listed 'Best First'.
Re^2: How best to optimize shared data
by Anonymous Monk on Feb 04, 2005 at 17:24 UTC
    Your problem has nothing to do with configuration data or shared memory or any crap like that. Have you benchmarked where your bottlenecks are? It doesn't sound like you have.

    Yes, I've profiled the code. The bottlenecks are in the I/O. No surprise there. The SQL queries are right at the top, even before the file I/O.
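
    (A run of the stock profiler along these lines is enough to show it; the script name below is just a stand-in.)

        # Hypothetical invocation of the stock Devel::DProf profiler:
        perl -d:DProf loader.pl    # writes tmon.out in the current directory
        dprofpp                    # prints the per-subroutine report from tmon.out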

    The good news is that I don't have to - your bottlenecks are in the SQL reads and writes.

    Actually, they're exclusively in the SQL reads. We don't have any SQL writes. If we did, I wouldn't have specified that all the configuration data was read only, now would I?

    I will bet that you don't have good indices, that your loads are updating indices every time, and that you can increase your throughput 100-fold if you had a DBA consultant with 10 years' experience come in for 2 weeks and audit your system.

    Well, we have indexes on all our key fields, and we don't do any writes. We don't load data into tables, so it's unlikely that the nonexistent loads are updating the table indexes. It might be possible that a DBA with 10 years' experience could give a 100-fold improvement to our database, but I confess that I don't see how. I have neither the authority nor the budget to hire a DBA, so it's a moot point.

    I can get a 100-fold improvement by moving code out of the SQL tables, though. Profiling lookups on one of our tables using the existing code gives 670 lookups/second with the (poorly written, IMHO) database solution, versus 56,818/s for a direct perl hash lookup. That's a 100-fold improvement, and it's one that I can implement.
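
    (A Benchmark run along the lines of the rough sketch below is enough to reproduce that kind of comparison; the DSN, table, column, and key names are made up.)

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);
        use DBI;

        # Hypothetical DSN, table, and column names - this only shows the shape of the test.
        my $dbh = DBI->connect( 'dbi:Informix:config_db', '', '', { RaiseError => 1 } );
        my $sth = $dbh->prepare('SELECT value FROM config WHERE name = ?');

        # Preload the same data into a plain hash for the comparison.
        my %config = map { $_->[0] => $_->[1] }
            @{ $dbh->selectall_arrayref('SELECT name, value FROM config') };

        my $key = 'some_config_key';    # hypothetical key

        cmpthese( -3, {
            sql  => sub { $sth->execute($key); my ($v) = $sth->fetchrow_array; $sth->finish },
            hash => sub { my $v = $config{$key} },
        } );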

    The overarching theme is Know your tools. It doesn't sound like you really understand them.

    Does Informix have a secret "go fast" switch I can flip? Failing that, I'm not sure what else to try (but I'm certainly open to specific ideas). That's why I went with my own system, having learned in school that relational databases are good for many reasons, but maximum speed isn't one of them.

    The simplest approach is just to throw everything into memory in a perl hash-like structure. This works, but load time becomes a factor, and we hit the per-process memory limit. Changing this would require a recompile of the HP/UX kernel, which doesn't sound like an option.
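
    (By "throw everything into memory" I mean nothing more clever than a preload along these lines; the DSN, table, and column names are again hypothetical.)

        use strict;
        use warnings;
        use DBI;

        # Brute-force preload: every row ends up in one big in-process hash.
        # DSN, table, and column names are hypothetical.
        my $dbh = DBI->connect( 'dbi:Informix:config_db', '', '', { RaiseError => 1 } );

        my %config;
        my $sth = $dbh->prepare('SELECT name, value FROM config');
        $sth->execute;
        while ( my ( $name, $value ) = $sth->fetchrow_array ) {
            $config{$name} = $value;
        }
        $sth->finish;

        # Lookups are now just $config{$name} - fast, but the whole data set
        # is duplicated in every running copy of the program.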

    So, I need to balance total memory use with lookup speed. Perl hashes on my system seem to expand at something like 40:1 memory usage: I've seen a 7MB flat text file turn into nearly a 100MB perl hash. With some smarter data management, I can get that down to a 5MB binary file including indexes. Reading from it with pure perl is still ten times faster than the SQL solution, and that's before I start to play with XS.
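
    (By "smarter data management" I mean something in the spirit of this sketch: fixed-width records sorted by key, searched with seek() and a binary chop. The file name and record layout are made up for illustration.)

        use strict;
        use warnings;
        use Fcntl qw(SEEK_SET);

        # Sketch only: 128-byte fixed-width records ("A32 A96" = 32-byte key,
        # 96-byte value), pre-sorted by key, searched with a binary chop.
        my $REC_LEN = 128;
        my $FILE    = 'config.dat';    # hypothetical file name

        sub lookup {
            my ($key) = @_;
            open my $fh, '<', $FILE or die "open $FILE: $!";
            binmode $fh;
            my $nrecs = ( -s $fh ) / $REC_LEN;

            my ( $lo, $hi ) = ( 0, $nrecs - 1 );
            while ( $lo <= $hi ) {
                my $mid = int( ( $lo + $hi ) / 2 );
                seek $fh, $mid * $REC_LEN, SEEK_SET or die "seek: $!";
                read $fh, my $rec, $REC_LEN or die "read: $!";
                my ( $k, $v ) = unpack 'A32 A96', $rec;
                if    ( $k lt $key ) { $lo = $mid + 1 }
                elsif ( $k gt $key ) { $hi = $mid - 1 }
                else                 { return $v }
            }
            return;    # not found
        }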

    This solution seems do-able. Shared memory may not be the way to go, but do you see any other solutions? -- Ytrew

      I was about to recommend looking at Cache::SharedMemoryCache - and then I read its documentation. It says to use a filesystem cache instead, because the speed difference is trivial. That tells me that simply bumping up your database's buffers may solve your speed problems. Instead of throwing memory at a shared memory cache, throw that memory at the database, and see if that helps.
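
      For reference, the Cache::Cache interface I had in mind is only a few lines (the namespace and expiry time below are made up):

          use strict;
          use warnings;
          use Cache::FileCache;

          # Hypothetical namespace and expiry - per its own docs, the filesystem
          # cache is usually about as fast as Cache::SharedMemoryCache.
          my $cache = Cache::FileCache->new( {
              namespace          => 'config',
              default_expires_in => 600,    # seconds
          } );

          $cache->set( 'some_config_key', 'some value' );
          my $value = $cache->get('some_config_key');    # undef if missing or expired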

        That's just a particularly slow cache module. Most of them are slow, but the ones I mentioned farther down this thread are fast.
      Profiling lookups on one of our tables using the existing code gives 670 lookups/second with the (poorly written, IMHO) database solution, versus 56,818/s for a direct perl hash lookup. That's a 100-fold improvement, and it's one that I can implement.

      Well, we have indexes on all our key fields, and we don't do any writes.

      Ok - you have indices, but are you using them? What is the execution plan for your queries? Why are you pulling all your data out instead of looking at it when you need it?
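
      If I remember Informix correctly, you can turn on SET EXPLAIN and read the plan it writes out; roughly like this (DSN and query are hypothetical):

          use strict;
          use warnings;
          use DBI;

          # Hypothetical DSN and query.  With SET EXPLAIN ON, Informix writes the
          # query plan to sqexplain.out in the current directory.
          my $dbh = DBI->connect( 'dbi:Informix:config_db', '', '', { RaiseError => 1 } );
          $dbh->do('SET EXPLAIN ON');

          my $rows = $dbh->selectall_arrayref(
              'SELECT value FROM config WHERE name = ?', undef, 'some_config_key' );

          # Then check sqexplain.out: a "SEQUENTIAL SCAN" line means the index
          # is not being used for that query.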

      There are sooo many things involved in optimizing this kind of thing that you really need an expert on the ground, so to speak, to give you more than just very generic places to look.

      Being right, does not endow the right to be rude; politeness costs nothing.
      Being unknowing, is not the same as being stupid.
      Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
      Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re^2: How best to optimize shared data
by jpk236 (Monk) on Feb 04, 2005 at 15:52 UTC
    good reply; listen to the man.
    normalized tables and indexes are key!
    properly implemented indexes took a script of mine from an 8-minute runtime to a 1-2 minute runtime. Long story short, they can help a lot!
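
    for illustration, adding an index is a one-liner (the DSN, table, and column names here are made up):

        use strict;
        use warnings;
        use DBI;

        # Hypothetical DSN, table, and column: index the column the WHERE clause uses.
        my $dbh = DBI->connect( 'dbi:Informix:config_db', '', '', { RaiseError => 1 } );
        $dbh->do('CREATE INDEX idx_config_name ON config (name)');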

    Justin