
If I understand correctly: if on average each company has answered 'yes' to ~500 questions, I'll have a two-column, 15-million-row table?

You're going to have 15 million somethings - you have a lot of data! This is one of the most efficient ways to represent a large sparse matrix. Also, having the data in this form will allow you to answer many other kinds of questions.

As I read your original problem statement, you are only interested in the number of companies in each intersection. Moreover, it seems that you only need to cross-index company group by product class. I would create two intersection tables: one for the company-group answers (call it CG), and another for the company-product-class answers (call it CP).

You'll still have the 15M rows, but they'll be split across the two tables.
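A sketch of what those two tables might look like (the column names match the query below; the types are guesses, adapt them to your data):

CREATE TABLE CG (           -- company-group answers
    company INT NOT NULL,   -- company id
    answer  INT NOT NULL    -- group this company answered 'yes' to
);
CREATE TABLE CP (           -- company-product-class answers
    company INT NOT NULL,   -- company id
    answer  INT NOT NULL    -- product class this company answered 'yes' to
);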

Then you can do this with one query:

SELECT CG.answer AS cgroup, CP.answer AS pclass, COUNT(*)
FROM CG, CP
WHERE CG.company = CP.company
GROUP BY cgroup, pclass;

With the proper indexes, I think you'll be surprised at how fast this can be computed. Depending on the database you're using, you might want a LEFT JOIN instead of a plain join; you'll be able to tell by running EXPLAIN on the different queries.
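For example, something along these lines (a sketch; the index names are made up, and syntax may vary slightly by database):

-- Composite indexes let the join and grouping be resolved
-- from the indexes alone, without touching the base tables:
CREATE INDEX cg_company_answer ON CG (company, answer);
CREATE INDEX cp_company_answer ON CP (company, answer);

-- Compare the plans before settling on a join style:
EXPLAIN SELECT CG.answer AS cgroup, CP.answer AS pclass, COUNT(*)
FROM CG, CP
WHERE CG.company = CP.company
GROUP BY cgroup, pclass;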

You could do it with the original schema too, but it would be harder to restrict the join condition to just company group and product class.

Re^6: Feasability of quick generation of complex report from large db?
by punch_card_don (Curate) on Feb 06, 2008 at 19:47 UTC
    Very interesting. I'm liking this immensely, so I'd like to ask some questions despite knowing they won't make me look smart:

    Your query returns the count for every possible cgroup x pclass combination every time. That's 70 x 12,000 = 840,000 count results returned. Then what - in Perl, take just the combinations I'm interested in? No - maybe I can add to your query: since a report uses just 5 industry_groups, I'd add:

    SELECT CG.answer AS cgroup, CP.answer AS pclass, COUNT(*)
    FROM CG, CP
    WHERE CG.company = CP.company
      AND (CG.answer = 'a' OR ... OR CG.answer = 'e')
    GROUP BY cgroup, pclass;
    Which should reduce the result set significantly.

    But - a single query on 15 million rows that returns ~50,000 results, faster than 350 queries on pairs of little ~10k- and ~1k-row tables... sounds possible.

      With indexes on the right columns, definitely. Tables Answer1, Answer2, Answer3, ... in a database are at least as much a red flag as variables $Answer1, $Answer2, $Answer3, ... in Perl (or any other even remotely sensible programming language). Just as those variables should be made an array with the number as the index, those tables should be merged into a single table with the number made a column.
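      For example (a sketch; the table and column names here are made up for illustration):

      -- Instead of one table per question:
      --   Answer1(company, answer), Answer2(company, answer), ...
      -- keep a single table with the question number as data:
      CREATE TABLE answers (
          company  INT NOT NULL,
          question INT NOT NULL,   -- the number that used to live in the table name
          answer   VARCHAR(10) NOT NULL,
          PRIMARY KEY (company, question)
      );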

      If you ever find yourself having to maintain a project with tables like that (it doesn't have to be foo1, foo2, ...; it can be users_clientname1, users_otherclient, ...), run! Either to kick some butt or to find a new employer. I once had to do a few things in such a project; the number of TABLES in the database was in the tens of thousands. Insane. And all the butts responsible for that mess were long gone :-[

      Another option is to pre-compute this big query just once and store the results in a table. Your smaller queries would then be simple selects from that table, which would make them run really fast (if interactive response time is an issue).
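      Something along these lines (a sketch; the summary table and index names are made up, and CG/CP are the tables proposed above):

      -- Run the expensive aggregation once and keep the result:
      CREATE TABLE group_class_counts AS
      SELECT CG.answer AS cgroup, CP.answer AS pclass, COUNT(*) AS n
      FROM CG, CP
      WHERE CG.company = CP.company
      GROUP BY cgroup, pclass;

      CREATE INDEX gcc_cgroup ON group_class_counts (cgroup);

      -- Each report query is then a cheap indexed lookup:
      SELECT pclass, n FROM group_class_counts WHERE cgroup = 'a';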