in reply to Re^2: Feasability of quick generation of complex report from large db?
in thread Feasability of quick generation of complex report from large db?

So, all the info could have been put into one big 30,000-row x 12,000-column table ...
I don't think you want a 12,000 column table. Just use an 'intersection' table which has just two columns, one for the company and the other for the question they answered in the affirmative. There would not be any rows for questions they answered 'no' to.

Your example query can be performed using a self-join like this:

SELECT a.company FROM CA AS a, CA AS b WHERE a.company = b.company AND a.answer = 'group_a' AND b.answer = 'product_class_c';
(It should be obvious that indexes on the 'company' and 'answer' columns would be very helpful.)

Note that this method allows you to add more questions without having to change the schema of the table.
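To make the intersection-table idea concrete, here is a minimal sketch using SQLite from Python. The table and column names (CA, company, answer) follow the post; the sample companies and the exact index names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Intersection table: one row per (company, question answered 'yes').
# No rows at all for 'no' answers.
cur.execute("CREATE TABLE CA (company TEXT, answer TEXT)")
cur.execute("CREATE INDEX idx_ca_company ON CA (company)")
cur.execute("CREATE INDEX idx_ca_answer ON CA (answer)")

cur.executemany("INSERT INTO CA VALUES (?, ?)", [
    ("acme",  "group_a"),
    ("acme",  "product_class_c"),
    ("bigco", "group_a"),          # no product_class_c row for bigco
])

# Self-join: companies that answered 'yes' to BOTH questions.
rows = cur.execute("""
    SELECT a.company FROM CA AS a, CA AS b
    WHERE a.company = b.company
      AND a.answer = 'group_a'
      AND b.answer = 'product_class_c'
""").fetchall()
print(rows)  # only 'acme' satisfies both conditions
```

Adding a new question is just inserting rows with a new answer value; the schema never changes.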

Re^4: Feasability of quick generation of complex report from large db?
by punch_card_don (Curate) on Feb 06, 2008 at 17:55 UTC
    I don't think you want a 12,000 column table.

    On this, we are in complete agreement.

    Just use an 'intersection' table which has just two columns, one for the company and the other for the question they answered in the affirmative. There would not be any rows for questions they answered 'no' to.

    If I understand correctly, if on average each company has answered 'yes' to ~500 questions, I'll have a two-column, 15-million-row table?

    And I'd still have 350 individual queries to run on this table, no?

      If I understand correctly, if on average each company has answered 'yes' to ~500 questions, I'll have a two-column, 15-million-row table?

      You're going to have 15 million somethings - you have a lot of data! This is one of the most efficient ways to represent a large sparse matrix. Also, having the data in this form will allow you to answer many other kinds of questions.

      As I read your original problem statement, you are only interested in the number of companies in each intersection. Moreover, it seems that you are only interested in cross-indexing company group by product class. I would create two intersection tables, one for the company-group answers (call it CG), and another for the company product-class answers (call it CP).

      You'll still have the 15M rows, but it will be split between two tables.

      Then you can do this with one query:

      SELECT CG.answer as cgroup, CP.answer as pclass, count(*) FROM CG, CP WHERE CG.company = CP.company GROUP BY cgroup, pclass;

      With the proper indexes, I think you'll be surprised at how fast this can be computed. Depending on the database you're using, you might want to do a LEFT JOIN instead of a regular join. You'll be able to tell by doing an explain on the different queries.

      You can do it with the original schema, except that it would be harder to restrict the join condition to just company-group and product-class.
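A minimal sketch of the two-table cross-tab (SQLite via Python; the CG and CP names come from the post, the sample companies and group/class values are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# CG: company-group answers; CP: product-class answers.
cur.execute("CREATE TABLE CG (company TEXT, answer TEXT)")
cur.execute("CREATE TABLE CP (company TEXT, answer TEXT)")
cur.execute("CREATE INDEX idx_cg_company ON CG (company)")
cur.execute("CREATE INDEX idx_cp_company ON CP (company)")

cur.executemany("INSERT INTO CG VALUES (?, ?)",
                [("acme", "a"), ("bigco", "a"), ("bigco", "b")])
cur.executemany("INSERT INTO CP VALUES (?, ?)",
                [("acme", "c"), ("bigco", "c")])

# One query produces the count for every (group, class) intersection.
rows = cur.execute("""
    SELECT CG.answer AS cgroup, CP.answer AS pclass, COUNT(*)
    FROM CG, CP
    WHERE CG.company = CP.company
    GROUP BY cgroup, pclass
""").fetchall()
print(sorted(rows))
```

Here both acme and bigco land in the (a, c) cell, and bigco alone in (b, c), so the counts are 2 and 1 respectively.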

        Very interesting. I'm liking this immensely, so I'd like to ask some questions despite knowing they won't make me look smart:

        Your query returns the count for every possible cgroup x pclass combination every time. That's 70 x 12,000 = 840,000 count results returned. Then what - in Perl, keep just the combinations I'm interested in? No - maybe I can add to your query: since a report uses just 5 industry_groups, I'd add:

        SELECT CG.answer as cgroup, CP.answer as pclass, count(*) FROM CG, CP WHERE CG.company = CP.company AND (cgroup='a' OR ..... cgroup='e') GROUP BY cgroup, pclass;
        Which will reduce results significantly.

        But - a single query on 15 million rows that returns ~50,000 results, faster than 350 queries on pairs of little ~10k- and ~1k-row tables ..... sounds possible.
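The five-group restriction can also be written with an IN list instead of chained ORs, and filtering on the base column (CG.answer) rather than the output alias keeps the query portable, since many databases reject aliases in WHERE. A sketch, again SQLite via Python with made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE CG (company TEXT, answer TEXT)")
cur.execute("CREATE TABLE CP (company TEXT, answer TEXT)")
cur.executemany("INSERT INTO CG VALUES (?, ?)",
                [("acme", "a"), ("bigco", "z")])
cur.executemany("INSERT INTO CP VALUES (?, ?)",
                [("acme", "c"), ("bigco", "c")])

# Restrict the cross-tab to the five groups a given report actually uses.
rows = cur.execute("""
    SELECT CG.answer AS cgroup, CP.answer AS pclass, COUNT(*)
    FROM CG, CP
    WHERE CG.company = CP.company
      AND CG.answer IN ('a', 'b', 'c', 'd', 'e')
    GROUP BY cgroup, pclass
""").fetchall()
print(rows)  # bigco (group 'z') is filtered out before grouping
```

With an index on CG.answer, the database can narrow the join to the five groups first, shrinking the result set well before the GROUP BY.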