in reply to Re: Feasability of quick generation of complex report from large db?
in thread Feasability of quick generation of complex report from large db?

Wonderfully rich reply - thanks.

But I think, maybe, some miscommunications.

To put it into similar terms:

So, all the info could have been put into one big 30,000 row x 12,000 column table, and then just query

SELECT company FROM big_table WHERE (group_a = 1 AND product_class_C = 1);

But I still have to repeat that query for 350 combinations of group and product_class.

Originally, there was only going to be a requirement to produce a half-dozen cross-references at a time. Hence the idea of the reverse-index tables: it's easy and quick to identify the ten tables needed, find the intersection of these little (<= 10 KB) tables that are often only a few hundred rows long, and sum the size-factors.
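To make the reverse-index idea concrete, here's a minimal sketch using SQLite in memory. The table names (group_a, product_class_c) and the size_factor column are hypothetical stand-ins for the real schema; the point is just that intersecting two small per-category tables and summing a factor is a cheap query.

```python
# Sketch of the reverse-index approach: one small table per category,
# intersect two of them, sum the size factors. Names and data invented.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE group_a (company TEXT, size_factor INTEGER);
    CREATE TABLE product_class_c (company TEXT);
    INSERT INTO group_a VALUES ('Acme', 3), ('Bolt', 1), ('Cogs', 2);
    INSERT INTO product_class_c VALUES ('Acme'), ('Cogs'), ('Dyna');
""")

# Companies in both categories: Acme and Cogs; sum their size factors.
(total,) = db.execute("""
    SELECT SUM(size_factor) FROM group_a
    WHERE company IN (SELECT company FROM product_class_c)
""").fetchone()
print(total)  # Acme (3) + Cogs (2) = 5
```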

But THEN someone said "Hey, that's great - and fast - here, do that for this report of 350 intersections."

Update: I appear to have at least partially answered my own question - even if it was done the other way, there'd still be 350 queries to run, except each one on a full 30k-row x 12k column table!


Re^3: Feasability of quick generation of complex report from large db?
by talexb (Chancellor) on Feb 06, 2008 at 17:01 UTC

    This sounds like a bit of heavy lifting for a database server -- but not an impossibility. What I'm not getting is how often you have to regenerate this big table. Every time a web user wants to look at it? Every time something in the underlying database changes?

    I would imagine you could generate and then cache the 'latest version' of the table at the end of every day, assuming new data is added throughout the day.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

      Hey talexb,

      Yeah, every time a user wants to see it is the requirement. The idea being circulated right now is that various combinations of product_classes and industry_groups will be dreamed up for a report. There are likely to be ~6,000 of these reports. Yes, someone is going to dream up ~6,000 combinations of 350 intersections. Then a user goes to the list of reports and selects which one he wants to see today.

      And so, yes, I'm toying with the pre-generation option too, since once the data has been loaded, it never, ever changes. But at some point, you just know they're going to want to be able to generate every possible set and sub-set of intersections, so I think the pre-generation of static files would be a temporary delay of the inevitable pain.

Re^3: Feasability of quick generation of complex report from large db?
by pc88mxer (Vicar) on Feb 06, 2008 at 17:15 UTC

    So, all the info could have been put into one big 30,000 row x 12,000 column table ...
    I don't think you want a 12,000 column table. Just use an 'intersection' table which has just two columns, one for the company and the other for the question they answered in the affirmative. There would not be any rows for questions they answered 'no' to.

    Your example query can be performed using a self-join like this:

    SELECT a.company FROM CA AS a, CA AS b WHERE a.company = b.company AND a.answer = 'group_a' AND b.answer = 'product_class_c';
    (It should be obvious that indexes on the 'company' and 'answer' columns would be very helpful.)

    Note that this method allows you to add more questions without having to change the schema of the table.
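A minimal sqlite3 sketch of the two-column 'intersection' table and the self-join above; the table name CA and the answer labels are from the post, but the sample companies and data are invented.

```python
# Two-column intersection table: one row per (company, 'yes' answer).
# The self-join finds companies that answered 'yes' to both questions.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE CA (company TEXT, answer TEXT);
    CREATE INDEX ca_company ON CA (company);
    CREATE INDEX ca_answer  ON CA (answer);
    INSERT INTO CA VALUES
        ('Acme', 'group_a'), ('Acme', 'product_class_c'),
        ('Bolt', 'group_a'),
        ('Cogs', 'product_class_c');
""")

rows = db.execute("""
    SELECT a.company FROM CA AS a, CA AS b
    WHERE a.company = b.company
      AND a.answer = 'group_a' AND b.answer = 'product_class_c'
""").fetchall()
print(rows)  # only Acme answered 'yes' to both
```

Adding a new question is just more rows with a new answer label; no schema change needed.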

      I don't think you want a 12,000 column table.

      On this, we are in complete agreement.

      Just use an 'intersection' table which has just two columns, one for the company and the other for the question they answered in the affirmative. There would not be any rows for questions they answered 'no' to.

      If I understand correctly: if, on average, each company has answered 'yes' to ~500 questions, I'll have a two-column, 15-million-row table?

      And I'd still have 350 individual queries to run on this table, no?

        If I understand correctly: if, on average, each company has answered 'yes' to ~500 questions, I'll have a two-column, 15-million-row table?

        You're going to have 15 million somethings either way - you have a lot of data! This is one of the most efficient ways to represent a large sparse matrix. Also, having the data in this form will allow you to answer many other kinds of questions.

        As I read your original problem statement, you are only interested in the number of companies in each intersection. Moreover, it seems that you are only interested in cross-indexing company group by product class. I would create two intersection tables, one for the company-group answers (call it CG), and another for the company product-class answers (call it CP).

        You'll still have the 15M rows, but it will be split between two tables.

        Then you can do this with one query:

        SELECT CG.answer as cgroup, CP.answer as pclass, count(*) FROM CG, CP WHERE CG.company = CP.company GROUP BY cgroup, pclass;

        With the proper indexes, I think you'll be surprised at how fast this can be computed. Depending on the database you're using, you might want to do a LEFT JOIN instead of a regular join. You'll be able to tell by doing an explain on the different queries.

        You could also do this with the original single-table schema, except that it would be harder to restrict the join condition to just company-group and product-class answers.