in reply to Feasability of quick generation of complex report from large db?

Several reactions to this...

Replies are listed 'Best First'.
Re^2: Feasability of quick generation of complex report from large db?
by punch_card_don (Curate) on Feb 06, 2008 at 16:35 UTC
    Wonderfully rich reply - thanks.

    But I think, maybe, some miscommunications.

    To put it into similar terms:

    • I have companies (ex: ABC Inc, XYZ Inc, ...) and I have product classes (ex: pens, liquid nitrogen, desk chairs, fork lifts, plywood, fourier-transform-infrared-spectrometres,...)

    • yes, one of the attributes of a company is its size_factor

    • a product_class has no attributes

    • Someone has canvassed ~30,000 companies and asked them:
      • do you belong to industry group X (ex: more than X employees)?
      • do you belong to industry group Y (ex: located in Ontario, California)?
      • do you belong to industry group Z(ex: less than 5 years in business)?
      • do you use product class A?
      • do you use product class B?
      • do you use product class C?
      • ...and so on, for ~12,000 product classes.

    So, all the info could have been put into one big 30,000 row x 12,000 column table, and then just query

    SELECT company WHERE (group_a = 1 and product_class_C = 1);

    But I still have to repeat that query for 350 combinations of group and product_class.

    Originally, there was only going to be a requirement to produce a half-dozen cross-references at a time. So, came the idea of the reverse-index tables - easy and quick to identify the ten tables needed, find the intersection of these little <= 10Kb tables that are often only a few hundred rows long, and sum the size-factors.

    But THEN someone said "Hey, that's great - and fast - here, do that for this report of 350 intersections."

    Update: I appear to have at least partially answered my own question - even if it was done the other way, there'd still be 350 queries to run, except each one on a full 30k-row x 12k column table!

      This sounds like a bit of heavy lifting for a database server -- but not an impossibility. What I'm not getting is how often you have to regenerate this big table. Every time a web user wants to look at it? Every time something in the underlying database changes?

      I would imagine you could generate and then cache the 'latest version' of the table at the end of every day, assuming new data is added throughout the day.

      Alex / talexb / Toronto

      "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

        Hey TAlexb,

        Ya, everytime a user wants to see it is the request. The idea being circulated right now is that various combinations of product_classes and industry_groups will be dreamed up for a report. There are likely to be ~6,000 of these reports. Yes, someone is going to dream up ~6,000 combinations of 350 intersections. Then a user goes to the list of reports and selects which he wants to see today.

        And so, yes, I'm toying with the pre-generation option too, since once the data has been loaded, it never, ever changes. But at some point, you just know they're going to want to be able to generate every possible set and sub-set of intersections, so I think the pre-generation of static files would be a temporary delay of the inevitable pain.

      So, all the info could have been put into one big 30,000 row x 12,000 +column table ...
      I don't think you want a 12,000 column table. Just use an 'intersection' table which has just two columns, one for the company and the other for the question they answered in the affirmative. There would not be any rows for questions they answered 'no' to.

      Your example query can be performed using a self-join like this:

      SELECT a.company from CA as a, CA as b where a.company = b.company and + a.answer = 'group_a' and b.answer = 'product_class_c';
      (It should be obvious that indexes on the 'company' and 'answer' column would be very helpful.)

      Note that this method allows you to add more questions without having to change the schema of the table.

        I don't think you want a 12,000 column table.

        On this, we are in complete agreement.

        Just use an 'intersection' table which has just two columns, one for the company and the other for the question they answered in the affirmative. There would not be any rows for questions they answered 'no' to.

        If I understand correctly if on average each company has answered 'yes' to ~500 questions, I'll have a two-column, 15-million row table?

        And I'd still have 350 individual queries to run on this table, no?