punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Mysterious Monks,

Does this sound at all reasonable to undertake on a LAMP platform?

Imagine a table of numbers. ~70 rows & 5 columns. The rows represent classes of products, the columns represent groups of businesses. A cell, the intersection of a row and column, holds the cross-referenced population of companies that belong to that group and use that class of product. For example, group = 'widget manufacturers' that use product = 'foo bars'.

You have a MySQL DB of lists of companies. The table 'widget manufacturers' lists all the companies that manufacture widgets. The table 'foo bars' lists all the companies that use foo bars. Each table includes a 'company size factor' column for every company in the list. To find the number that goes in the cell at their intersection, you have to run an SQL query to find the intersection of the two tables' 'company' columns, then sum the 'company size factor' for every company in the results.
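To make that concrete, the query behind one cell looks something like this (simplified; the real table and column names differ, and 'size_factor' here stands in for the 'company size factor' column):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:mysql:database=mydb', 'user', 'pass',
                            { RaiseError => 1 } );

    # One cell: companies on the 'widget manufacturers' list that also
    # appear on the 'foo bars' list, with their size factors summed.
    my ($cell) = $dbh->selectrow_array(q{
        SELECT SUM(g.size_factor)
        FROM   `widget manufacturers` g
        JOIN   `foo bars`             p ON p.company = g.company
    });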

There are only a few hundred business group tables, and ~12,000 product class tables. But the output table you have to generate will always include only 70 product classes and 5 groups. There are ~30,000 companies. This is a ~100 MB database, although each individual table might only be ~10 KB on average, with anywhere from 10 (table 'enriched uranium') to 10,000 (table 'ballpoint pens') rows.

So to produce an output table, you have 5 x 70 = 350 intersection-and-sum SQL queries to run. There are a few basic queries to run to get related data for calculations, so it's going to be ~400 queries in total.

To complete the table, there are 3 calculations to perform on each result and report in additional columns beside the base number, plus some related calculations - roughly 1,500 calculations to add.

Then output it all to a browser in table format.

And of course, users want this table spit out in the blink of an eye. This is the part I wonder about: the speed. This will be Perl DBI and a MySQL DB. I'm not Google's development & optimization team.

I know there are a lot of variables, but is it within the realm of reason to imagine it possible to code something that will spit out a report in, say, under 5 to 10 seconds? Or is this so obviously impossible that I should just say so and save everyone a lot of heartache?

Thanks




Forget that fear of gravity,
Get a little savagery in your life.

Re: Feasibility of quick generation of complex report from large db?
by sundialsvc4 (Abbot) on Feb 06, 2008 at 15:38 UTC

    Several reactions to this...

    • The original table that you ask us to “imagine” is a crosstab (a statistical term, not a database term). This is probably a result format, not an appropriate base-representation. Likewise, you really do not want to have “12,000 tables.”
    • Instead, let's generalize further:
      • You have widgets and you have companies.
      • One of the attributes of a company is that company's size_factor.
      • One of the attributes of a widget is its class.
    • Joining these two tables is a third table, which we might call (say...) company_widget. Its purpose is to show, e.g. using a pair of columns named company and widget, that a particular company uses a particular widget. (If there are particular attributes of that use, that is, salient details of how a particular company uses a particular widget, those details would be represented as additional columns in table company_widget, because they are attributes not of the widget alone, nor of the company alone, but of the relationship described by that table.) This is what we call a many-to-many relationship.
    • So now, instead of “tens of thousands of” tables, we now have only three.
    • What you're going to do, then, is to run probably one SQL query which uses clauses such as GROUP BY to obtain information about a selected company's use of widgets. This result is going to come to you as a sequential list of value pairs. You would then read that and put it into a two-dimensional Perl array, after filling the entire array with initial zeroes. In other words, you are creating the crosstab of the result-list you have just obtained from SQL. (A rough sketch follows this list.)
    • HTML templating systems like Template Toolkit (in CPAN) can build an HTML table from a Perl hash using just template-directives.
    • Quite probably, this “immensely complicated problem” has just reduced itself to ... trivial. The “hundreds of” queries that seem at first-blush to be needed to solve this problem might well have been reduced to just one.
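    For instance, a rough sketch (the business groups aren't spelled out above, so the company_group table and all column names here are assumptions): one GROUP BY query returns (group, class, sum-of-size-factors) triples, which get poured into a zero-filled two-dimensional Perl hash, ready for the templating step:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:mysql:database=mydb', 'user', 'pass',
                                { RaiseError => 1 } );

        my @groups  = ( 'widget manufacturers', 'some other group' );  # the 5 groups
        my @classes = ( 'foo bars', 'ballpoint pens' );                # the 70 classes

        my $g_in = join ',', ('?') x @groups;
        my $c_in = join ',', ('?') x @classes;

        # One aggregate query instead of 350 intersection-and-sum queries.
        my $sth = $dbh->prepare(qq{
            SELECT cg.grp, w.class, SUM(c.size_factor)
            FROM   company        c
            JOIN   company_group  cg ON cg.company = c.company
            JOIN   company_widget cw ON cw.company = c.company
            JOIN   widget         w  ON w.widget   = cw.widget
            WHERE  cg.grp  IN ($g_in)
              AND  w.class IN ($c_in)
            GROUP  BY cg.grp, w.class
        });
        $sth->execute( @groups, @classes );

        # Zero-fill the crosstab so empty cells still appear in the report.
        my %cell;
        for my $g (@groups) {
            $cell{$g}{$_} = 0 for @classes;
        }
        while ( my ( $g, $class, $sum ) = $sth->fetchrow_array ) {
            $cell{$g}{$class} = $sum;
        }
        # %cell is now ready to hand to Template Toolkit to emit the HTML table.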
      Wonderfully rich reply - thanks.

      But I think there may be some miscommunication.

      To put it into similar terms:

      • I have companies (ex: ABC Inc, XYZ Inc, ...) and I have product classes (ex: pens, liquid nitrogen, desk chairs, fork lifts, plywood, Fourier-transform infrared spectrometers, ...)

      • yes, one of the attributes of a company is its size_factor

      • a product_class has no attributes

      • Someone has canvassed ~30,000 companies and asked them:
        • do you belong to industry group X (ex: more than X employees)?
        • do you belong to industry group Y (ex: located in Ontario, California)?
        • do you belong to industry group Z (ex: less than 5 years in business)?
        • do you use product class A?
        • do you use product class B?
        • do you use product class C?
        • ...and so on, for ~12,000 product classes.

      So, all the info could have been put into one big 30,000 row x 12,000 column table, and then just query

      SELECT company FROM big_table WHERE (group_a = 1 AND product_class_C = 1);

      But I still have to repeat that query for 350 combinations of group and product_class.

      Originally, there was only going to be a requirement to produce a half-dozen cross-references at a time. So came the idea of the reverse-index tables - easy and quick to identify the ten tables needed, find the intersection of these little <= 10 KB tables that are often only a few hundred rows long, and sum the size factors.

      But THEN someone said "Hey, that's great - and fast - here, do that for this report of 350 intersections."

      Update: I appear to have at least partially answered my own question - even if it was done the other way, there'd still be 350 queries to run, except each one on a full 30k-row x 12k column table!

        This sounds like a bit of heavy lifting for a database server -- but not an impossibility. What I'm not getting is how often you have to regenerate this big table. Every time a web user wants to look at it? Every time something in the underlying database changes?

        I would imagine you could generate and then cache the 'latest version' of the table at the end of every day, assuming new data is added throughout the day.

        Alex / talexb / Toronto

        "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

        So, all the info could have been put into one big 30,000 row x 12,000 column table ...
        I don't think you want a 12,000 column table. Just use an 'intersection' table which has just two columns, one for the company and the other for the question they answered in the affirmative. There would not be any rows for questions they answered 'no' to.

        Your example query can be performed using a self-join like this:

        SELECT a.company FROM CA AS a, CA AS b WHERE a.company = b.company AND a.answer = 'group_a' AND b.answer = 'product_class_c';
        (It should be obvious that indexes on the 'company' and 'answer' columns would be very helpful.)

        Note that this method allows you to add more questions without having to change the schema of the table.
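        To sketch that end-to-end (purely illustrative: the separate companies(company, size_factor) table and the answer strings used below are assumptions), the whole 350-cell report could then come from a single aggregate query instead of 350 self-joins:

            use strict;
            use warnings;
            use DBI;

            my $dbh = DBI->connect( 'dbi:mysql:database=mydb', 'user', 'pass',
                                    { RaiseError => 1 } );

            my @groups  = map { "group_$_" } 'a' .. 'e';         # the 5 groups wanted
            my @classes = map { "product_class_$_" } 1 .. 70;    # the 70 classes wanted

            my $g_in = join ',', ('?') x @groups;
            my $c_in = join ',', ('?') x @classes;

            my $sth = $dbh->prepare(qq{
                SELECT g.answer, p.answer, SUM(c.size_factor)
                FROM   CA g
                JOIN   CA p        ON p.company = g.company
                JOIN   companies c ON c.company = g.company
                WHERE  g.answer IN ($g_in)
                  AND  p.answer IN ($c_in)
                GROUP  BY g.answer, p.answer
            });
            $sth->execute( @groups, @classes );
            while ( my ( $grp, $class, $sum ) = $sth->fetchrow_array ) {
                print "$grp\t$class\t$sum\n";    # or stash in a hash for the crosstab
            }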

Re: Feasibility of quick generation of complex report from large db?
by BrowserUk (Patriarch) on Feb 07, 2008 at 01:39 UTC

    A couple of thoughts.

    1. Your entire 30,000 companies by 12,000 product classes and 500 groups can be stored in a flat file in 45 MB.

      This can be slurped into memory or scanned record by record. Either way, the results of your 350 queries can be determined in ~100 seconds.

      The data is stored 1-bit per yes/no answer. 12,500/8 * 30,000 == 44.7 MB.

      The code to generate a test DB, generate 350 test queries, and run them, is 50 lines. (No formatting of the report is done.) A rough sketch of the bit-packing idea appears after this list.

    2. If this is survey data, it is static once the survey is finished. So why does it matter how long it takes to generate the report?

      You only need run the queries once and store the results for reuse each time the web page is needed. Or better still, generate and store the html.

      If you want the viewer to be able to sub-select the results produced for them, then generate a separate html snippet for each of the 350 categories and merge the required bits once you know what they want.
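    For illustration only (this is not the 50-line test code mentioned above; the question numbering and record layout are invented), the bit-per-answer packing and one intersection-and-sum scan might look roughly like this using vec:

        use strict;
        use warnings;

        # Assumed layout: questions numbered 0 .. 12_499 (groups first,
        # then product classes); one fixed-length record per company,
        # 1 bit per yes/no answer.
        my $QUESTIONS = 12_500;
        my $REC_LEN   = int( ( $QUESTIONS + 7 ) / 8 );    # 1563 bytes per company

        # Pack one company's 'yes' answers (a list of question numbers).
        sub pack_company {
            my @yes = @_;
            my $rec = "\0" x $REC_LEN;
            vec( $rec, $_, 1 ) = 1 for @yes;
            return $rec;
        }

        # Sum the size factors of companies that said yes to both questions.
        # $size_factor is an array ref indexed by record position in the file.
        sub intersect_sum {
            my ( $file, $q_group, $q_class, $size_factor ) = @_;
            open my $fh, '<:raw', $file or die "open $file: $!";
            my ( $i, $sum ) = ( 0, 0 );
            while ( read( $fh, my $rec, $REC_LEN ) == $REC_LEN ) {
                $sum += $size_factor->[$i]
                    if vec( $rec, $q_group, 1 ) && vec( $rec, $q_class, 1 );
                $i++;
            }
            return $sum;
        }

    Run intersect_sum once per cell (350 calls) over the same 45 MB file, or slurp the file into memory once and reuse it.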


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Feasibility of quick generation of complex report from large db?
by punch_card_don (Curate) on Feb 08, 2008 at 01:46 UTC
    Appreciate all the input. I'm thinking about this project with this page open.....