punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Mysterious Monks,

Does this sound at all reasonable to undertake on a LAMP platform?

Imagine a table of numbers. ~70 rows & 5 columns. The rows represent classes of products, the columns represent groups of businesses. A cell, the intersection of a row and column, holds the cross-referenced population of companies that belong to that group and use that class of product. For example, group = 'widget manufacturers' that use product = 'foo bars'.

You have a MySQL DB of lists of companies. The table 'widget manufacturers' lists all the companies that manufacture widgets. The table 'foo bars' lists all the companies that use foo bars. Each table includes a 'company size factor' column for every company in the list. To find the number that goes in the cell at their intersection, you have to run an SQL query to find the intersection of the two tables' 'company' columns, then sum the 'company size factor' for every company in the results.
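To make that concrete, the query behind one cell looks something like this (simplified; the real table and column names differ, and 'size_factor' here stands in for the 'company size factor' column):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:mysql:database=mydb', 'user', 'pass',
                            { RaiseError => 1 } );

    # One cell: companies on the 'widget manufacturers' list that also
    # appear on the 'foo bars' list, with their size factors summed.
    my ($cell) = $dbh->selectrow_array(q{
        SELECT SUM(g.size_factor)
        FROM   `widget manufacturers` g
        JOIN   `foo bars`             p ON p.company = g.company
    });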

There are only a few hundred business group tables, and ~12,000 product class tables. But the output table you have to generate will always include only 70 product classes and 5 groups. There are ~30,000 companies. This is a ~100 MB database, although each individual table might only be ~10 KB on average, with anywhere from 10 (table 'enriched uranium') to 10,000 (table 'ballpoint pens') rows.

So to produce an output table, you have 5 x 70 = 350 intersection-and-sum SQL queries to run. There are a few basic queries to run to get related data for calculations, so it's going to be ~400 queries in total.

To complete the table, there are 3 calculations to perform on each result and report in additional columns beside the base number, plus some related calculations - roughly 1,500 calculations to add.

Then output it all to a browser in table format.

And of course, users want this table spit out in the blink of an eye. This is the part I wonder about: the speed. This will be Perl DBI and a MySQL DB. I'm not Google's development & optimization team.

I know there are a lot of variables, but is it within the realm of reason to imagine it possible to code something that will spit out a report in, say, under 5 to 10 seconds? Or is this so obviously impossible that I should just say so and save everyone a lot of heartache?

Thanks




Forget that fear of gravity,
Get a little savagery in your life.

Re: Feasibility of quick generation of complex report from large db?
by sundialsvc4 (Abbot) on Feb 06, 2008 at 15:38 UTC

    Several reactions to this...

    • The original table that you ask us to “imagine” is a crosstab (a statistical term, not a database term). This is probably a result format, not an appropriate base-representation. Likewise, you really do not want to have “12,000 tables.”
    • Instead, let's generalize further:
      • You have widgets and you have companies.
      • One of the attributes of a company is that company's size_factor.
      • One of the attributes of a widget is its class.
    • Joining these two tables is a third table, which we might call (say...) company_widget. Its purpose is to show, e.g. using a pair of columns named company and widget, that a particular company uses a particular widget. (If there are particular attributes of that use, that is, salient details of how a particular company uses a particular widget, those details would be represented as additional columns in table company_widget, because they are attributes not of the widget alone, nor of the company alone, but of the relationship described by that table.) This is what we call a many-to-many relationship.
    • So now, instead of “tens of thousands of” tables, we now have only three.
    • What you're going to do, then, is to run probably one SQL query which uses clauses such as GROUP BY to obtain information about a selected company's use of widgets. This result is going to come to you as a sequential list of value pairs. You would then read that and put it into a two-dimensional Perl array, after filling the entire array with initial zeroes. In other words, you are creating the crosstab of the result-list you have just obtained from SQL. (A rough sketch follows this list.)
    • HTML templating systems like Template Toolkit (in CPAN) can build an HTML table from a Perl hash using just template-directives.
    • Quite probably, this “immensely complicated problem” has just reduced itself to ... trivial. The “hundreds of” queries that seem at first-blush to be needed to solve this problem might well have been reduced to just one.
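    For instance, a rough sketch (the business groups aren't spelled out above, so the company_group table and all column names here are assumptions): one GROUP BY query returns (group, class, sum-of-size-factors) triples, which get poured into a zero-filled two-dimensional Perl hash, ready for the templating step:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:mysql:database=mydb', 'user', 'pass',
                                { RaiseError => 1 } );

        my @groups  = ( 'widget manufacturers', 'some other group' );  # the 5 groups
        my @classes = ( 'foo bars', 'ballpoint pens' );                # the 70 classes

        my $g_in = join ',', ('?') x @groups;
        my $c_in = join ',', ('?') x @classes;

        # One aggregate query instead of 350 intersection-and-sum queries.
        my $sth = $dbh->prepare(qq{
            SELECT cg.grp, w.class, SUM(c.size_factor)
            FROM   company        c
            JOIN   company_group  cg ON cg.company = c.company
            JOIN   company_widget cw ON cw.company = c.company
            JOIN   widget         w  ON w.widget   = cw.widget
            WHERE  cg.grp  IN ($g_in)
              AND  w.class IN ($c_in)
            GROUP  BY cg.grp, w.class
        });
        $sth->execute( @groups, @classes );

        # Zero-fill the crosstab so empty cells still appear in the report.
        my %cell;
        for my $g (@groups) {
            $cell{$g}{$_} = 0 for @classes;
        }
        while ( my ( $g, $class, $sum ) = $sth->fetchrow_array ) {
            $cell{$g}{$class} = $sum;
        }
        # %cell is now ready to hand to Template Toolkit to emit the HTML table.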
      Wonderfully rich reply - thanks.

      But I think there may be some miscommunication.

      To put it into similar terms:

      • I have companies (ex: ABC Inc, XYZ Inc, ...) and I have product classes (ex: pens, liquid nitrogen, desk chairs, fork lifts, plywood, Fourier-transform infrared spectrometers, ...)

      • yes, one of the attributes of a company is its size_factor

      • a product_class has no attributes

      • Someone has canvassed ~30,000 companies and asked them:
        • do you belong to industry group X (ex: more than X employees)?
        • do you belong to industry group Y (ex: located in Ontario, California)?
        • do you belong to industry group Z (ex: less than 5 years in business)?
        • do you use product class A?
        • do you use product class B?
        • do you use product class C?
        • ...and so on, for ~12,000 product classes.

      So, all the info could have been put into one big 30,000 row x 12,000 column table, and then just query

      SELECT company FROM big_table WHERE (group_a = 1 AND product_class_C = 1);

      But I still have to repeat that query for 350 combinations of group and product_class.

      Originally, there was only going to be a requirement to produce a half-dozen cross-references at a time. So came the idea of the reverse-index tables - easy and quick to identify the ten tables needed, find the intersection of these little <= 10 KB tables that are often only a few hundred rows long, and sum the size factors.

      But THEN someone said "Hey, that's great - and fast - here, do that for this report of 350 intersections."

      Update: I appear to have at least partially answered my own question - even if it was done the other way, there'd still be 350 queries to run, except each one on a full 30k-row x 12k column table!

        This sounds like a bit of heavy lifting for a database server -- but not an impossibility. What I'm not getting is how often you have to regenerate this big table. Every time a web user wants to look at it? Every time something in the underlying database changes?

        I would imagine you could generate and then cache the 'latest version' of the table at the end of every day, assuming new data is added throughout the day.

        Alex / talexb / Toronto

        "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

        So, all the info could have been put into one big 30,000 row x 12,000 column table ...
        I don't think you want a 12,000 column table. Just use an 'intersection' table which has just two columns, one for the company and the other for the question they answered in the affirmative. There would not be any rows for questions they answered 'no' to.

        Your example query can be performed using a self-join like this:

        SELECT a.company FROM CA AS a, CA AS b WHERE a.company = b.company AND a.answer = 'group_a' AND b.answer = 'product_class_c';
        (It should be obvious that indexes on the 'company' and 'answer' columns would be very helpful.)

        Note that this method allows you to add more questions without having to change the schema of the table.
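        To sketch that end-to-end (purely illustrative: the separate companies(company, size_factor) table and the answer strings used below are assumptions), the whole 350-cell report could then come from a single aggregate query instead of 350 self-joins:

            use strict;
            use warnings;
            use DBI;

            my $dbh = DBI->connect( 'dbi:mysql:database=mydb', 'user', 'pass',
                                    { RaiseError => 1 } );

            my @groups  = map { "group_$_" } 'a' .. 'e';         # the 5 groups wanted
            my @classes = map { "product_class_$_" } 1 .. 70;    # the 70 classes wanted

            my $g_in = join ',', ('?') x @groups;
            my $c_in = join ',', ('?') x @classes;

            my $sth = $dbh->prepare(qq{
                SELECT g.answer, p.answer, SUM(c.size_factor)
                FROM   CA g
                JOIN   CA p        ON p.company = g.company
                JOIN   companies c ON c.company = g.company
                WHERE  g.answer IN ($g_in)
                  AND  p.answer IN ($c_in)
                GROUP  BY g.answer, p.answer
            });
            $sth->execute( @groups, @classes );
            while ( my ( $grp, $class, $sum ) = $sth->fetchrow_array ) {
                print "$grp\t$class\t$sum\n";    # or stash in a hash for the crosstab
            }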

Re: Feasibility of quick generation of complex report from large db?
by BrowserUk (Patriarch) on Feb 07, 2008 at 01:39 UTC

    A couple of thoughts.

    1. Your entire 30,000 companies by 12,000 product classes and 500 groups can be stored in a flat file in 45 MB.

      This can be slurped into memory or scanned record by record. Either way, the results of your 350 queries can be determined in ~100 seconds.

      The data is stored 1-bit per yes/no answer. 12,500/8 * 30,000 == 44.7 MB.

      The code to generate a test DB, generate 350 test queries, and run them, is 50 lines. (No formatting of the report is done.) A rough sketch of the bit-packing idea appears after this list.

    2. If this is survey data, it is static once the survey is finished. So why does it matter how long it takes to generate the report?

      You only need run the queries once and store the results for reuse each time the web page is needed. Or better still, generate and store the html.

      If you want the viewer to be able to sub-select the results produced for them, then generate a separate html snippet for each of the 350 categories and merge the required bits once you know what they want.
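    For illustration only (this is not the 50-line test code mentioned above; the question numbering and record layout are invented), the bit-per-answer packing and one intersection-and-sum scan might look roughly like this using vec:

        use strict;
        use warnings;

        # Assumed layout: questions numbered 0 .. 12_499 (groups first,
        # then product classes); one fixed-length record per company,
        # 1 bit per yes/no answer.
        my $QUESTIONS = 12_500;
        my $REC_LEN   = int( ( $QUESTIONS + 7 ) / 8 );    # 1563 bytes per company

        # Pack one company's 'yes' answers (a list of question numbers).
        sub pack_company {
            my @yes = @_;
            my $rec = "\0" x $REC_LEN;
            vec( $rec, $_, 1 ) = 1 for @yes;
            return $rec;
        }

        # Sum the size factors of companies that said yes to both questions.
        # $size_factor is an array ref indexed by record position in the file.
        sub intersect_sum {
            my ( $file, $q_group, $q_class, $size_factor ) = @_;
            open my $fh, '<:raw', $file or die "open $file: $!";
            my ( $i, $sum ) = ( 0, 0 );
            while ( read( $fh, my $rec, $REC_LEN ) == $REC_LEN ) {
                $sum += $size_factor->[$i]
                    if vec( $rec, $q_group, 1 ) && vec( $rec, $q_class, 1 );
                $i++;
            }
            return $sum;
        }

    Run intersect_sum once per cell (350 calls) over the same 45 MB file, or slurp the file into memory once and reuse it.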


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Feasibility of quick generation of complex report from large db?
by punch_card_don (Curate) on Feb 08, 2008 at 01:46 UTC
    Appreciate all the input. I'm thinking about this project with this page open.....