in reply to SQL database architecture

If you had any sort of design that put a single answer per row in some table, the nature of the questions being asked would require that the row also identify the particular student and the particular exam for the given answer. That table would have 500 million rows (20,000 students * 500 exams * 50 answers per exam). A single flat table of this sort would handle the application you describe, but I'd be uncertain about the response time for querying a table of this size.

On the other hand, your existing flat file has 10 million rows (20,000 students * 500 exams) -- a lot less than 500 million, but then each row has 48 additional columns.

So if some sort of flat table seems appropriate (and frankly, it probably is, given your description), then the choice of flat table design would hinge on relative performance given 50 times more rows vs. 48 additional columns per row.

If it were me, the first thing I'd try would be fewer rows / more columns. In this case, your initial guess about the kind of query operation was a bit off. No doubt you would figure this out yourself as soon as you set to work on it in earnest, but I think the queries would be more like this:

    -- count people who said "a" on exam 1, question 20 and
    -- "c" on exam 43, question 37:
    select count(*) from the_table
     where exam_id = 1 and answer20 = 'a'
       and student_id in
           (select student_id from the_table
             where exam_id = 43 and answer37 = 'c')
Of course, as soon as you need to intersect on three or more different questions, each from a different exam, you'll probably want to try using a perl hash with a set of simple queries: each query just selects student_ids where "exam_id=? and answer_X=?", and you increment hash elements keyed by student_id after each query. Then pick out the hash keys whose value equals the number of queries performed.
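A minimal sketch of that counting approach, assuming a DBI handle and the wide-table layout above (the connection string and the condition list are invented for illustration):

    use strict;
    use warnings;
    use DBI;

    # Hypothetical connection; adjust the DSN for your database.
    my $dbh = DBI->connect('dbi:SQLite:dbname=answers.db', '', '',
                           { RaiseError => 1 });

    # Each condition: [exam_id, question number, expected answer].
    my @conditions = ([1, 20, 'a'], [43, 37, 'c'], [12, 5, 'b']);

    my %hits;
    for my $cond (@conditions) {
        my ($exam, $q, $ans) = @$cond;
        # Column names can't be bound as placeholders, so the (numeric)
        # question number is interpolated into the column name.
        my $sth = $dbh->prepare(
            "select student_id from the_table
              where exam_id = ? and answer$q = ?");
        $sth->execute($exam, $ans);
        while (my ($student) = $sth->fetchrow_array) {
            $hits{$student}++;
        }
    }

    # A student who matched every condition was counted once per query.
    my @matched = grep { $hits{$_} == @conditions } keys %hits;
    print scalar(@matched), " students matched all conditions\n";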

It would probably be worthwhile to test the two flat-table alternatives (4 cols * lots of rows vs. 52 cols * fewer rows) on some suitably large sample of data, to see which one is easier/faster/more flexible.
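If it helps, here is one rough shape such a test could take -- a sketch only, assuming both candidate tables are already loaded (all names here are illustrative):

    use strict;
    use warnings;
    use DBI;
    use Time::HiRes qw(gettimeofday tv_interval);

    my $dbh = DBI->connect('dbi:SQLite:dbname=sample.db', '', '',
                           { RaiseError => 1 });

    # Time one representative query against each layout.
    for my $trial (
        [ wide   => "select count(*) from wide_table
                      where exam_id = ? and answer20 = ?" ],
        [ narrow => "select count(*) from long_table
                      where exam_id = ? and question_id = 20 and answer = ?" ],
    ) {
        my ($label, $sql) = @$trial;
        my $t0  = [gettimeofday];
        my ($n) = $dbh->selectrow_array($sql, undef, 1, 'a');
        printf "%-6s  count=%d  %.3fs\n", $label, $n, tv_interval($t0);
    }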

Re: Re: SQL database architecture
by etcshadow (Priest) on Nov 04, 2003 at 04:20 UTC
    Hmmm... as far as jamming all of the questions into the exam table, I'd argue that you're falling into an all-too-common trap: you're prematurely optimizing. Worse, you're trading potentially useful flexibility and definite ease-of-use for performance... and before you've seen whether or not this delta in performance is important.

    Additionally, I wouldn't guarantee that there is a significant performance gain in what you describe (squeezing what *should* be an extension table into columns of a single table). Why do I dispute the claim of increased performance? Well, I'll break it down into the two main components of database performance: CPU time and I/O (either disk or buffer).

    CPU: you're adding the same number of values either way... whether it's one value from each of 50 times as many narrow rows, or one of the 50 answer columns from 1/50th as many wide rows... it's the same amount of time spent summarizing whatever data you are querying.

    I/O: it's important to understand how databases organize data on disks (and bear in mind that I/O buffers in memory usually mirror the disk structure... just faster, so keep in mind that what I say about disks is pretty much applicable to ram buffers, too). They do so by sticking rows of the table, packed one against another, onto disk blocks. The "wider" the row (that is, the more bytes it takes to store all of the row-data) the fewer rows can fit on any one disk block. The limiting factor in terms of I/O (again, whether from disk or buffer) is the number of blocks that have to be read. If the table is 50 times wider, and 1/50th the number of rows, it will occupy roughly the same number of disk blocks, and therefore, it will take just as long to read in the whole table.
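    To put rough numbers on that (illustrative, not measured): with 8 KB blocks, a narrow row of ~16 bytes packs about 500 rows per block, so 500 million rows occupy on the order of a million blocks; if the wide row really is ~50 times as wide (~800 bytes), about 10 fit per block, and 10 million rows again occupy on the order of a million blocks. A full scan reads roughly the same number of blocks either way (per-row overhead and the keys repeated in every narrow row shift the totals a bit, but not by orders of magnitude).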

    Of course, if the query is highly directed and can make use of indexes, then, again, there's no performance difference.


    ------------
    :Wq
    Not an editor command: Wq
      Yes, these are very good points -- it's the same quantity of data no matter how you slice it up. I just said that the wider table would be the first thing I'd try -- and I did say I'd want to test the other way as well. 8^)

      Since the wide table is just like the original flat file, there's a slight bit less work setting up and loading the DB, but that's not necessarily a good reason to pick it. Meanwhile, with the narrower, longer table, the example queries would be of the form:

      select count(*) from long_table
       where exam_id = ? and question_id = ? and answer = ?
         and student_id in
             (select student_id from long_table
               where exam_id = ? and question_id = ? and answer = ?)
      (with the parameters for the sub-query being different from those for the main count(*) query); and for more complicated conditions, doing queries for single "exam_id/question_id/answer" tuples and tracking the intersection through a perl hash.

      Looking at it this way, I have to agree with you that the long/narrow table will likely end up being better overall, because a single query with three parameterized conditions ("where exam_id=? and question_id=? and answer=?") will do the work of 50 different queries that would all have to be constructed and prepared separately on a wide table:

      "where exam_id=? and answer_$x=?" for $x (1..50);
      And of course, there are bound to be flexibility issues that the OP didn't mention (or hasn't realized), and the narrow table will make those easier as well.

      update: As etcshadow surmised below, I am not especially adept at doing SQL joins; in the absence of proper training in RDB methods, I tend to fall back on habits of thought I've learned in other (non-RDB) areas. Up to now my limited skill with joins applied merely to combining distinct data from related tables based on a shared key, not trying to combine "disjoint" conditions on a single flat table, as shown in tilly's reply below; I'm very grateful and much better off for having seen that -- thanks! (And apologies for the off-beat advice above.)

        I would write those queries as:
        select count(*)
          from answer_table a_1, answer_table a_2
         where a_1.student_id = a_2.student_id
           and a_1.exam_id = ? and a_1.question_id = ? and a_1.answer = ?
           and a_2.exam_id = ? and a_2.question_id = ? and a_2.answer = ?
        If you have an index on the exam and question columns of the table, the database should use it to fetch back just the answers to that question of that exam (the rest of the table won't be looked at), and will then do the join very efficiently. This is virtually instantaneous.

        If you don't have any indexes, then you will probably wind up scanning the full table twice. Which is not good, but if you are using it for bulk access, it is not horrible either...
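        For concreteness, a sketch of that setup (names and connection are invented; index syntax may vary by database):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=answers.db', '', '',
                               { RaiseError => 1 });

        # Composite index on the columns the join filters by, so the
        # database can fetch just the matching answers instead of
        # scanning the whole table.
        $dbh->do("create index answer_by_question
                      on answer_table (exam_id, question_id)");

        # The self-join above, run with bound parameters:
        my ($count) = $dbh->selectrow_array(
            "select count(*)
               from answer_table a_1, answer_table a_2
              where a_1.student_id = a_2.student_id
                and a_1.exam_id = ? and a_1.question_id = ?
                and a_1.answer = ?
                and a_2.exam_id = ? and a_2.question_id = ?
                and a_2.answer = ?",
            undef, 1, 20, 'a', 43, 37, 'c');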

        In short, live by the saying, "The key, the whole key, and nothing but the key, so help me Codd." (Until you actually observe, not theorize, the existence of performance problems.)

        That's cool.

        BTW (and I apologize if this comes off as condescending... that's not how it's intended), are you familiar with "joins" in relational databases? It's very much at the heart of what makes a relational database "relational". I just ask because a few of the things you said in this thread make it seem like you're not familiar with using joins. All of the "use a hash table in perl" and "where other_table_id in (select id from other_table where ...)" make me wonder.

        I ask because, if you're trying to force the issue too much and avoid using joins where they are appropriate, you could be hamstringing your database a little bit. Most databases have something along the lines of an "execution path optimizer": you ask the database for information, and the database figures out the best method for retrieving it. If you describe the data that you want using join syntax, the database has more freedom to optimize your query. Also, by using joins to pull all the info you need in a single query (rather than running multiple queries and joining in perl with hash tables), you can reduce network round-trips to your database.

        Of course, there are always going to be some good excuses to do your joins in perl rather than on the DB... but the default approach should really be with joins.


        ------------
        :Wq
        Not an editor command: Wq