in reply to Re: SQL database architecture
in thread SQL database architecture

Hmmm... as far as jamming all of the questions into the exam table, I'd argue that you're falling into an all-too-common trap: premature optimization. Worse, you're trading potentially useful flexibility and definite ease-of-use for performance... and before you've even measured whether that performance delta matters.

Additionally, I wouldn't guarantee that there is a significant performance gain in what you describe (squeezing what *should* be a separate extension table into columns of the main table). Why do I dispute the claims of increased performance? Well, I'll break it down into the two main components of database performance: CPU time and I/O (either disk or buffer).

CPU: you're adding the same number of values either way... whether it's one value per row across fifty times as many rows, or fifty values per row across one-fiftieth as many rows, it's the same total number of values, and the same amount of time spent summarizing whatever data you are querying.
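To make that arithmetic concrete, here's a hypothetical sketch (table and column names are made up) of what "the same number of values" looks like under each design:

    -- Wide design: one row per student per exam, 50 answer columns.
    --   1,000 students => 1,000 rows x 50 values = 50,000 values.
    INSERT INTO exam_wide (student_id, exam_id, answer_1, answer_2 /* ... answer_50 */)
    VALUES (?, ?, ?, ? /* ... */);

    -- Narrow design: one row per student per question.
    --   1,000 students x 50 questions => 50,000 rows x 1 value = 50,000 values.
    INSERT INTO exam_long (student_id, exam_id, question_id, answer)
    VALUES (?, ?, ?, ?);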

I/O: it's important to understand how databases organize data on disk (and bear in mind that I/O buffers in memory usually mirror the disk structure... just faster, so what I say about disk applies pretty much to RAM buffers, too). They store the rows of a table packed one against another in disk blocks. The "wider" the row (that is, the more bytes it takes to store all of the row's data), the fewer rows fit in any one disk block. The limiting factor for I/O (again, whether from disk or buffer) is the number of blocks that have to be read. If the table is 50 times wider but has 1/50th the number of rows, it occupies roughly the same number of disk blocks, and therefore takes just as long to read in full.
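A back-of-envelope illustration (all numbers are made up, and real databases add per-row overhead, so treat this as a sketch, not a measurement):

    -- Assume an 8,192-byte block.
    -- Wide row: ~500 bytes (keys + 50 answers)
    --   => ~16 rows per block => 1,000 rows  => ~63 blocks.
    -- Narrow row: ~10 bytes (keys + 1 answer)
    --   => ~819 rows per block => 50,000 rows => ~62 blocks.
    -- A full scan reads roughly the same number of blocks either way.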

Of course, if the query is highly directed and can make use of indexes, then, again, there's no performance difference.


------------
:Wq
Not an editor command: Wq

Re^3: SQL database architecture
by graff (Chancellor) on Nov 04, 2003 at 05:37 UTC
    Yes, these are very good points -- it's the same quantity of data no matter how you slice it up. I just said that the wider table would be the first thing I'd try -- and I did say I'd want to test the other way as well. 8^)

    Since the wide table is just like the original flat file, there's slightly less work in setting up and loading the DB, but that's not necessarily a good reason to pick it. Meanwhile, with the narrower, longer table, the example queries would be of the form:

    select count(*)
      from long_table
     where exam_id=? and question_id=? and answer=?
       and student_id in (select student_id
                            from long_table
                           where exam_id=? and question_id=? and answer=?)
    (with the parameters for the sub-query being different from those for the main count() query); and for more complicated conditions, running queries for single "exam_id/question_id/answer" tuples and tracking the intersection in a Perl hash.

    Looking at it this way, I have to agree with you that the long/narrow table will likely end up being better overall, because a single query with three parameterized conditions ("where exam_id=? and question_id=? and answer=?") will do the work of 50 different queries that would all have to be constructed and prepared separately on a wide table:

    "where exam_id=? and answer_$x=?" for $x (1..50);
    And of course, there are bound to be flexibility issues that the OP didn't mention (or hasn't realized), and the narrow table will make those easier as well.

    update: As etcshadow surmised below, I am not especially adept at doing SQL joins; in the absence of proper training in RDB methods, I tend to fall back on habits of thought I've learned in other (non-RDB) areas. Up to now my limited skill with joins applied merely to combining distinct data from related tables based on a shared key, not trying to combine "disjoint" conditions on a single flat table, as shown in tilly's reply below; I'm very grateful and much better off for having seen that -- thanks! (And apologies for the off-beat advice above.)

      I would write those queries as:
      select count(*)
        from answer_table a_1, answer_table a_2
       where a_1.student_id = a_2.student_id
         and a_1.exam_id = ? and a_1.question_id = ? and a_1.answer = ?
         and a_2.exam_id = ? and a_2.question_id = ? and a_2.answer = ?
      If you have an index on the exam_id and question_id columns, the database should use it to fetch back just the answers to that question of that exam (the rest of the table won't be looked at), and will then join the two sets very efficiently. This is virtually instantaneous.
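      For instance (a hedged sketch; the index name and exact DDL are assumptions, and syntax varies slightly by database), an index like this would let each side of the self-join be satisfied by an index lookup:

      CREATE INDEX idx_answer_exam_question
          ON answer_table (exam_id, question_id, answer);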

      If you don't have any indexes, then you will probably wind up scanning the full table twice, which is not good; but if you are using it for bulk access, it's not horrible either...

      In short, live by the saying: "The key, the whole key, and nothing but the key, so help me Codd." (Until you actually observe, not theorize, the existence of performance problems.)
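      As a sketch of what that slogan means for this table (the column types are assumptions), every non-key column depends on the complete composite key:

      CREATE TABLE answer_table (
          student_id  INTEGER NOT NULL,
          exam_id     INTEGER NOT NULL,
          question_id INTEGER NOT NULL,
          answer      VARCHAR(8),  -- depends on the whole (student, exam, question) key
          PRIMARY KEY (student_id, exam_id, question_id)
      );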

      That's cool.

      BTW (and I apologize if this comes off as condescending... that's not how it's intended), are you familiar with "joins" in relational databases? Joins are very much at the heart of what makes a relational database "relational". I just ask because a few of the things you said in this thread make it seem like you're not familiar with using them. All of the "use a hash table in perl" and "where other_table_id in (select id from other_table where ...)" make me wonder.

      I ask because, if you force the issue too much and avoid joins where they are appropriate, you could be hamstringing your database a bit. Most databases have what's known as a query optimizer (sometimes called an execution-path or execution-plan optimizer): you ask the database for information, and it figures out the best method for retrieving that data. If you describe the data you want using join syntax, the database has more freedom to optimize your query. Also, by using joins to pull all the info you need in a single query (rather than running multiple queries and joining in Perl with hash tables), you reduce the network round-trips to your database.
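      If you want to see what the optimizer actually chooses, most databases let you ask; for example, in MySQL or PostgreSQL you can prefix a query with EXPLAIN (a sketch with made-up parameter values; the output format differs across databases):

      EXPLAIN
      select count(*)
        from answer_table a_1, answer_table a_2
       where a_1.student_id = a_2.student_id
         and a_1.exam_id = 1 and a_1.question_id = 7 and a_1.answer = 'B'
         and a_2.exam_id = 1 and a_2.question_id = 9 and a_2.answer = 'C';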

      Of course, there are always going to be some good excuses to do your joins in perl rather than on the DB... but the default approach should really be with joins.


      ------------
      :Wq
      Not an editor command: Wq