Once again, a fairly off-topic post. Please ignore if this offends.
I'm working on a redesign of a Web site that was put together rather haphazardly. The old site worked primarily because the person who wrote it knew the database quirks and could code around them (usually). I'm trying to streamline everything, use Template Toolkit, and create a nice, OO interface to the database -- and am repeatedly running into my DBA, who claims that I'm being too strict regarding normalization.
Example: we have a products table with 270 products. These products fall into about 10 categories, and these categories have been repeated in the products table rather than moved into their own table. This issue is common throughout the database. The DBA claims that the tables are so small that it's not a serious issue, and that by not normalizing the database we can eliminate an extra join and gain some performance.
Huh?
The "performance gain" is lost as my Perl code has "be aware of" these issues (yes, we have several instances of this). If I'm going to update anything, I've need to make sure that I don't "add" a category by misspelling it, for example. Furthermore, one article states:
As with many formal rules and specifications, real world scenarios do not always allow for perfect compliance. In general, normalization requires additional tables and some customers find this cumbersome. If you decide to violate one of the first three rules of normalization, make sure that your application anticipates any problems that could occur, such as redundant data and inconsistent dependencies.
Unfortunately, it doesn't state under what circumstances one might realistically decide not to normalize. Am I following the rules too blindly? In one instance, the DBA stated that the continents can be in the country table because the continents are never going to change. We've argued these issues repeatedly, but I need to step back and get second opinions. This is for a very important client, but it's a fairly small database.
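To make it concrete, here's roughly the layout I'm arguing for. The table and column names are invented for the example, and I'm using DBD::SQLite in memory only so the sketch stands on its own:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Illustration only: an in-memory SQLite database keeps this self-contained.
    my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                           { RaiseError => 1 });
    $dbh->do('PRAGMA foreign_keys = ON');    # SQLite needs this to enforce FKs

    # Categories live in their own table...
    $dbh->do(q{
        CREATE TABLE categories (
            category_id  INTEGER PRIMARY KEY,
            name         TEXT NOT NULL UNIQUE
        )
    });

    # ...and products refer to them by key, so a misspelled category is a
    # constraint violation instead of a silent new "category".
    $dbh->do(q{
        CREATE TABLE products (
            product_id   INTEGER PRIMARY KEY,
            name         TEXT NOT NULL,
            category_id  INTEGER NOT NULL
                         REFERENCES categories(category_id)
        )
    });

The join back to categories is one indexed lookup; that is the "extra join" in question.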
Cheers,
Ovid
Vote for paco!
Join the Perlmonks Setiathome Group or just click on the link and check out our stats.
Re (tilly) 1: Database Design Issues - OT
by tilly (Archbishop) on Jul 13, 2001 at 04:44 UTC
But it sounds like it.

First, a word about optimization. Don't! Now an explanation of that statement. Believe Knuth when he says that premature optimization is the root of all evil. For the full explanation, read what Code Complete has to say on the topic. Your DBA is definitely making the mistake of trying to optimize as he goes. If you are going to avoid using your database as a relational database, then you are better off using flat files. OK, for maximum read-only performance you are probably going to be best off using dbm files. Note that this is not a recommendation. (Much as I like and respect what Sleepycat has accomplished.)

Instead, stop thinking in fancy words like "normalization" and think about this as a programming problem. All of the important reasons why normalization is good have to do with the basic reasons why you wouldn't want to duplicate information in any other context. Normalizing a database is another aspect of creating a well-factored program. OK, so your basic tools for modularizing what goes on are slightly different, but if you are trying to put a modular front end on it (which it sounds like you are), then the need to modularize the database is just an extension of the need to modularize your whole system.

Now, what people who first face a well-normalized database see is similar to what they see when they face a well-factored program. There is a large amount of structure, and you need to constantly work within the implied rules. When you just want to get something done, the normalization is rarely going to make your life easier. However, when you are trying to build a complex system, you will constantly find the normalization making it easier to "chunk" up to higher levels. And when you need to change something important, it is going to be much easier for you to make that change, and you will be able to make it with far less fear that you will spend the next 6 months tracking down oopses from other things that needed to be changed. (That doesn't mean you won't spend 6 months doing that; you will. But it will happen less often, and you will be free to consider far more fundamental changes in your application when the need arises.)

And if your modularization does cost you too much performance (unlikely if you have arranged to have a decent database and you are joining on keys, and joining against views where that makes sense), well, then you optimize after the fact. Just like you would for any other programming project. After all, you don't generally know ahead of time whether you will need to optimize, and if you do, the odds are very low that you would have guessed correctly where. But you definitely know what the costs of not modularizing up front are, and the losses in terms of reliability, development speed, etc. are both significant and readily predictable.

But databases are somewhat different. What is different about databases is that in the real world you tend to run into the need to reconsider performance in a database more often than you would with many other kinds of software. Any kind of modularization tends to work by introducing abstraction layers, and this typically creates run-time work. As a link that merlyn pointed me at says, late binding is a good idea. However, it costs performance. And frequently, when you are using a database, you are dealing with very large amounts of data. Therefore performance is a real issue.
When that happens, you either need to train your users to believe that the performance is normal, or you need to improve performance. Or both. But with a decent database engine and a well-normalized database with indexes for all of your important joins, it is unlikely to be an issue until you have hundreds of thousands or millions of records. For instance, I have a friend who works tracking consumer data. His databases tend to be in the terabytes, and they are that small only because he can't (yet) afford more disk space. (Ever thought about tracking all of the details on every credit-card purchase in the US? He would like to...) Needless to say, he frequently is forced to denormalize his data for performance reasons. However, if he could normalize and have it run acceptably well, he would. In a flash. And the parts that he can normalize, he does.

But when the time comes to improve performance, your DBA's thoughts on what is fastest are likely to be wrong. As Steve McConnell said, you never know what is going to work out faster. A real-life example: I recently decided to try to speed up a view I use. It selects the most recent date I can use by doing a join of three tables, looking for the most recent date it can find in all 3. So I figured that it had to be faster to replace it with a table holding one value. I mean, it is one answer, and most queries against the view only want to pull out data from one of the tables. It must be faster to join 2 tables (one of them with only one row) than to do the more complex join on all 3 before joining back to one of them. Dead obvious, right? Wrong! What apparently is happening is that every query I do against said view goes against one or more of the 3 tables in the view. But Sybase is smart enough to keep a cache of partial results it needs in computing the view, and smart enough in my queries to join the result of the view against the partial tables and not the full original. So my queries were faster with the complex 3-table join, because there is a non-obvious cache that things get to run against! (The production problem that caused the issue I was concerned about was solved with a memory upgrade that we needed for other reasons.)

Now a final note. While I am clearly fond of normalizing databases, it is not an unmitigated good. This is an important point which is far too often forgotten. We so want to give aspiring programmers the advice that is usually good that we fail to tell them that none of this advice always holds. This point is one which I think I made best in "This is not a flame". It is unfortunate that it was made in the middle of a heated discussion, and my apologies to princepawn for bringing it up again. But I think the points that I was trying to get princepawn to recognize are generally important, and relevant in particular to this thread.

But back to the case at hand. I do think that normalization of information is a good thing to do in your situation. It sounds like you may have to work to get your DBA to understand this fact. While doing that, try to keep in mind that your DBA is probably not used to thinking like a good programmer does. What he sees is always the immediate up-front cost of normalizing. What he misses is the persistent downside from not normalizing that will kill your productivity in the end. Try to make him see and understand that...
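Roughly the shape of the view I was describing, with the table names changed and the details stripped out (a sketch, not the real thing):

    # Assumes a connected DBI handle in $dbh.  The view picks the most recent
    # load date that appears in all three feed tables.
    $dbh->do(q{
        CREATE VIEW usable_date AS
        SELECT MAX(a.load_date) AS load_date
        FROM   feed_a a
        JOIN   feed_b b ON b.load_date = a.load_date
        JOIN   feed_c c ON c.load_date = a.load_date
    });
    # The "obvious" optimization was to replace this with a one-row table
    # maintained by the loader -- and that turned out to be slower.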
Re: Database Design Issues - OT
by footpad (Abbot) on Jul 13, 2001 at 03:41 UTC
While your DBA has an, um--er, *cough*, interesting point, I personally think you've already hit upon the main reasons for doing it correctly. I don't think you even know if it would save any appreciable amount of time or if there's a problem that needs to be solved. Would it not be wiser to do it right first and then see if there's a problem? His idea is already causing more work, jacking up the expense, and risking data integrity before anyone enters a single record.

If it turns out the second join does cause a problem, there are other approaches that don't risk the data integrity as severely. For example, what if you loaded your category lookup into an array and then printed that value when printing the query results? A tiny performance hit, perhaps...but nowhere near as risky as crippling your data integrity from square one.

In my experience with other databases, I've found that "performance improvements" are applied far earlier than they should be. As an excuse, it frequently gets used to mask sloppy programming or false laziness. (I've also found that very few people understand normalization and why it's important, but that's another rant.) Yes, there *are* times to break Codd's rules--but not many. And the typical reasons for doing so aren't appropriate. IMHO.

Remember the first rule of The Pragmatic Programmer: "Don't Repeat Yourself."

--f
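A rough sketch of that lookup idea -- a hash rather than an array, assuming DBI and the invented categories/products table names used elsewhere in this thread:

    # Assumes a connected DBI handle in $dbh.  Fetch the lookup once...
    my %category_name = map { @$_ } @{
        $dbh->selectall_arrayref('SELECT category_id, name FROM categories')
    };

    # ...then resolve ids in Perl while printing, instead of joining in SQL.
    my $sth = $dbh->prepare('SELECT name, category_id FROM products');
    $sth->execute;
    while ( my ($product, $cat_id) = $sth->fetchrow_array ) {
        printf "%-30s %s\n", $product, $category_name{$cat_id};
    }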
(jeffa) Re: Database Design Issues - OT
by jeffa (Bishop) on Jul 13, 2001 at 02:27 UTC
If the database will remain small, then choosing not to normalize for a performance gain smells like false optimization to me. When i worked at a very large IT center for a nameless accounting firm, i noticed that NONE of their tables were normalized and none had any referential integrity. They claimed it sped up searches - i claimed it slowed down code production (hey - they also claimed finding child nodes in their org chart was best accomplished by using the LIKE operator *shudder*).

And as for the continents . . . well, what if you want a list of available continents? You really should put them in their own table (SELECT DISTINCT(CONTINENT) FROM COUNTRY is just plain silly). But i am a big fan of normalization; other opinions may differ.

Jeff
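i.e. something along these lines (column names invented, DBI assumed):

    # Assumes a connected DBI handle in $dbh.
    $dbh->do(q{
        CREATE TABLE continent (
            continent_id  INTEGER PRIMARY KEY,
            name          TEXT NOT NULL UNIQUE
        )
    });
    $dbh->do(q{
        CREATE TABLE country (
            country_id    INTEGER PRIMARY KEY,
            name          TEXT NOT NULL,
            continent_id  INTEGER NOT NULL REFERENCES continent(continent_id)
        )
    });

    # "What continents do we have?" is now a straight query, no DISTINCT hack:
    my $continents = $dbh->selectcol_arrayref(
        'SELECT name FROM continent ORDER BY name'
    );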
Re: Database Design Issues - OT
by Masem (Monsignor) on Jul 13, 2001 at 02:31 UTC
Going along with this, I'd create a categoryID key, put the categories into their own table, and have the products carry the categoryID. Doing this now while the database is small will save headaches later when the database is large. Also, you'll have a much easier time expanding the system later.
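A sketch of the one-time conversion, assuming the current table is called products with a free-text category column (real names will differ, and you'd wrap this in a transaction with a backup in hand):

    # Assumes a connected DBI handle in $dbh.
    $dbh->do(q{
        CREATE TABLE categories (
            category_id  INTEGER PRIMARY KEY,
            name         TEXT NOT NULL UNIQUE
        )
    });

    # Seed the lookup table from the values already in use.
    $dbh->do('INSERT INTO categories (name)
              SELECT DISTINCT category FROM products');

    # Point products at the new table...
    $dbh->do('ALTER TABLE products ADD COLUMN category_id INTEGER');
    $dbh->do(q{
        UPDATE products
        SET category_id = (SELECT category_id FROM categories
                           WHERE categories.name = products.category)
    });
    # ...then drop the old text column with whatever ALTER TABLE support
    # your RDBMS offers.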
Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
Re: Database Design Issues - OT
by cforde (Monk) on Jul 13, 2001 at 03:28 UTC
As you might expect, opinions vary on 3NF's benefits. Typically, Data Administrators like it and Database Administrators don't. The reason for this is that Data Administrators want to model the "world" in a way that exposes all the underlying relationships in the data and don't care about performance (that's an implementation issue, not a model issue). The Database Administrator cares about presenting the data in the most "useful" way while giving acceptable performance (how the data will be used affects this greatly). Clearly these two perspectives are in conflict.

My suggestion is to start with 3NF and denormalize as necessary for performance. If this application grows, other tables will inevitably need to be created, and the closer to 3NF the original model is, the easier the integration of those new tables will be.

Having said that, I don't think it is necessary to normalize things like addresses. E.g. name, city, state and zip could be split into multiple tables, because the zip refers to the city and a city can have multiple zips. For those who don't know, 3NF can be summarized as: the content of a row is about the key, the whole key and nothing but the key (so help me Codd). The "key", of course, is the primary key. Notice how that is violated in the address example.

disclosure: I'm a former DB2/VM DBA & system programmer.
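For the record, "by the book" normalization of that example would look something like this (invented names, using the simplifying assumption above that a zip implies a city):

    # Assumes a connected DBI handle in $dbh.  City and state hang off the
    # zip code, not off the person -- nothing but the key.
    $dbh->do(q{
        CREATE TABLE zip_codes (
            zip    TEXT PRIMARY KEY,
            city   TEXT NOT NULL,
            state  TEXT NOT NULL
        )
    });
    $dbh->do(q{
        CREATE TABLE people (
            person_id  INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            zip        TEXT NOT NULL REFERENCES zip_codes(zip)
        )
    });

Whether that extra table earns its keep is exactly the judgment call described above.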
Have fun,
Re: Database Design Issues - OT
by voyager (Friar) on Jul 13, 2001 at 02:33 UTC
If you have 270 products (rows?) and 10 categories, performance is not an issue, so a separate table should be fine. But with a separate category table, you will have to build maintenance panels, etc. for each additional table, and I've seen instances where the DBA doesn't want to continually add lookup/validation tables. What we do is have one generic lookup table with an extra column naming the field being validated, so a single table can serve any number of fields.

Another solution I have used is to have two entry fields: one text, one select. The select is populated with a "select distinct col from table", and the first option is "Existing values". The CGI code uses the value from the select if it has been changed, or from the text field otherwise. At least this way the user can look at/choose existing values without worrying about typos.
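The shape of that single lookup table, roughly (names invented, DBI assumed):

    # Assumes a connected DBI handle in $dbh.  One table validates any
    # number of fields; the first column says which field a row belongs to.
    $dbh->do(q{
        CREATE TABLE lookups (
            field_name  TEXT NOT NULL,      -- e.g. 'category', 'continent'
            value       TEXT NOT NULL,
            PRIMARY KEY (field_name, value)
        )
    });

    # Populate a select box (or validate a submitted value) for one field:
    my $categories = $dbh->selectcol_arrayref(
        'SELECT value FROM lookups WHERE field_name = ? ORDER BY value',
        undef, 'category',
    );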
by Abigail (Deacon) on Jul 13, 2001 at 16:11 UTC
It's not just the amount of data that determines performance; it's also the number and kind of transactions. If all you do is 10 simple selects a day, I agree, performance is not an issue. If you need to do 10,000 queries an hour, performance is an issue, even if you have only 270 products. Then not doing the join can certainly be noticeable.

I've been a DBA in a previous job. My response to the original question is that the post doesn't contain enough information to say whether the DBA is right or not. What (s)he says certainly does have merit. But whether the performance gain is noticeable depends on a lot of things. How many selects? How many updates? How many concurrent users accessing the tables? How much cache? What else is going on? It requires an awful lot of knowledge, not only of your database server's strategies but also of which transactions are going to be performed, to say which design will give the best performance. Perhaps only realistic testing can tell the difference. And then you still need to make the trade-off between performance and clean design.

Note that the drawback (I could make a typo when inserting something) can be overcome in several ways. One could make an insert trigger that rolls back the transaction if an illegal category is inserted. Or one uses a list of constants in the code, or one makes a layer between the application(s) and the database that does the actual creation of SQL -- and never have applications touch the database directly. The layer could consist of some libraries (modules), but it could also consist of stored procedures. Or some combination.

-- Abigail
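For instance, the trigger idea might look like this in SQLite syntax (every RDBMS spells triggers differently, and the category names here are invented):

    # Assumes a connected DBI handle in $dbh and the denormalized products
    # table with a free-text category column.
    $dbh->do(q{
        CREATE TRIGGER products_category_check
        BEFORE INSERT ON products
        WHEN NEW.category NOT IN ('Books', 'Music', 'Video')
        BEGIN
            SELECT RAISE(ABORT, 'unknown category');
        END
    });
    # A matching BEFORE UPDATE trigger closes the other hole.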
by adamsj (Hermit) on Jul 13, 2001 at 18:53 UTC
What she said. Speaking as a DBA who doesn't spend all his time being a DBA, I love normalized data. I also understand why sometimes denormalizing is the right strategy, and that there are ways to maintain integrity while denormalizing. They laughed at Joan of Arc, but she went right ahead and built it. --Gracie Allen
Re: Database Design Issues - OT
by adamsj (Hermit) on Jul 13, 2001 at 02:29 UTC
Why is normalization a good thing? It saves space and eliminates anomalous data. In small tables, the space saving is not such a big deal. In tables that aren't updated, the anomalies are more easily avoided. There can be a performance gain from avoiding joins, and denormalization is often employed for exactly this reason.

If your apps are doing only selects against a small table, then I'd say go with the DBA on this question. However, if you're talking about doing inserts and updates against the table, then you have a different situation. One approach is to keep data in normalized tables for maintenance, then do joins to create denormalized tables to run against. Sometimes the normalized tables are in a separate database, joined there, then exported. By the way, what RDBMS are you using?

They laughed at Joan of Arc, but she went right ahead and built it. --Gracie Allen
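A sketch of that approach with invented names -- the normalized tables stay authoritative, and a flat read-only copy is rebuilt for the application to run against:

    # Assumes a connected DBI handle in $dbh.
    $dbh->do('DROP TABLE IF EXISTS products_flat');
    $dbh->do(q{
        CREATE TABLE products_flat AS
        SELECT p.product_id, p.name, c.name AS category
        FROM   products   p
        JOIN   categories c ON c.category_id = p.category_id
    });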
Re: Database Design Issues - OT
by toma (Vicar) on Jul 13, 2001 at 09:01 UTC
For example, I ran a query to find unreasonable values in a database that held data about electronic components. I found many problems quickly. After that, a normalized version of the database was implemented with many sanity checks and metadata stored in tables.

The good news is that the new database has much higher data quality, and a user interface to the database can be created directly from the metadata. The bad news is that the sanity checks require maintenance, and you need to join at least six tables to get anything out of the database. Many users of the database don't understand the schema, so they tend to dump large portions of it (effectively denormalizing it themselves) and then use text filters to get at the data. Normalized databases require a deeper understanding of the data, and many people will gladly work hard so that they don't have to understand something.

The highest-performance approach that uses this wonderfully normalized database queries the database in a mod_perl program, creating a denormalized, lightning-fast, easy-to-search data structure for accessing the data. Programs that need to update the database still have to work with the complicated schema. To know that this approach would be useful, we had to know something important in advance: we wanted high-speed queries, and we knew that updates would only occur in a well-controlled nightly batch job.

A deep understanding of the fundamental problem that your project is trying to solve will enable you to do a much better job of analyzing the pros and cons of design issues. It's great to think things like "Normalization is almost always good." It's even more impressive to know something about the application that will show how well your approach will work in the long run. It is a really good sign that you are willing to ask about such deep issues and keep an open mind. Keep thinking! Avoiding these issues is false laziness.

Update: Tried to fix some grammar. It should work perfectly the first time!

- toma
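The idea, sketched with invented table names (not the real schema): at server startup, flatten the normalized tables into one Perl structure that read-only requests can search directly.

    # Assumes a connected DBI handle in $dbh.
    my %part;    # part number => hashref of denormalized fields
    my $sth = $dbh->prepare(q{
        SELECT p.part_no, p.description, c.name AS category, m.name AS maker
        FROM   parts p
        JOIN   categories    c ON c.category_id     = p.category_id
        JOIN   manufacturers m ON m.manufacturer_id = p.manufacturer_id
    });
    $sth->execute;
    while ( my $row = $sth->fetchrow_hashref ) {
        $part{ $row->{part_no} } = $row;
    }
    # Lookups now hit %part; only the nightly batch job has to understand
    # the full schema.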
Re: Database Design Issues - OT
by simon.proctor (Vicar) on Jul 13, 2001 at 12:32 UTC
I hate to say it, but I think that the only case for denormalisation comes once you have a fully normalised database -- in other words, at the end, during the performance test phase. Not earlier. Whenever I do any denormalisation, I tend to replicate IDs to simplify my joins to a smaller number of tables (and have my code pull the IDs across), or I build a new table based on the type of data I am using (normally I use this just for reporting).

I have to agree with the other posts relating, mainly, to scalability. You don't know what will happen to the code and db once it leaves your grasp -- but when it fails it will fall to you to fix it.

As an aside, I always build category <-> product relationships along the following lines:

    CATEGORY
    ~~~~~~~~
    ID
    name
    parent_ID

    PRODUCT
    ~~~~~~~
    ID
    name
    .....

    PRODUCT_CATEGORY_LINK
    ~~~~~~~~~~~~~~~~~~~~~
    category_ID
    product_ID

What do you think? If you have a flat list of categories, all your 'parent_ID' entries will be 0 (default them to 0 in your DB design). I'm not saying this is the best way to do it, but when people have argued with me (during various projects) I have managed to argue this structure well enough to keep it :)

Good luck :)
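And a sample query through the link table -- all products in a given category (the category value here is made up):

    # Assumes a connected DBI handle in $dbh.
    my $products = $dbh->selectcol_arrayref(q{
        SELECT p.name
        FROM   product p
        JOIN   product_category_link l ON l.product_ID  = p.ID
        JOIN   category c              ON c.ID          = l.category_ID
        WHERE  c.name = ?
        ORDER BY p.name
    }, undef, 'Widgets');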
Re: Database Design Issues - OT
by runrig (Abbot) on Jul 13, 2001 at 02:38 UTC
This made simple searches to find things between two years into a horrible join that caused full table scans. I convinced them to just store the year itself in the YEAR table, and make it the primary key.
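With the year value itself as the key, the range search becomes a plain predicate on the column (table and column names invented for the example):

    # Assumes a connected DBI handle in $dbh.  No join on a surrogate id;
    # "between two years" is an ordinary indexed range scan.
    my $rows = $dbh->selectall_arrayref(
        'SELECT title, year FROM films WHERE year BETWEEN ? AND ?',
        { Slice => {} },
        1995, 2000,
    );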
Re: Database Design Issues - OT
by AidanLee (Chaplain) on Jul 13, 2001 at 08:46 UTC
(jptxs)Re: Database Design Issues: on roasting DBAs
by jptxs (Curate) on Jul 13, 2001 at 19:55 UTC
As for the heart of the issue, I think the one point being missed here is the age-old struggle between the different camps in any IT organization. Speaking as someone who often has to straddle these fences, remember that when approaching these problems you need to look at why the DBA might be saying what he's saying.

tilly says that we need to take a step back and look at this as a programming problem. While I whole-heartedly agree that gets us to the right conclusion (i.e. to normalize mercilessly), what it does not do is get you any closer to getting the DBA to do that. I deal with a lot of DBAs, and a DBA only cares that the DB performs well and is up. If your code is slow b/c of structures in his DB, that is not really his problem. When you suggest a large normalization effort, right or wrong, he hears things running slow in the DB and therefore him getting the short end of the stick.

What you must do is approach the problem with his concerns in mind. Don't demand or simply refer to rules and accepted policy; he doesn't care about that stuff =). Tell him that it will improve performance on the whole, and therefore no one will be coming after him just b/c transactions may spend more of their time on the database side than the application side - if the overall transaction time is better, no one will care! Make sure he understands that you have just as great an interest in maintaining the saner structure, that you'll help with those tasks when needed, and that you'll help from a budget perspective if that means buying some tools (if you can). Basically, walk in his shoes for a bit, then re-approach. He is more than likely just trying to be sure he's doing his job - even if it seems he's not too aware of how it should be done.

We speak the way we breathe. --Fugazi
Re: Database Design Issues - OT
by Anonymous Monk on Jul 13, 2001 at 20:09 UTC
    B  African American
    W  Caucasian
    I  Native American
    M  Hispanic