Re: OT: benefits of database normalization
by dws (Chancellor) on Oct 03, 2004 at 01:37 UTC
|
The data column's meaning changes depending on the value of the type column.
One phrase that comes to mind is "job security". By constructing tricky schemas like the first one, someone makes themselves extra "useful" writing the overly complicated query code necessary to access data correctly, at least until they're taken out and shot by the people who have to maintain and extend the result.
The problem with columns that change their types or meanings depending on other columns (like the type column above) is that complexity gets pushed out in all directions. Anyone who queries the data needs to take on the complexity of deciphering the type field, plus whatever other contortions are necessary to process the data. Standard tools, and standard approaches stop working cleanly. And the more time passes, the more people forget why the poor design was chosen in the first place, which can make it harder to undo the damage.
When I see stuff like this, it's often because someone hasn't gotten over their fear of database JOINs, which can be scary until you've done them a few times. However, JOINs are an essential tool. Trying to avoid them is like being afraid to take off the training wheels on a bicycle.
| [reply] |
Re: OT: benefits of database normalization
by Anonymous Monk on Oct 03, 2004 at 03:59 UTC
|
I agree with you on Design 1 that, while having the context of 'fact' change based on 'type' may work, I think it is going to cause more problems down the road than Design 2. If you are going to do validation on 'fact', you need to check what the 'type' is. If you do any aggregates/statistics on 'fact', you need to make sure you build your query so that type #2 is included properly. It can be done, but it seems like more work to me.
Personally, I would work Design 2 more along the lines of:
Design 3
fact_id fact
------- ----------------------
1 foo is a fooity foo
2 bar is a barrish baaar
name_id name fact_id
------- ---- -------
1 foo 1
2 bar 2
3 foog 1
It may just be a personal preference, but I like to keep all my names in one place, and all my facts in one place. I'm interested whether you had any reasons to use Design 2 rather than Design 3?
| [reply] [d/l] |
|
|
I like that design, but there is a difference here: you can't make the distinction between aliases and "normal facts" in this layout. That may be very useful, but it depends on the problem.
| [reply] |
|
|
Nothing prevents you from adding a "Type" column to the second table to distinguish the two, if you need to. That way you may make the distinction if you want or treat them the same if you don't. Let's see the simplest query "find the description for this name, I don't know whether its a base name or an alias".
First design:
SELECT fact
FROM Facts
WHERE name = ?
and Type = 1
UNION
SELECT BaseFacts.fact
FROM Facts as BaseFacts
JOIN Facts as AliasFacts ON BaseFacts.fact_id = Convert(int,AliasFac
+ts.data)
WHERE AliasFacts.name = ?
and AliasFacts.Type = 2
and BaseFacts.Type = 1
-- and this all works only if you do not ever have an alias to an alia
+s!
Second design:
SELECT fact
FROM Facts
WHERE name = ?
UNION
SELECT fact
FROM Facts
JOIN Aliases ON Aliases.fact_id = Facts.fact_id
WHERE Aliases.name = ?
Third design:
SELECT fact
FROM Facts
JOIN Names ON Names.Fact_id = Facts.fact_id
WHERE Names.Name = ?
Which one do you like best? ;-)
Jenda
We'd like to help you learn to help yourself
Look around you, all you see are sympathetic eyes
Stroll around the grounds until you feel at home
-- P. Simon in Mrs. Robinson |
| [reply] [d/l] [select] |
|
|
Actually, I wouldn't put that information in the second table, but in the first:
Design 3a
fact_id fact orig_name_id
------- ---------------------- ------------
1 foo is a fooity foo 1
2 bar is a barrish baaar 2
name_id name fact_id
------- ---- -------
1 foo 1
2 bar 2
3 foog 1
If you then say orig_name_id must be unique and not null and must reference an existing name_id, the DBMS will automatically check data integrity for you instead of you having to code an extra function to make sure you don't have too few or too many name_ids that claim to be the original. That is one of the main reasons you would want to normalize your data structure: To be able to utilize the DBMS's built-in functions to ensure data integrity. | [reply] |
|
|
I would have made (fact_id, name_id) into a two-column primary key:
fact_id fact
------- ----------------------
1 foo is a fooity foo
2 bar is a barrish baaar
fact_id name_id name
------- ------- -------
1 1 foo
1 2 foog
2 1 bar
Then you could always find the primary name out by querying WHERE fact_id = ? AND name_id = 1. | [reply] [d/l] [select] |
|
|
Ah, I like this design even better. It works similarly to many filesystems. The fact is like an inode, and the name is like a link. Which is actually sort of interesting, because the person I was originally discussing with kept bringing up filesystem analogies. Neither he nor I ever took the analogy to its logical conclusion, though. Many thanks for the reply!
| [reply] [d/l] [select] |
Re: OT: benefits of database normalization
by Velaki (Chaplain) on Oct 03, 2004 at 02:57 UTC
|
A mantra heard frequently in the database world is:
-
The data depends on the key, the whole key, and nothing but the key.
In turn, this leads to frequently complex normalization; however, that can be a good thing in terms of isolating functionality, relationships, and even implementation.
How normal is normal? It depends. Many's the time I see data stored very normalized for complex OLAP manipulation; whereas, I see it flattened out for performance reasons. It's a trade-off. Only you can decide.
Hope that helped,
-v
"Perl. There is no substitute."
| [reply] |
|
|
My take is:
- To fully normalize your opeartional database.
- To denormalize your data warehouses, for mainly performance reason. (the chance that data go out of sync is much lower, as most likely you only add data to data warehouse, but don't update.)
I am everyday dealing with a not normalized operational database, the biggest problem we facing is "out of sync'd data". From time to time, when my users complaining some data is wrong, I go check database, some of the data is stored in 4 or 5 different tables, and they simply have different values, which one is right? which one is wrong? That's a living hell. You have to do database design right at the beginning, don't expect to fix anything later, that is just not possible.
| [reply] |
|
|
How normal is normal? It depends.
True, not everything needs to be normalized to 5NF, but there are benefits to each level of normalization. I was wondering if there were any convincing benefits of Design 2 over Design 1. I like Design 2 much better, but convincing someone else why proved difficult for me.
Basically, I'm curious about the justification for that particular type of normalization. It would probably help greatly if I knew what type of normalization I was talking about, but I don't. That's why I posted the examples. Based on some of the responses, though, they may not have been as helpful as I hoped they would.
| [reply] |
Re: OT: benefits of database normalization
by Joost (Canon) on Oct 03, 2004 at 10:45 UTC
|
A big practical problem with design 1 is that you can't easily let the database guarantee integrity (i.e. make sure that an alias always refers to a fact). Ofcourse you can add triggers and procedures to make it so, but in design 2 you can just declare fact_id a foreign key, and be done with it.
| [reply] |
Re: OT: benefits of database normalization
by saberworks (Curate) on Oct 03, 2004 at 09:02 UTC
|
There are so many reasons to avoid using single columns to hold different types of data. For example, what if you wanted to pull all the data out sorted by "data?" Or what if you wanted to pull out all the unique bits of data? You end up doing crappy type conversions with built-in functions and basically abusing what a database is for. | [reply] |
Re: OT: benefits of database normalization
by jZed (Prior) on Oct 03, 2004 at 01:57 UTC
|
I'm afraid I don't understand your schema. What is a "type"? Is it always just 1 or 2, with 2 meaning that it's an alias and that the data should be interpreted as a fact_id? Or are there other types? If there are other types, how are they represented in design #2?
Also, btw, both "name" and "type" are SQL reserved words and should not be used as column names unless they are quoted. | [reply] |
|
|
Really? I looked at the ANSI SQL-92, SQL-99 and SQL-2003 reserved words lists, and "name" and "type" weren't listed in any of them
| [reply] |
|
|
| [reply] |
|
|
I'm afraid I don't understand your schema . . . are there other types? If there are other types, how are they represented in design #2?
There are only two "types". There are facts, and aliases. I don't really think of them as types, though. I think of them as completely different things. This is the crux of the difference between the two designs.
2 meaning that it's an alias and that the data should be interpreted as a fact_id
Yes, this is how Design 1 was intended to work.
Also, btw, both "name" and "type" are SQL reserved words and should not be used as column names unless they are quoted.
True. They were really just for the sake of the example.
| [reply] |
|
|
create table temp(type number(1), name number(1))
| [reply] [d/l] |
|
|
The fact that Oracle lets you get away with using those doesn't mean that it is a good idea to use them anymore than the fact that MSIE lets you get away with sloppy HTML means that you should write sloppy HTML. If there's any chance that the schema will be ported to another RDBMS in the future (and the OP is asking questions about the long range), then it is better to avoid ISO/ANSI SQL reserved words. Another reason to avoid them is if the schema will be used with parsing, cataloging, or validating software which may or may not recognize them.
| [reply] |
Re: OT: benefits of database normalization
by CountZero (Bishop) on Oct 03, 2004 at 11:57 UTC
|
May I suggest a third design?
NamesAndAliases |
| NAA_id | Name | Definition_id |
| 1 | Foo | 1 |
| 2 | Foo_bis | 1 |
| 3 | Bar | 2 |
Definitions |
| Def_id | Definition |
| 1 | A Foo-ish thing |
| 2 | A Bar-ish thing |
All names and aliases are in one table, if you wish, you can add another column to distinguish between names and aliases, although I see no real necessity. All definitions are in a second table, so updating this table automatically updates the definitions of all related names and aliases.
Update: I only now see that I said the same as the AM in Re: OT: benefits of database normalization
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
| [reply] |
Re: OT: benefits of database normalization
by aquarium (Curate) on Oct 03, 2004 at 12:38 UTC
|
Design 1 is NOT normalized, I agree. The biggest practical problem with non-normalized databases is that Add/Delete/Update become complicated. Here's an example of the problem with design 1: lots of facts and aliases already loaded. when you delete a fact, you have to scan the whole table to delete all matching aliases. same problem when you update the key column of a fact. Also, you can add a alias that doesn't point to a valid fact. So all this logic gets put into the code of the application, instead of having a normalized database that enforces these relationships. Most such happy coders get the code 99% right, and one day the %1 sneaks in, and someone has to clean up the mess and build it right (normalized) as it should have been.
There are cases when de-normalization is used, in strict control, and only for specific reasons....Perhaps "they" should be giving you reasons WHY to use a de-normalized model; instead of you justifying normalization.
BTW you can easily port a system between different languages and databases when the db is normalized...an application centric (non-normalized db) system is not easily ported. | [reply] |
|
|
I would say the same thing as well.
The problem with the non-normalized design of Model 1 is that the meaning of the data depends on your one application. Any other application that ever needs to access the database would have to recreate the routines in your application to derive meaning from the stored values.
That goes for not just SELECT statement, but what happens in the first model when a dictionary word is deleted? You are likely to have a lot of orphaned aliases. You would have to manually delete all related entries, or ensure that id number was never used again, even though it would now be free and unique. Not just your app, but for every other application that needed to work with this information ever. It's likely data will be corrupted over time.
When a database is properly modelled, the meaning, context, integrity, etc, is inherent in the data itself. It can be trusted to be right.
| [reply] |
Re: OT: benefits of database normalization
by exussum0 (Vicar) on Oct 03, 2004 at 11:51 UTC
|
I work with design one day to day, and I'll tell you why it's not ideal. If it lacks a single column primary key, some databases, like oracle, make it harder to do an outter join. The way to get around the oracle error is..
select
*
from (
select
a.pk1
from
table a,b
where
a.pk1 = b.pk1 (+)
) a_outter_joined,b
a_outter_joined.pk2 = b.pk2 (+) and
(a_outter_joined.pk1 = b.pk1 or a_outter_joined.pk1 = null )
You think you'd be able to do..
select
*
from
a,b
where
a.pk1 = b.pk1 (+) and
a.pk2 = b.pk2 (+)
Maybe I've assumed that there's a missing single key column that was just left out to make an illustration, but I rather not assume... Oracle 8 generates this error for the above query.
Update: Updated which version of oracle gens the error and which query does what.
----
Then B.I. said, "Hov' remind yourself
nobody built like you, you designed yourself"
| [reply] [d/l] [select] |
|
|
"If it lacks a single column primary key, some databases, like oracle, make it harder to do an outter join."
I am with dragonchild on this, and your assertion is not true. I deal with Oracle everyday, and never heard of this, as a matter of fact, most of our tables don't have single column primary key.
Go even deeper, personaly I am against the idea of using artificial numbers as keys. If the primary key takes 3 or 4 columns, just take. In this situation, some people will make up a sequence number as single column primary key, I don't.
Something a bit off topic, I hate using current time as part of the primary key. In this system I am supporting now, there are several tables are like this, and every time when two users try to insert to the same table at same time, one of them get kicked because of unique constraint. That's ugly, and is a design defect.
| [reply] |
|
|
If it lacks a single column primary key, some databases, like oracle, make it harder to do an outter join.
Could you back up this assertion? I've never heard this and I work with an Oracle9i database that has no primary keys whatsoever. (Long story ... I have no control over it.) If this were true, I and the DBA could take it to our VP as reason #2348769 as to why the application developers shouldn't have control over the schema.
Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
| [reply] |
|
|
The oracle error is "ORA-01417: a table may be outer joined to at most one other table."
Someone else has the problem
here. We use 9i at work as well, and I'll double check version numbers to get you something more specific in terms of version numbers in 24 hours. But last I checked, when I tried it, it barfed that error at me.
----
Then B.I. said, "Hov' remind yourself
nobody built like you, you designed yourself"
| [reply] |
|
|
|
|
Re: OT: benefits of database normalization
by pg (Canon) on Oct 03, 2004 at 01:38 UTC
|
Cannot see any problem with design 1. Just base on what you gave, I have to say that it is normailized.
It would become not normailized, for example, if you also have type description in this same table.
Ideally, and I guess design 1 does, you shall have another table called type, which stores things like type id, type description etc. And the type column of the fact table shall be a foreign key points to the type number column of type table.
I don't like design 2.
| [reply] |
|
|
Cannot see any problem with design 1.
You don't think it's wrong to mix two different things into one table? Or are you of the opinion that Aliases and Facts are two "types" of the same thing?
I have to say that it is normailized.
Surely one column's meaning changing depending on the value of another is a violation of some rule of normalization? Like I said, I don't know the names of all the rules, and what they mean exactly, but this seems wrong.
I don't like design 2.
Any particular reasons?
| [reply] |
|
|
My understanding was based on the name of the columns: "type". I don't deny that there is a chance of misunderstanding. Column type might mean something that really not sort of "type". To really appreciate whether design 1 is actually normalized, I have to judge base on the meaning of the true data, which you didn't really specify in your original post. (It would be much easy if we are in front of a whiteboard now ;-).
To clarify, what I meant was: (I have to use something that is more meaningful to everyone, something that is part of everyone's daily life, and yet describes my understanding of your design 1. ProductOf is my understanding of your "type".)
| Fruit | ProductOf | Taste |
| Apple1 | 1 | Some Taste |
| Apple2 | 2 | Some Taste |
| Orange1 | 1 | Some Taste |
| country code | country name |
| 1 | US |
| 2 | Canada |
This is normalized, but will be not normalized, if:
| Fruit | ProductOf | country name | Taste |
| Apple1 | 1 | US | Some Taste |
| Apple2 | 2 | Canada | Some Taste |
| Orange1 | 1 | US | Some Taste |
| [reply] |
|
|
Re: OT: benefits of database normalization
by herveus (Prior) on Oct 04, 2004 at 12:02 UTC
|
Howdy!
In the first design, the meaning of the "data" column depends on a non-key
column. This would then appear to be a violation of 3NF. The data column
has a dependency on the type column, and "type" is not even part of the
primary key.
This makes it a lot more work to ensure the integrity of the data. What is
there to prevent you from changing the value of the type column in a way
that makes the data column meaningless?
Your description of the columns makes it clear that the scheme proposed by
Anonymous Monk in Re: OT: benefits of database normalization as "design 3" more accurately captures the
data model. In that model, the data type on the column will (in most DMBSs)
suffice to keep the data mostly right, with simple referential integrity
constraints covering the rest of the bases.
| [reply] |
|
|
the meaning of the "data" column depends on a non-key column. This would then appear to be a violation of 3NF
Ah, thank you for spelling that out. This is something I was still curious about. I have read the formal normalization rules many times, but actually seeing which applies where is still difficult for me. Seeing a problem in action, then knowing what it violates, is very helpful.
"design 3" more accurately captures the data model
Indeed, I like Design 3 more than my own Design 2. I feel kind of silly not thinking of it right away, but to my own defense, Design 2 was an evolution of what I'll now call Design 0. In Design 0, there were no aliases, only facts. When the idea of aliases came up, I proposed adding another table, and someone else proposed adding another column to the existing table. Thus, this node was born. :-)
| [reply] |
|
|
Howdy!
Indeed, I like Design 3 more than my own Design 2. I feel kind of silly not thinking of it right away, but to my own defense, Design 2 was an evolution of what I'll now call Design 0. In Design 0, there were no aliases, only facts. When the idea of aliases came up, I proposed adding another table, and someone else proposed adding another column to the existing table. Thus, this node was born. :-)
This makes Design 3 even more clearly the Right Way To Go.
Adding the elided step, having more than one name for the same fact could be
done by simply adding multiple rows with the same fact, but a different name.
IIRC, that could be a violation of 2NF, so you extract the "fact" column to
a separate table and replace the "complicated" value with a simple key and
rely on foreign key constraints to keep things honest. If the "fact" changes,
you change one datum in one place to accomplish that. If you want to delete a
"fact", either you cascade the delete to kill all the names that refer to it,
or you require that the names be nuked first. Adding/changing/deleting names
for a given fact is trivial.
Once the relationship between "name" and "fact" goes beyond one-to-one, you
have to factor out the "many" side into its own table that refers to the
"one" side.
...now if the rules change so that a name could refer to more than one fact,
you have to make a third table that relates names to facts...but that is
probably absurd for the situation at hand.
| [reply] |
Re: OT: benefits of database normalization
by TedPride (Priest) on Oct 03, 2004 at 21:48 UTC
|
id = int
type = int
name = char (20?)
desc = char (255?)
ID will be primary key and auto increment, so you don't have to worry about duplicate IDs. Type will be int, since mySQL uses int space for bools anyway, and you might want other types in the table (flagged for checking, flagged for delete, etc). Name will be whatever the max size of your words is. Desc will be the max size of your descriptions. You can use different field names if name and type give errors, and aliases will have ID-delimiter-name in the desc field.
Yes, this does make for potential problems if you remove a word and forget to also remove all aliases pointing to it, but a simple DELETE FROM table WHERE desc = 'ID-delimiter-name' will fix that. Just do it before you delete the word, NOT after. The only downside is that your aliases take up extra space (description length minus alias info length), but then again, you only have to do one query per word access, not two (if you have aliases or descriptions in a separate table), and you don't have fields that aren't being used (if you keep alias info separate from description field in the same table). Alias deletion itself does not have to be efficient, since you will hardly ever delete words. | [reply] |
|
|
Maybe I'm misunderstanding something, but how is this different from Design 1? It looks exactly the same, just with a bit more detail on the data types, and different column names.
| [reply] |
Re: OT: benefits of database normalization
by tilly (Archbishop) on Oct 09, 2004 at 19:08 UTC
|
I'm entering this discussion very, very late to say that I'm surprised by the poor quality of this discussion. While several people did come up with the obvious schema of moving definitions off into their own table, nobody gave you a simple argument why that particular design is obviously right and the others are obviously flawed.
This is supposed to be a dictionary. The most common query that you'll have is to give me the definition of a name. Try to write that query with each schema. Keep in mind that over time if you allow A to be an alias to B, then some day someone is going to make B be an alias to C.
Any schema where you cannot write that query easily, and have it work after obvious data manipulations potentially mess you up, should be rejected in favour of a schema where common stuff is simple. And that holds for other design problems. When you're handed a design, say to yourself, "What do I think that I'll need to do, and how will I do it with this design?" If common operations turn into messes, the design should be viewed with suspicion. | [reply] |
|
|
I'm surprised by the poor quality of this discussion
Some of it may be my fault. My initial question didn't describe the problem at all, only contained the sample data. I thought it would be obvious what I was getting at, but apparently I was wrong.
Try to write that query with each schema
Jenda did indeed illustrate the differences between the designs with sample queries for each one. This is a good general technique, though, and probably deserves to be emphasized. I'll keep it in mind in the future.
| [reply] |
|
|
Oops. I (obviously) missed that post.
| [reply] |
Re: OT: benefits of database normalization
by johnnywang (Priest) on Oct 03, 2004 at 22:05 UTC
|
This is probably design 4, I would do the following as my first try, just one table
| fact_id | name | parent | fact |
| 1 | foo | NULL | foo is fooity foo |
| 2 | bar | NULL | bar is barrish bar |
| 3 | foog | 1 | NULL |
Basically the "parent" column indicates where it is pointed to. | [reply] |
|
|
I have to say, I don't like this one much at all. It's marginally better than Design 1, because there isn't a column that changes meaning depending on another column's value. But the fact that either the parent or the fact are NULL in every record just makes for wasted space.
Also, in your design, just like Design 1, there is some data consistency logic that must be present in the application layer. I agree with others that this type of logic should almost always be encapsulated in the database engine. For instance, what happens if a fact has both a parent and a fact? You say your application won't let that happen? Well, maybe it was inserted by someone else's buggy code. No matter how it happened, there is now an inconsistency in the database. With Design 3, that cannot happen, because the database engine won't let it (assuming a sufficiently advanced database engine, but that's not really the issue here).
| [reply] [d/l] [select] |