OT: benefits of database normalization

revdiablo has asked for the wisdom of the Perl Monks concerning the following question:

I was having a discussion with someone the other day, and was having a hard time thinking of the practical benefits of one database design over another. I was advocating the more normalized version, whereas the other party was advocating the non-normalized one.

Consider these two designs:

Design 1

Facts

fact_id type name data

1 1 foo foo is a fooity foo

2 1 bar bar is a barrish baaar

3 2 foog 1

Design 2

Facts

fact_id name fact

1 foo foo is a fooity foo

2 bar bar is a barrish baaar

Aliases

alias_id fact_id name

1 1 foog

Update: these designs attempt to store a list of facts (think of dictionary definitions), and aliases. A fact is comprised of a word (what I called name), and a definition (what I called fact). There are also aliases, which have a word (again, name), but no definition. They point to another fact for the definition. This way, when the original fact is changed, the aliased one appears to be changed too.

To my eyes, Design 1 is not normalized to a satisfactory extent (i.e. none at all). The data column's meaning changes depending on the value of the type column. Even though I am not well-versed enough to cite what principle of normalization this violates, it feels wrong to me.

Despite this intuitive feeling that it's "just wrong," I had a hard time articulating what about it is wrong. I could not think of any practical reasons to use Design 2. I cited what I consider a powerful abstract reason, separation of cocerns. To me, Facts and Aliases are different types of values, and should not be shoehorned into the same table. The other party did not agree with this assessment, and continued to demand practical reasons.

So I come to you, my fellow Monks, in search of answers. Perhaps I am dead wrong, and the original design is perfectly acceptible? Maybe there is a more powerful abstract reason I can not think of? Or perhaps there are even some good practical reasons? Any responses will be greatly appreciated.

Comment on OT: benefits of database normalization Select or Download Code

Replies are listed 'Best First'.

Re: OT: benefits of database normalization
by dws (Chancellor) on Oct 03, 2004 at 01:37 UTC

The data column's meaning changes depending on the value of the type column.

One phrase that comes to mind is "job security". By constructing tricky schemas like the first one, someone makes themselves extra "useful" writing the overly complicated query code necessary to access data correctly, at least until they're taken out and shot by the people who have to maintain and extend the result.

The problem with columns that change their types or meanings depending on other columns (like the type column above) is that complexity gets pushed out in all directions. Anyone who queries the data needs to take on the complexity of deciphering the type field, plus whatever other contortions are necessary to process the data. Standard tools, and standard approaches stop working cleanly. And the more time passes, the more people forget why the poor design was chosen in the first place, which can make it harder to undo the damage.

When I see stuff like this, it's often because someone hasn't gotten over their fear of database JOINs, which can be scary until you've done them a few times. However, JOINs are an essential tool. Trying to avoid them is like being afraid to take off the training wheels on a bicycle.

Re: OT: benefits of database normalization
by Anonymous Monk on Oct 03, 2004 at 03:59 UTC

Personally, I would work Design 2 more along the lines of:

Design 3

fact_id fact
------- ----------------------
1       foo is a fooity foo
2       bar is a barrish baaar

name_id name fact_id
------- ---- -------
1       foo  1
2       bar  2
3       foog 1
[download]

Re^2: OT: benefits of database normalization

by Joost (Canon) on Oct 03, 2004 at 10:40 UTC

"What should it profit a man, if he should win a flame war, yet lose his cool?"

Re^3: OT: benefits of database normalization

by Jenda (Abbot) on Oct 03, 2004 at 16:40 UTC

Nothing prevents you from adding a "Type" column to the second table to distinguish the two, if you need to. That way you may make the distinction if you want or treat them the same if you don't. Let's see the simplest query "find the description for this name, I don't know whether its a base name or an alias".

First design:

SELECT fact
  FROM Facts
 WHERE name = ?
   and Type = 1
UNION
SELECT BaseFacts.fact
  FROM Facts as BaseFacts
  JOIN Facts as AliasFacts ON BaseFacts.fact_id = Convert(int,AliasFac
+ts.data)
 WHERE AliasFacts.name = ?
   and AliasFacts.Type = 2
   and BaseFacts.Type = 1
-- and this all works only if you do not ever have an alias to an alia
+s!
[download]

SELECT fact
  FROM Facts
 WHERE name = ?
UNION
SELECT fact
  FROM Facts
  JOIN Aliases ON Aliases.fact_id = Facts.fact_id
 WHERE Aliases.name = ?
[download]

SELECT fact
  FROM Facts
  JOIN Names ON Names.Fact_id = Facts.fact_id
 WHERE Names.Name = ?
[download]

Jenda
We'd like to help you learn to help yourself
Look around you, all you see are sympathetic eyes
Stroll around the grounds until you feel at home
-- P. Simon in Mrs. Robinson

[reply]
[d/l]
[select]

Re^3: OT: benefits of database normalization

by mhi (Friar) on Oct 04, 2004 at 09:21 UTC

Design 3a

fact_id fact                   orig_name_id
------- ---------------------- ------------
1       foo is a fooity foo    1
2       bar is a barrish baaar 2

name_id name fact_id
------- ---- -------
1       foo  1
2       bar  2
3       foog 1

If you then say orig_name_id must be unique and not null and must reference an existing name_id, the DBMS will automatically check data integrity for you instead of you having to code an extra function to make sure you don't have too few or too many name_ids that claim to be the original.

That is one of the main reasons you would want to normalize your data structure: To be able to utilize the DBMS's built-in functions to ensure data integrity.

Re^3: OT: benefits of database normalization

by Pragma (Scribe) on Oct 06, 2004 at 21:51 UTC

(fact_id, name_id)

fact_id fact
------- ----------------------
1       foo is a fooity foo
2       bar is a barrish baaar

fact_id name_id name
------- ------- -------
1       1       foo
1       2       foog
2       1       bar
[download]

WHERE fact_id = ? AND name_id = 1

[reply]
[d/l]
[select]

Re^2: OT: benefits of database normalization

by revdiablo (Prior) on Oct 03, 2004 at 20:05 UTC

Ah, I like this design even better. It works similarly to many filesystems. The fact is like an inode, and the name is like a link. Which is actually sort of interesting, because the person I was originally discussing with kept bringing up filesystem analogies. Neither he nor I ever took the analogy to its logical conclusion, though. Many thanks for the reply!

[reply]
[d/l]
[select]

Re: OT: benefits of database normalization
by Velaki (Chaplain) on Oct 03, 2004 at 02:57 UTC

A mantra heard frequently in the database world is:

The data depends on the key, the whole key, and nothing but the key.

In turn, this leads to frequently complex normalization; however, that can be a good thing in terms of isolating functionality, relationships, and even implementation.

How normal is normal? It depends. Many's the time I see data stored very normalized for complex OLAP manipulation; whereas, I see it flattened out for performance reasons. It's a trade-off. Only you can decide.

"Perl. There is no substitute."

Re^2: OT: benefits of database normalization

by pg (Canon) on Oct 03, 2004 at 03:11 UTC

My take is:

To fully normalize your opeartional database.
To denormalize your data warehouses, for mainly performance reason. (the chance that data go out of sync is much lower, as most likely you only add data to data warehouse, but don't update.)

I am everyday dealing with a not normalized operational database, the biggest problem we facing is "out of sync'd data". From time to time, when my users complaining some data is wrong, I go check database, some of the data is stored in 4 or 5 different tables, and they simply have different values, which one is right? which one is wrong? That's a living hell. You have to do database design right at the beginning, don't expect to fix anything later, that is just not possible.

Re^2: OT: benefits of database normalization

by revdiablo (Prior) on Oct 03, 2004 at 03:11 UTC

How normal is normal? It depends.

True, not everything needs to be normalized to 5NF, but there are benefits to each level of normalization. I was wondering if there were any convincing benefits of Design 2 over Design 1. I like Design 2 much better, but convincing someone else why proved difficult for me.

Basically, I'm curious about the justification for that particular type of normalization. It would probably help greatly if I knew what type of normalization I was talking about, but I don't. That's why I posted the examples. Based on some of the responses, though, they may not have been as helpful as I hoped they would.

Re: OT: benefits of database normalization
by Joost (Canon) on Oct 03, 2004 at 10:45 UTC

"What should it profit a man, if he should win a flame war, yet lose his cool?"

Re: OT: benefits of database normalization
by saberworks (Curate) on Oct 03, 2004 at 09:02 UTC

There are so many reasons to avoid using single columns to hold different types of data. For example, what if you wanted to pull all the data out sorted by "data?" Or what if you wanted to pull out all the unique bits of data? You end up doing crappy type conversions with built-in functions and basically abusing what a database is for.

Re: OT: benefits of database normalization
by jZed (Prior) on Oct 03, 2004 at 01:57 UTC

Also, btw, both "name" and "type" are SQL reserved words and should not be used as column names unless they are quoted.

Re^2: OT: benefits of database normalization

by astroboy (Chaplain) on Oct 03, 2004 at 08:53 UTC

Really? I looked at the ANSI SQL-92, SQL-99 and SQL-2003 reserved words lists, and "name" and "type" weren't listed in any of them

Re^3: OT: benefits of database normalization

by jZed (Prior) on Oct 03, 2004 at 15:25 UTC

A Guide to the SQL Standard, C.J. Date, p. 32.

So I apologize for over-stating that these are reserved, but stand by my advice to avoid them as identifiers unless delimited.

Re^2: OT: benefits of database normalization

by revdiablo (Prior) on Oct 03, 2004 at 02:19 UTC

I'm afraid I don't understand your schema . . . are there other types? If there are other types, how are they represented in design #2?

There are only two "types". There are facts, and aliases. I don't really think of them as types, though. I think of them as completely different things. This is the crux of the difference between the two designs.

2 meaning that it's an alias and that the data should be interpreted as a fact_id

Yes, this is how Design 1 was intended to work.

Also, btw, both "name" and "type" are SQL reserved words and should not be used as column names unless they are quoted.

True. They were really just for the sake of the example.

Re^2: OT: benefits of database normalization

by pg (Canon) on Oct 03, 2004 at 02:19 UTC

"Also, btw, both "name" and "type" are SQL reserved words and should not be used as column names unless they are quoted."

Not true, at least not true to all database implementation. I just tried this in Oracle, and it worked with no problem:

create table temp(type number(1), name number(1))
[download]

Re^3: OT: benefits of database normalization

by jZed (Prior) on Oct 03, 2004 at 05:10 UTC

The fact that Oracle lets you get away with using those doesn't mean that it is a good idea to use them anymore than the fact that MSIE lets you get away with sloppy HTML means that you should write sloppy HTML. If there's any chance that the schema will be ported to another RDBMS in the future (and the OP is asking questions about the long range), then it is better to avoid ISO/ANSI SQL reserved words. Another reason to avoid them is if the schema will be used with parsing, cataloging, or validating software which may or may not recognize them.

Re: OT: benefits of database normalization
by CountZero (Bishop) on Oct 03, 2004 at 11:57 UTC

NamesAndAliases

NAA_id Name Definition_id

1 Foo 1

2 Foo_bis 1

3 Bar 2

Definitions

Def_id Definition

1 A Foo-ish thing

2 A Bar-ish thing

All names and aliases are in one table, if you wish, you can add another column to distinguish between names and aliases, although I see no real necessity.

All definitions are in a second table, so updating this table automatically updates the definitions of all related names and aliases.

Update: I only now see that I said the same as the AM in Re: OT: benefits of database normalization

CountZero

"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: OT: benefits of database normalization
by aquarium (Curate) on Oct 03, 2004 at 12:38 UTC

Design 1 is NOT normalized, I agree. The biggest practical problem with non-normalized databases is that Add/Delete/Update become complicated. Here's an example of the problem with design 1: lots of facts and aliases already loaded. when you delete a fact, you have to scan the whole table to delete all matching aliases. same problem when you update the key column of a fact. Also, you can add a alias that doesn't point to a valid fact. So all this logic gets put into the code of the application, instead of having a normalized database that enforces these relationships. Most such happy coders get the code 99% right, and one day the %1 sneaks in, and someone has to clean up the mess and build it right (normalized) as it should have been. There are cases when de-normalization is used, in strict control, and only for specific reasons....Perhaps "they" should be giving you reasons WHY to use a de-normalized model; instead of you justifying normalization. BTW you can easily port a system between different languages and databases when the db is normalized...an application centric (non-normalized db) system is not easily ported.

Re^2: OT: benefits of database normalization

by keath (Initiate) on Oct 03, 2004 at 13:49 UTC

inherent

Re: OT: benefits of database normalization
by exussum0 (Vicar) on Oct 03, 2004 at 11:51 UTC

select
 *
from (
  select
    a.pk1
  from
     table a,b
  where 
     a.pk1 = b.pk1 (+)
) a_outter_joined,b
a_outter_joined.pk2 = b.pk2 (+) and
(a_outter_joined.pk1 = b.pk1 or a_outter_joined.pk1 = null )
[download]

select
   *
from
   a,b
where
   a.pk1 = b.pk1 (+) and
   a.pk2 = b.pk2 (+)
[download]

----
Then B.I. said, "Hov' remind yourself nobody built like you, you designed yourself"

[reply]
[d/l]
[select]

Re^2: OT: benefits of database normalization

by pg (Canon) on Oct 03, 2004 at 15:11 UTC

"If it lacks a single column primary key, some databases, like oracle, make it harder to do an outter join."

I am with dragonchild on this, and your assertion is not true. I deal with Oracle everyday, and never heard of this, as a matter of fact, most of our tables don't have single column primary key.

Go even deeper, personaly I am against the idea of using artificial numbers as keys. If the primary key takes 3 or 4 columns, just take. In this situation, some people will make up a sequence number as single column primary key, I don't.

Something a bit off topic, I hate using current time as part of the primary key. In this system I am supporting now, there are several tables are like this, and every time when two users try to insert to the same table at same time, one of them get kicked because of unique constraint. That's ugly, and is a design defect.

Re^2: OT: benefits of database normalization

by dragonchild (Archbishop) on Oct 03, 2004 at 14:54 UTC

If it lacks a single column primary key, some databases, like oracle, make it harder to do an outter join.

Could you back up this assertion? I've never heard this and I work with an Oracle9i database that has no primary keys whatsoever. (Long story ... I have no control over it.) If this were true, I and the DBA could take it to our VP as reason #2348769 as to why the application developers shouldn't have control over the schema.

Being right, does not endow the right to be rude; politeness costs nothing.
Being unknowing, is not the same as being stupid.
Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re^3: OT: benefits of database normalization

by exussum0 (Vicar) on Oct 03, 2004 at 15:14 UTC

Someone else has the problem here. We use 9i at work as well, and I'll double check version numbers to get you something more specific in terms of version numbers in 24 hours. But last I checked, when I tried it, it barfed that error at me.

----
Then B.I. said, "Hov' remind yourself nobody built like you, you designed yourself"

Re^4: OT: benefits of database normalization

by pg (Canon) on Oct 03, 2004 at 15:24 UTC

Re^4: OT: benefits of database normalization

by dragonchild (Archbishop) on Oct 04, 2004 at 00:45 UTC

Re: OT: benefits of database normalization
by pg (Canon) on Oct 03, 2004 at 01:38 UTC

Cannot see any problem with design 1. Just base on what you gave, I have to say that it is normailized.

It would become not normailized, for example, if you also have type description in this same table.

Ideally, and I guess design 1 does, you shall have another table called type, which stores things like type id, type description etc. And the type column of the fact table shall be a foreign key points to the type number column of type table.

I don't like design 2.

Re^2: OT: benefits of database normalization

by revdiablo (Prior) on Oct 03, 2004 at 02:23 UTC

Cannot see any problem with design 1.

You don't think it's wrong to mix two different things into one table? Or are you of the opinion that Aliases and Facts are two "types" of the same thing?

I have to say that it is normailized.

Surely one column's meaning changing depending on the value of another is a violation of some rule of normalization? Like I said, I don't know the names of all the rules, and what they mean exactly, but this seems wrong.

I don't like design 2.

Any particular reasons?

Re^3: OT: benefits of database normalization

by pg (Canon) on Oct 03, 2004 at 02:36 UTC

My understanding was based on the name of the columns: "type". I don't deny that there is a chance of misunderstanding. Column type might mean something that really not sort of "type". To really appreciate whether design 1 is actually normalized, I have to judge base on the meaning of the true data, which you didn't really specify in your original post. (It would be much easy if we are in front of a whiteboard now ;-).

To clarify, what I meant was: (I have to use something that is more meaningful to everyone, something that is part of everyone's daily life, and yet describes my understanding of your design 1. ProductOf is my understanding of your "type".)

Fruit	ProductOf	Taste
Apple1	1	Some Taste
Apple2	2	Some Taste
Orange1	1	Some Taste

country code	country name
1	US
2	Canada

This is normalized, but will be not normalized, if:

Fruit	ProductOf	country name	Taste
Apple1	1	US	Some Taste
Apple2	2	Canada	Some Taste
Orange1	1	US	Some Taste

Re^4: OT: benefits of database normalization

by revdiablo (Prior) on Oct 03, 2004 at 02:59 UTC

Re: OT: benefits of database normalization
by herveus (Prior) on Oct 04, 2004 at 12:02 UTC

In the first design, the meaning of the "data" column depends on a non-key column. This would then appear to be a violation of 3NF. The data column has a dependency on the type column, and "type" is not even part of the primary key.

This makes it a lot more work to ensure the integrity of the data. What is there to prevent you from changing the value of the type column in a way that makes the data column meaningless?

Your description of the columns makes it clear that the scheme proposed by Anonymous Monk in Re: OT: benefits of database normalization as "design 3" more accurately captures the data model. In that model, the data type on the column will (in most DMBSs) suffice to keep the data mostly right, with simple referential integrity constraints covering the rest of the bases.

yours,
Michael

Re^2: OT: benefits of database normalization

by revdiablo (Prior) on Oct 04, 2004 at 16:41 UTC

the meaning of the "data" column depends on a non-key column. This would then appear to be a violation of 3NF

Ah, thank you for spelling that out. This is something I was still curious about. I have read the formal normalization rules many times, but actually seeing which applies where is still difficult for me. Seeing a problem in action, then knowing what it violates, is very helpful.

"design 3" more accurately captures the data model

Indeed, I like Design 3 more than my own Design 2. I feel kind of silly not thinking of it right away, but to my own defense, Design 2 was an evolution of what I'll now call Design 0. In Design 0, there were no aliases, only facts. When the idea of aliases came up, I proposed adding another table, and someone else proposed adding another column to the existing table. Thus, this node was born. :-)

Re^3: OT: benefits of database normalization

by herveus (Prior) on Oct 04, 2004 at 16:52 UTC

Indeed, I like Design 3 more than my own Design 2. I feel kind of silly not thinking of it right away, but to my own defense, Design 2 was an evolution of what I'll now call Design 0. In Design 0, there were no aliases, only facts. When the idea of aliases came up, I proposed adding another table, and someone else proposed adding another column to the existing table. Thus, this node was born. :-)

This makes Design 3 even more clearly the Right Way To Go.

Adding the elided step, having more than one name for the same fact could be done by simply adding multiple rows with the same fact, but a different name. IIRC, that could be a violation of 2NF, so you extract the "fact" column to a separate table and replace the "complicated" value with a simple key and rely on foreign key constraints to keep things honest. If the "fact" changes, you change one datum in one place to accomplish that. If you want to delete a "fact", either you cascade the delete to kill all the names that refer to it, or you require that the names be nuked first. Adding/changing/deleting names for a given fact is trivial.

Once the relationship between "name" and "fact" goes beyond one-to-one, you have to factor out the "many" side into its own table that refers to the "one" side.

...now if the rules change so that a name could refer to more than one fact, you have to make a third table that relates names to facts...but that is probably absurd for the situation at hand.

yours,
Michael

Re: OT: benefits of database normalization
by TedPride (Priest) on Oct 03, 2004 at 21:48 UTC

ID will be primary key and auto increment, so you don't have to worry about duplicate IDs. Type will be int, since mySQL uses int space for bools anyway, and you might want other types in the table (flagged for checking, flagged for delete, etc). Name will be whatever the max size of your words is. Desc will be the max size of your descriptions. You can use different field names if name and type give errors, and aliases will have ID-delimiter-name in the desc field.

Yes, this does make for potential problems if you remove a word and forget to also remove all aliases pointing to it, but a simple DELETE FROM table WHERE desc = 'ID-delimiter-name' will fix that. Just do it before you delete the word, NOT after. The only downside is that your aliases take up extra space (description length minus alias info length), but then again, you only have to do one query per word access, not two (if you have aliases or descriptions in a separate table), and you don't have fields that aren't being used (if you keep alias info separate from description field in the same table). Alias deletion itself does not have to be efficient, since you will hardly ever delete words.

Re^2: OT: benefits of database normalization

by revdiablo (Prior) on Oct 04, 2004 at 05:06 UTC

Maybe I'm misunderstanding something, but how is this different from Design 1? It looks exactly the same, just with a bit more detail on the data types, and different column names.

Re: OT: benefits of database normalization
by tilly (Archbishop) on Oct 09, 2004 at 19:08 UTC

This is supposed to be a dictionary. The most common query that you'll have is to give me the definition of a name. Try to write that query with each schema. Keep in mind that over time if you allow A to be an alias to B, then some day someone is going to make B be an alias to C.

Any schema where you cannot write that query easily, and have it work after obvious data manipulations potentially mess you up, should be rejected in favour of a schema where common stuff is simple. And that holds for other design problems. When you're handed a design, say to yourself, "What do I think that I'll need to do, and how will I do it with this design?" If common operations turn into messes, the design should be viewed with suspicion.

Re^2: OT: benefits of database normalization

by revdiablo (Prior) on Oct 09, 2004 at 20:08 UTC

I'm surprised by the poor quality of this discussion

Some of it may be my fault. My initial question didn't describe the problem at all, only contained the sample data. I thought it would be obvious what I was getting at, but apparently I was wrong.

Try to write that query with each schema

Jenda did indeed illustrate the differences between the designs with sample queries for each one. This is a good general technique, though, and probably deserves to be emphasized. I'll keep it in mind in the future.

Re^3: OT: benefits of database normalization

by tilly (Archbishop) on Oct 10, 2004 at 01:56 UTC

Oops. I (obviously) missed that post.

Re: OT: benefits of database normalization
by johnnywang (Priest) on Oct 03, 2004 at 22:05 UTC

fact_id name parent fact

1 foo NULL foo is fooity foo

2 bar NULL bar is barrish bar

3 foog 1 NULL

Basically the "parent" column indicates where it is pointed to.

Re^2: OT: benefits of database normalization

by revdiablo (Prior) on Oct 04, 2004 at 05:02 UTC

I have to say, I don't like this one much at all. It's marginally better than Design 1, because there isn't a column that changes meaning depending on another column's value. But the fact that either the parent or the fact are NULL in every record just makes for wasted space.

Also, in your design, just like Design 1, there is some data consistency logic that must be present in the application layer. I agree with others that this type of logic should almost always be encapsulated in the database engine. For instance, what happens if a fact has both a parent and a fact? You say your application won't let that happen? Well, maybe it was inserted by someone else's buggy code. No matter how it happened, there is now an inconsistency in the database. With Design 3, that cannot happen, because the database engine won't let it (assuming a sufficiently advanced database engine, but that's not really the issue here).

[reply]
[d/l]
[select]

Back to Seekers of Perl Wisdom