in reply to They do respect set theory!
in thread (OT) Why SQL Sucks (with a little Perl to fix it)

It might have the same values for all attributes as some other row of the table, but it's still unique. Another way to think about it is that there is a hidden "unique ID" column in every table (you need only add such a column if the table has no primary key).

At first blush this seems like a compelling argument and I've struggled to find just the perfect example to explain the problem. I can't, so I'll go with an imperfect example. First, I want to paraphrase what you said so that if I have misunderstood you, you can correct me.

Consider the "supplier_parts" table:

+------------------+ | Supplier_Product | +------------------+ | SNO | PNO | +---------+--------+ | S1 | P1 | | S1 | P1 | | S1 | P2 | +------------------+

Your argument is that we can either infer the existence of a hidden "unique ID" or put an explicity primary key on the table to make the rows a set. You are right that this would turn the row into a set. However, it also violates 3NF. By substituting a clearer example, I can explain this better:

+-----------------+ | City | Country | +-------+---------+ | Paris | France | | Paris | France | | Paris | USA | +-----------------+

What you're suggesting is that I could do this:

+----------------------+ | ID | City | Country | +----+-------+---------+ | 01 | Paris | France | | 02 | Paris | France | | 03 | Paris | USA | +----------------------+

The bag of rows also constitute a set, but now, "Paris, France" is not dependent on the ID as required by 3NF. Assuming these truly represent the same city, are we referring to 01 or 02? Which one should other records point to? Merely adding a unique ID does appear to satisy the "set" constraint, but since it violates 3NF, we now have a denormalized database and cannot maintain data integrity.

Cheers on catching my bad SQL, though :)

Cheers,
Ovid

New address of my CGI Course.

Replies are listed 'Best First'.
Re^2: They do respect set theory! (and violate 3NF: yeah GIGO)
by demerphq (Chancellor) on Dec 12, 2005 at 08:20 UTC

    Ovid this reasoning is flawed. The table with an identity column violates normalization no more than the original table did. All it does is give you a basis to understand the previous results. The DB at some level maintains those values anyway, all that adding the column does is make them explicit and available for reference. Redo your original queries to show the ID's and youll see the dupes in your output are due to dupes in your input.

    Incidentally adding the id doesnt violate normalization. What violates normalization is the two dupe rows in the table (by definition, any table that contains duplicate data in two columns across multiple rows is not fully normalized). Adding the identity column doesnt change that fact: Your claim that this means you cannot maintain data integrity is rubbish. On the contrary identity columns make it easier to maintain data integrity. Instead of useing "City/Country" as the key to the table, you use the ID as the key, and then put a unique index on City/Country, now your city/country data is guaranteed to be a set and the id's allow non ambiguous reference to a particular location object in the table.

    Assuming these truly represent the same city, are we referring to 01 or 02?

    You have no way of differentiating the two entries without the identity column. You have a way of differentiating the two entries with one. With the identity column references into that table are unambiguous, without the identity column references into that table are ambiguous. Without a constraint on duplicate values in the key columns in the table and without an identity column, the table cannot be unambiguously addressed for all data sets it contains. Ideally you should have both, a identity and a column level constraint prohibiting dupe values.

    End result: you are still using Garbage input to prove that the RDBMS is doing something wrong. Which is just silly.

    ---
    $world=~s/war/peace/g

      Incidentally adding the id doesnt violate normalization.

      But I didn't make that claim. I did claim that the resulting table violated 3NF, but I didn't state that it was because of the addition of the ID. I wasn't sure why you thought I was making that claim, but when I reread, I saw that I wrote "[the table ] also violates 3NF." I think I should have written "[the table ] still violates 3NF". However, I've written a lot of responses in this thread and I guess this one could have been clearer. Sorry for the confusion.

      What violates normalization is the two dupe rows in the table (by definition, any table that contains duplicate data in two columns across multiple rows is not fully normalized).

      Agreed.

      Adding the identity column doesnt change that fact: Your claim that this means you cannot maintain data integrity is rubbish.

      "Rubbish?" I view that as a pretty strong word in this context. If I've offended you or made you mad, I'm not sure how but this certainly wasn't my intent.

      As for data integrity: I stated that we couldn't maintain data integrity in with a denormalized database, not with any database.

      Cheers,
      Ovid

      New address of my CGI Course.

        Hi Ovid, I'm sorry maybe I did get a bit irked here. It seemed to me that you were saying that by adding the identity column you violate normalization and because of that the point that calenda made was wrong. Which is something that I disagree with pretty strongly as I think calenda had made a crucial point. I had read some of your other replies in the thread and you seemed to use this same argument to dismiss several points made elsewhere and altogether I guess it probably rubbed me the wrong way. I apologise for being so crusty, i should know not to reply to nodes before i finish my morning coffee. (Yes its an old excuse, but no less valid because of its age :-)

        But regarding data integrity, actually I do think the argument you put forth is unsound. Data integrity is maintained when you have can unambiguously reference a piece of information. When you put an identity column on a table that makes each row in the table unique, thus ensuring entity integrity, which in turn alows you to enforce referential integrity. Just because a table is denormalized doesnt mean its integrity is violated. Denormalization implies that maintaining integrity will be harder work, but it doesnt mean its impossible.

        BTW, i can imagine valid scenarios where a table could contain two field duplicated like your Paris,France example where data integrity wouldn't be an issue at all. For instance a table holding localized strings might inadvertantly contain two duplicate strings for two or more languages. For instance German/Austrian/Swiss German and even Dutch, Yiddish Afrikaans, English and several others could share common words for various things such as placenames but still have enough different that you wouldn't bother seperating out the dupes. Likewise you might see semantic duplicates like Paris and PARIS in the same table for similar reasons. None of this stuff violates integrity.

        I found a good link on data integrity, unfortunately the formatting on the page is pretty crap, but imo it is a good comprehensive list of the attributes of data integrity. The engine its talking about is Sql Server 7, but it shouldnt matter much for this discussion. Note the key points about entitiy, referential and domain integrity: Entity integrity ensures that each row in the table is uniquely identified, Referential integrity is concerned with keeping the relationships between tables synchronized, and Domain integrity requires that a set of data values fall within a specific range (domain) in order to be valid, with user defined integrity being the last. Only user integrity can be violated by two unambiguous rows having attributes in common, and that type of integrity is outside the range of relational logic.

        ---
        $world=~s/war/peace/g