They do respect set theory!

If you think about it just a little differently, they do respect set theory. Consider this.

Tables and views are sets. The rows of the table are isomorphic to the elements of the set. A SELECT query is nothing more than a operation that takes the cross product of zero or more sets (FROM) and then filters this big set (WHERE). But what are the elements of these sets?

The first thought is to consider that the elements are m dimensional vectors, where m is the number of columns in the table. Of course you can't have twice the same vector in a set (= you can't have two identical rows in a result) because that goes against what a set is! But if we use an obvious definition for vector equality, which is the vectors have the same dimension and the components are each equal to each other, we run into trouble from the start: the example tables you give cannot even exist in the first place! (since tables and views, not just results, are supposed to be sets)

So how do we solve this? You have to think of each row in each table as unique among all rows of the same table. Pretend that each row (element of the set) is a unique objec with m attributes. It might have the same values for all attributes as some other row of the table, but it's still unique. Another way to think about it is that there is a hidden "unique ID" column in every table (you need only add such a column if the table has no primary key). Now everything works. The row that is the cross of the first row from parts and the second row from supplier_parts is not the same as the row that is the cross of the second row from parts and the second row from supplier_parts even though they look the same because they have all the same values in the column. Well, for that matter the first and second rows of the parts table aren't the same row even though they share all the same values.

Suppose you reject the preceding paragraph as a terribly inelegant way to define uniqueness (and it is that, though it's the only way I know that works!). In doing so you restrict all your tables to contain only DISTINCT rows, or else they can not be interpreted as sets. You can force this by having a primary key in every table.

What happens? Now SELECT queries never produce duplicate results, just as set theory demands. Note that you have to SELECT * to see that this is true. Selecting a proper subset of the columns is only an output filter which can give the illusion that there are identical rows. It doesn't run splice (hey! I brought it back on topic by mentioning a Perl operator!) on the underlying vectors. But DISTINCT will help you filter those illusory duplicates from the output.

I'm not saying this is great, I'm just saying that's how you have to apply the set theory if you want it to work.

By the way, your last example, the one that results in the most outrageous number of rows, is wrong.

SELECT parts.pno
FROM   parts, supplier_parts
WHERE  ( supplier_parts.sno = 'S1' AND
         supplier_parts.pno = parts.pno )
   OR  parts.pname = 'Screw'
[download]

should be

SELECT parts.pno
FROM   parts, supplier_parts
WHERE  supplier_parts.pno = parts.pno AND (
         supplier_parts.sno = 'S1'  OR
         parts.pname = 'Screw'
       )
[download]

which gives a somewhat less outrageous 6 P1s and 1 P2.

Comment on They do respect set theory! Select or Download Code

Replies are listed 'Best First'.
Re: They do respect set theory! by demerphq (Chancellor) on Dec 11, 2005 at 09:23 UTC
Thank you. This is exactly the way I see it as well. I don't see this behaviour as a negative, in fact i see it as a positive. "Optimizing 'distinct' away" as ovid puts it is not done well because its not easy to do. And not making queries by default dupecheck means a temporary sort table or hash table doesn't have to be constructed, maintained and then thrown away for every query. The result set can be just piped back. For most well designed properly constrained databases there is no need for dupechecking, and when there is the need 'distinct' or 'group by' are waiting to handle it. BTW, IMO there are lots of reasons to bitch about SQL, but for me they are mostly syntax related. SQL was IMO not designed by a capable language designer, its syntax is horrible. (I'm not saying I could do better, but Im pretty sure there are folks out there that could.) --- $world=~s/war/peace/g	[reply]
Re: They do respect set theory! (and violate 3NF) by Ovid (Cardinal) on Dec 11, 2005 at 19:36 UTC
It might have the same values for all attributes as some other row of the table, but it's still unique. Another way to think about it is that there is a hidden "unique ID" column in every table (you need only add such a column if the table has no primary key). At first blush this seems like a compelling argument and I've struggled to find just the perfect example to explain the problem. I can't, so I'll go with an imperfect example. First, I want to paraphrase what you said so that if I have misunderstood you, you can correct me. Consider the "supplier_parts" table: `+------------------+ \| Supplier_Product \| +------------------+ \| SNO \| PNO \| +---------+--------+ \| S1 \| P1 \| \| S1 \| P1 \| \| S1 \| P2 \| +------------------+` [download] Your argument is that we can either infer the existence of a hidden "unique ID" or put an explicity primary key on the table to make the rows a set. You are right that this would turn the row into a set. However, it also violates 3NF. By substituting a clearer example, I can explain this better: `+-----------------+ \| City \| Country \| +-------+---------+ \| Paris \| France \| \| Paris \| France \| \| Paris \| USA \| +-----------------+` [download] What you're suggesting is that I could do this: `+----------------------+ \| ID \| City \| Country \| +----+-------+---------+ \| 01 \| Paris \| France \| \| 02 \| Paris \| France \| \| 03 \| Paris \| USA \| +----------------------+` [download] The bag of rows also constitute a set, but now, "Paris, France" is not dependent on the ID as required by 3NF. Assuming these truly represent the same city, are we referring to 01 or 02? Which one should other records point to? Merely adding a unique ID does appear to satisy the "set" constraint, but since it violates 3NF, we now have a denormalized database and cannot maintain data integrity. Cheers on catching my bad SQL, though :) Cheers, Ovid New address of my CGI Course.	[reply] [d/l] [select]
Re^2: They do respect set theory! (and violate 3NF: yeah GIGO) by demerphq (Chancellor) on Dec 12, 2005 at 08:20 UTC
Ovid this reasoning is flawed. The table with an identity column violates normalization no more than the original table did. All it does is give you a basis to understand the previous results. The DB at some level maintains those values anyway, all that adding the column does is make them explicit and available for reference. Redo your original queries to show the ID's and youll see the dupes in your output are due to dupes in your input. Incidentally adding the id doesnt violate normalization. What violates normalization is the two dupe rows in the table (by definition, any table that contains duplicate data in two columns across multiple rows is not fully normalized). Adding the identity column doesnt change that fact: Your claim that this means you cannot maintain data integrity is rubbish. On the contrary identity columns make it easier to maintain data integrity. Instead of useing "City/Country" as the key to the table, you use the ID as the key, and then put a unique index on City/Country, now your city/country data is guaranteed to be a set and the id's allow non ambiguous reference to a particular location object in the table. Assuming these truly represent the same city, are we referring to 01 or 02? You have no way of differentiating the two entries without the identity column. You have a way of differentiating the two entries with one. With the identity column references into that table are unambiguous, without the identity column references into that table are ambiguous. Without a constraint on duplicate values in the key columns in the table and without an identity column, the table cannot be unambiguously addressed for all data sets it contains. Ideally you should have both, a identity and a column level constraint prohibiting dupe values. End result: you are still using Garbage input to prove that the RDBMS is doing something wrong. Which is just silly. --- $world=~s/war/peace/g	[reply]
Re^3: They do respect set theory! (and violate 3NF: yeah GIGO) by Ovid (Cardinal) on Dec 12, 2005 at 19:07 UTC
Incidentally adding the id doesnt violate normalization. But I didn't make that claim. I did claim that the resulting table violated 3NF, but I didn't state that it was because of the addition of the ID. I wasn't sure why you thought I was making that claim, but when I reread, I saw that I wrote "[the table ] also violates 3NF." I think I should have written "[the table ] still violates 3NF". However, I've written a lot of responses in this thread and I guess this one could have been clearer. Sorry for the confusion. What violates normalization is the two dupe rows in the table (by definition, any table that contains duplicate data in two columns across multiple rows is not fully normalized). Agreed. Adding the identity column doesnt change that fact: Your claim that this means you cannot maintain data integrity is rubbish. "Rubbish?" I view that as a pretty strong word in this context. If I've offended you or made you mad, I'm not sure how but this certainly wasn't my intent. As for data integrity: I stated that we couldn't maintain data integrity in with a denormalized database, not with any database. Cheers, Ovid New address of my CGI Course.	[reply]
Re^4: They do respect set theory! (and violate 3NF: yeah GIGO) by demerphq (Chancellor) on Dec 12, 2005 at 21:32 UTC
Re: They do respect set theory! by fergal (Chaplain) on Dec 11, 2005 at 20:35 UTC
If you want to view it as a set of objects where each object has a life of its own and has slots that can be updated with new values etc, that's fine and that's very close to the implementation of modern RDBMSs but that is absolutely not the relational theory and that is Date's point. We are stuck with an interface that is derived from the implementation - which can be ugly and sometimes throws up some nasty surprises. We should have an interface that comes from the theory and let the DBMS figure out how to efficiently do the work. This is of course a difficult problem but so is writing a compiler for a high-level language. Current DBMSs are forcing us to program in "assembly language". Maybe a better analogy is that they make us use lots of GOTOs when they should provide us with WHILEs and FORs. A good example is the necessity of have ID columns all over your database. In Date's world, these would not exist - more precisely these would not be part of the interface, although they would almost certainly be part of any efficient implementation, you just wouldn't see them or ever have to even think about them. The relational theory is about the only theory of stored data that actually has a full mathematical theory backing it up. None of the "object relational" systems have a proper formal basis for example. As such it's the only one where we are guaranteed consistency and where we can look at 2 queries that looks like they're selecting the same things and say "yes they will both produce the same results no matter what data is in my database".	[reply]
Re: They do respect set theory! by Anonymous Monk on Feb 27, 2006 at 19:57 UTC
Another way to think about it is that there is a hidden "unique ID" column in every table (you need only add such a column if the table has no primary key). That directly violates the Information Principle...	[reply]