The easy bit first: you and I appear to be in complete agreement about NULLs.
Now for the hard bit:
Your queries were improperly designed. You're conflating two questions into one query, so you have a hacked-up query (on top of improperly-designed tables).
There are two issues with that statement. First, the tables were not improperly-designed. They had bad data in them. This is a very important distinction which goes to the heart of the matter. More on this in a moment.
As for the "improperly designed" queries, I'll stipulate that they are, but only because SQL is not relational. In reading the queries, someone new to SQL could easily conclude (we see this often enough) that those queries are logically equivalent. It's only SQL's poor support for the relational model which makes those query results different. In fact, Date goes on to show 12 different ways of writing that query (including yours) which return 9 different results but which would all return the same, correct result if operations on relations (tables, if you will) returned sets. In fact, one of the fundamental points of the relational model is that a relation (what we normally think of as a table) is an unordered set (in the mathematical sense) of tuples. Any operation on one or more relations must return another relation -- in other words, it must return a set. This includes selecting, inserting or updating information. If operations violate this, then the system implementing this operation is by definition not relational.
This gets back to the first point about the bad data in the tables. First, a set is a collection of unique items. There is no such thing as the set of integers (3,3,2). That should be the set of integers (3,2). Further, the union of (1,2,3) and (2,3,4) yields (1,2,3,4), not (1,2,2,3,3,4). SQL allows us to add duplicate tuples in tables. That's one of the reasons they're called "tables" and not "relations". Sure, I can put a unique constraint across all of the columns, but why should I? They should be unique in the first place. There's no reason those duplicates should exist and having SQL graciously allow us a constraint which should be there in the first place doesn't fix the problem (and I tell ya, it's real fun finding out that some maintenance programmer put a few thousand duplicate rows in a lookup table with no unique key to let me delete the extras).
So right off the bat, SQL allows us to violate the relational model when inserting records. Of course, one could argue that "that's just bad database design". Fine, let's stipulate that you won't enter duplicate rows.
Now, if you were to provide something in at least 3NF that demonstrates your point about the lack of relational theory, then I'll be glad to learn from you. But, don't start out with something that's not even 1NF, conclude that relational theory is completely ignored by RDBMS vendors, and expect to not be taken to task about it.
Again, the mere fact that SQL allows duplicate rows in tables means that SQL ignores the relational model and DBMS vendors are largely responsible for not providing us with tools which respect the model. However, if that's not enough to convince you ...
So a salesperson walks up to an accountant and says "Bob, for our customers with excellent credit ratings, what cities do they live in? I need to plan flights through those cities."
"What do you mean by 'excellent credit rating', Alice?"
"For the sake of argument, let's include everyone with a credit rating greater than 700."
"OK, let's see ... there's London. Oh, and there's London. And Paris. And London. London again, Athens, London, Paris, Moscow, London and Paris."
"That's a hell of a flight."
The answer Bob should have given Alice is "London, Paris, Moscow, and Athens". Repeating those city names over and over without reason doesn't make his assertion any more true. In fact, it almost makes the response non-sensical and if your accountant always answered questions that way, you'd think he's mental. But that's the sort of answer this query gives:
SELECT city
FROM cities, customers
WHERE customers.city_id = cities.city_id
AND customers.credit_rating > 700
That could easily be on a database in 3NF. There's nothing about the structure of the query which implies that the database is poorly designed.
Alice shouldn't have to tell Bob to give her "DISTINCT" cities because repeating the city names not only makes no sense, it helps to confuse the issue. SQL does not return sets by default. If it returns duplicates, it's not a set and thus the system isn't relational. Sets do not have duplicates. Now in the example above, if Alice is really stupid, she's going to have one hell of a long flight. Fortunately, Alice is not really stupid. Unfortunately, software is. If, instead of Alice, it's some software receiving the results of that query, things could go disastrously wrong.
If SQL followed relational theory, simple queries like this would return correct results. Instead, the SQL programmer has to remember that SQL will happily churn out garbage. The programmer is then put in the awkward position of having to either put "SELECT DISTINCT" everywhere or remember that "SELECT DISTINCT" can have horrible impacts on performance and thus should be used very selectively. If SQL was relational, the programmer would never have to worry about this.
What I'd like to hear, and I haven't, is a justification for the "not returning sets" behavior. If there any realistic justification one can give for queries to return duplicate information, I'd love to hear them. (The "but it's too slow otherwise" response is faulty because it confuses implementation with behavior.)
| [reply] [d/l] |
(Below, when I speak about an RDBMS, I am speaking about Oracle9i. Almost every statement will be applicable to the majors, but they are correct for Oracle.)
First, the tables were not improperly-designed. They had bad data in them. This is a very important distinction which goes to the heart of the matter.
A table with bad data in it is an improperly-designed table. A properly designed table cannot, by virtue of it being properly designed, have bad data. It might have incorrect data, but the data cannot be bad. I will address your further points in the order you presented them, but this is a critical understanding of RDBMSes that you don't seem to have. (It's rare to see it in someone who hasn't been a DBA, so don't feel bad.)
Another critical understanding that you're missing - the "relational" part in an RDBMS isn't for the table - it's for both the row and how rows in different tables relate. All the columns in the row are related to the primary key column(s) and columns in different tables relate rows together. The "set" part comes in that the keys are unique, not that the results of a query are unique. A table is a set of primary keys, along with the values that they index into. The Perl analogy would be a hash. The keys of a hash are a set, but the values are not. This is a good thing.
No, SQL is not relational. If it was, it would be called RQL for "Relational Query Language," not "Structured Query Language." SQL was designed for Business Analysts, not programmers. It's a 4GL - one of those "natural language programming languages" that were touted in the 80's. That it's a complete failure both as a programming language and as a tool to work with databases is only to have been expected.
... a relation (what we normally think of as a table) is an unordered set (in the mathematical sense) of tuples. ... SQL allows us to add duplicate tuples in tables. That's one of the reasons they're called "tables" and not "relations". Sure, I can put a unique constraint across all of the columns, but why should I?
You're conflating so many different concepts here that I almost don't know where to start.
- Most RDBMSes make a distinction between DML and DDL - the language one uses to query the tables (Data Manipulation Language), such as SQL, and the language one uses to define the tables (Data Definition Language). Neither of these are "relational." In fact, it doesn't make sense to talk about them as "relational." They can be used to interact with things that, if designed and built properly, are relational.
- A table in a database doesn't make any sense unless it has a primary key. In fact, that is why the First Normal Form (or 1NF) is "Every row must have a primary key." Without that, you're absolutely correct in that the table is a bag and not a set. With that, the table is a set that defines a relation between the primary key column(s) and the dependent column(s). You don't need to have a unique constraint across all the columns, just across the one(s) that make up the primary key.
- While you can think of the row as a tuple of all its coluns, it's more accurate to think of it as a function that, given a tuple for the primary column(s), returns a tuple for the dependent column(s). After all, a relation is nothing more than a function that maps items from one set to items in another set.
- You are correct in that SQL will allow you to attempt to do anything you want to a given table. It's up to the table's designer(s) to have set the appropriate restrictions upon the table.
- Tables are called tables and not relations because they aren't relations. The relation is within the row, not the table. (See above.)
As for Bob and Alice, there's a very good reason why SQL not only gives back that answer, but SHOULD give back that answer, and it comes down to the Principle of Least Surprise.* Let's say you have the query SELECT foo FROM my_table; That returns back 10 rows. Under the current system, that means that there's 10 rows in my_table and here are the values in the foo column for each one. Under your idea, that means there are 10 distinct foos in my_table, regardless of how many rows.
Now, what hapens when I say SELECT foo, bar FROM my_table;? Under the current system, this still returns 10 rows. Under your system, it could return any number of rows, so long each row it returns is a unique (foo,bar) combination. For people that don't understand anything about set theory, that's somewhat counter-intuitive.
Believe me - I do understand your consternation. The idea that an operation upon a set returns a collection is counter-intuitive to those who understand any set theory at all. But, I point back to my earlier point - the set isn't the rows in the table taken as a whole - it's the range of the function between the primary key and the dependent columns. As long as your query returns the primary key, it is a set. If your query doesn't, then it doesn't returns a set.
To elucidate, let's look very quickly at what a SELECT does, under the hood. The basic operation, as you've noted, is to take a set (the PK tuple), apply a set of restrictions on it (either the PK tuple itself or the values tuple), then return the requested values from all rows that satisfy those restrictions. As long as your table(s) being queried was a set to begin with, the results will also be a set. However, your view of the results may or may not be a set, depending on what you've chosen to see.
Let's take your specific example:
SELECT cities.name
FROM cities, customers
WHERE customers.city_id = cities.city_id
AND customers.credit_rating > 700
I'm going to rewrite as:
SELECT (cities.name)
FROM (customers JOIN cities USING (city_id))
WHERE (customers.credit_rating > 700)
Assuming that customers and cities are both sets, then we have four sets involved in that query.
- cities
- customers
- the resultset
- the fromset
Your resultset is a set, believe it or not. It's the set of all rows from the intermediate table "customersXcities" that satisfies the criterion of customers.credit_rating > 700. This intermediate table has the same PK of your customers table. Think about that for a second - you're not querying the cities table directly. You're querying an intermediate table. Each row in your resultset is a row from that intermediate table. Then, the SELECT clause applies a set of vertical black strips to the resultset, each on top of a column. You turn off the strip by requesting that column's values be given to you. The SELECT clause isn't a set operation.
*: Don't give me that shocked look - you had to have known that the PoLS would make an appearance in this discussion. If only so that I could tweak you about how your surprise was anticipated before you were born. :-)
My criteria for good software:
- Does it work?
- Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
| [reply] [d/l] [select] |
You know, I think you and I are largely in agreement on what many of the core ideas are yet we differ about what we think valid behavior should be, though we'll quibble in the margins.
A table with bad data in it is an improperly-designed table. A properly designed table cannot, by virtue of it being properly designed, have bad data.
I'll agree to stand corrected if and only if I can also require that there is a unique constraint across all columns. (Simply think that providing a primary key may satisfy that constraint but they miss that it may violate 3NF -- see my response to Celada for more discussion).
As for your comments about "relations", I can only refer you to Date's work. It would take me a loooong time to write out exactly what I am referring to when I say "relation". This term goes back to Codd's initial formulation of the relational model and is loosely analogous to what we refer to as a table.
For the moment, ignoring implementation in favor of behavior, if I'm asking for those cities, what's the logical reason for preferring a bag instead of a set? Can you show how this will consistently lead to more correct results?
(I've a sneaky feeling that neither of us is going to give much ground here :)
| [reply] |
Another critical understanding that you're missing - the "relational" part in an RDBMS isn't for the table - it's for both the row and how rows in different tables relate.
"Relational" comes from "A Relational Model of Data for Large Shared Data Banks" which says
The term relation is used here in its accepted mathematical sense. Given sets S1, S1, ···, Sn, (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from S1, its second element from S1, and so on. We shall refer to Sj as the jth domain of R. As defined above, R is said to have degree n. Relations of degree 1 are often called unary, degree 2 binary, degree 3 ternary, and degree n n-ary.
That is the origin of the term in connection with databases. A relation is not a row, it is a set of rows and although 2 relations may have columns that correspond as primary/foreign keys it is just a linguistic coincidence that you might call this corresponsdence a "relationship".
| [reply] |