in reply to Re^2: (OT) Why SQL Sucks (with a little Perl to fix it)
in thread (OT) Why SQL Sucks (with a little Perl to fix it)

(Below, when I speak about an RDBMS, I am speaking about Oracle9i. Almost every statement will be applicable to the majors, but they are correct for Oracle.)

First, the tables were not improperly-designed. They had bad data in them. This is a very important distinction which goes to the heart of the matter.

A table with bad data in it is an improperly-designed table. A properly designed table cannot, by virtue of it being properly designed, have bad data. It might have incorrect data, but the data cannot be bad. I will address your further points in the order you presented them, but this is a critical understanding of RDBMSes that you don't seem to have. (It's rare to see it in someone who hasn't been a DBA, so don't feel bad.)

Another critical understanding that you're missing - the "relational" part in an RDBMS isn't for the table - it's for both the row and how rows in different tables relate. All the columns in the row are related to the primary key column(s) and columns in different tables relate rows together. The "set" part comes in that the keys are unique, not that the results of a query are unique. A table is a set of primary keys, along with the values that they index into. The Perl analogy would be a hash. The keys of a hash are a set, but the values are not. This is a good thing.

No, SQL is not relational. If it was, it would be called RQL for "Relational Query Language," not "Structured Query Language." SQL was designed for Business Analysts, not programmers. It's a 4GL - one of those "natural language programming languages" that were touted in the 80's. That it's a complete failure both as a programming language and as a tool to work with databases is only to have been expected.

... a relation (what we normally think of as a table) is an unordered set (in the mathematical sense) of tuples. ... SQL allows us to add duplicate tuples in tables. That's one of the reasons they're called "tables" and not "relations". Sure, I can put a unique constraint across all of the columns, but why should I?

You're conflating so many different concepts here that I almost don't know where to start.

As for Bob and Alice, there's a very good reason why SQL not only gives back that answer, but SHOULD give back that answer, and it comes down to the Principle of Least Surprise.* Let's say you have the query SELECT foo FROM my_table; That returns back 10 rows. Under the current system, that means that there's 10 rows in my_table and here are the values in the foo column for each one. Under your idea, that means there are 10 distinct foos in my_table, regardless of how many rows.

Now, what hapens when I say SELECT foo, bar FROM my_table;? Under the current system, this still returns 10 rows. Under your system, it could return any number of rows, so long each row it returns is a unique (foo,bar) combination. For people that don't understand anything about set theory, that's somewhat counter-intuitive.

Believe me - I do understand your consternation. The idea that an operation upon a set returns a collection is counter-intuitive to those who understand any set theory at all. But, I point back to my earlier point - the set isn't the rows in the table taken as a whole - it's the range of the function between the primary key and the dependent columns. As long as your query returns the primary key, it is a set. If your query doesn't, then it doesn't returns a set.

To elucidate, let's look very quickly at what a SELECT does, under the hood. The basic operation, as you've noted, is to take a set (the PK tuple), apply a set of restrictions on it (either the PK tuple itself or the values tuple), then return the requested values from all rows that satisfy those restrictions. As long as your table(s) being queried was a set to begin with, the results will also be a set. However, your view of the results may or may not be a set, depending on what you've chosen to see.

Let's take your specific example:

SELECT cities.name FROM cities, customers WHERE customers.city_id = cities.city_id AND customers.credit_rating > 700
I'm going to rewrite as:
SELECT (cities.name) FROM (customers JOIN cities USING (city_id)) WHERE (customers.credit_rating > 700)
Assuming that customers and cities are both sets, then we have four sets involved in that query.
  1. cities
  2. customers
  3. the resultset
  4. the fromset
Your resultset is a set, believe it or not. It's the set of all rows from the intermediate table "customersXcities" that satisfies the criterion of customers.credit_rating > 700. This intermediate table has the same PK of your customers table. Think about that for a second - you're not querying the cities table directly. You're querying an intermediate table. Each row in your resultset is a row from that intermediate table. Then, the SELECT clause applies a set of vertical black strips to the resultset, each on top of a column. You turn off the strip by requesting that column's values be given to you. The SELECT clause isn't a set operation.

*: Don't give me that shocked look - you had to have known that the PoLS would make an appearance in this discussion. If only so that I could tweak you about how your surprise was anticipated before you were born. :-)


My criteria for good software:
  1. Does it work?
  2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

Replies are listed 'Best First'.
Re^4: (OT) Why SQL Sucks (with a little Perl to fix it)
by Ovid (Cardinal) on Dec 11, 2005 at 20:06 UTC

    You know, I think you and I are largely in agreement on what many of the core ideas are yet we differ about what we think valid behavior should be, though we'll quibble in the margins.

    A table with bad data in it is an improperly-designed table. A properly designed table cannot, by virtue of it being properly designed, have bad data.

    I'll agree to stand corrected if and only if I can also require that there is a unique constraint across all columns. (Simply think that providing a primary key may satisfy that constraint but they miss that it may violate 3NF -- see my response to Celada for more discussion).

    As for your comments about "relations", I can only refer you to Date's work. It would take me a loooong time to write out exactly what I am referring to when I say "relation". This term goes back to Codd's initial formulation of the relational model and is loosely analogous to what we refer to as a table.

    For the moment, ignoring implementation in favor of behavior, if I'm asking for those cities, what's the logical reason for preferring a bag instead of a set? Can you show how this will consistently lead to more correct results?

    (I've a sneaky feeling that neither of us is going to give much ground here :)

    Cheers,
    Ovid

    New address of my CGI Course.

      if I'm asking for those cities, what's the logical reason for preferring a bag instead of a set?

      Take a simple table with columns Name(Primary key (ignoring the possibility of two John Smiths for now) and age. You do select Age from table Table;. Your purpose is to calculate the average age. If the results set returned did not include all the ages, including duplicates, your calculation would be wrong.

      The tuples in the tabke are unique, but you aren't requesting tuples, you are requesting a substring of each tuple.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        That's not a general rule for preferring bags to sets, it's an exception to the rule.

        In any event, I think it's also a mistake. You want two aggregate functions, SUM and COUNT. If you're iterating over duplicate results like that, you've made a mistake. Ignoring that most (all?) SQL implementations support an AVG operator, you could always do this:

        SELECT SUM(age) / COUNT(age) AS average_age FROM table

        So with all due respect, even your special case doesn't seem to me to be applicable.

        Of course, that's brings up a related issue:

        SELECT SUM(customer_id) / COUNT(customer_id) FROM customers

        That's clearly nonsense but it's quite legal. If SQL followed the relational model it wouldn't even compile.

        Cheers,
        Ovid

        New address of my CGI Course.

      Simply think that providing a primary key may satisfy that constraint but they miss that it may violate 3NF

      Yes, because a PK is 1NF, not 3NF. There's two more normalizations required before you get to 3NF. :-)

      For the moment, ignoring implementation in favor of behavior, if I'm asking for those cities, what's the logical reason for preferring a bag instead of a set? Can you show how this will consistently lead to more correct results?

      Let's start with one fact - the SELECT clause is handed a set of tuples and it applies a collection of vertical slices upon said set. So, your question can be rephrased as "What's the reason for not applying set semantics (SELECT DISTINCT) to said slice(s)?"

      There are two reasons:

      1. The number of rows in the resultset (which is a set if all the initial tables are sets) shouldn't change depending on what vertical elements you want to see exposed. Think about an Excel spreadsheet. If you hide column D, does the number of rows change? That's all you're doing in a SELECT clause.
      2. The performance hit for applying a DISTINCT is quite high. Try doing a relatively complex query both as SELECT (returning duplicates, as you see them) and SELECT DISTINCT (removing what you see as duplicates) and time them. Now, imagine that on every single query you will ever run in your life. And, no, there aren't that many more performance gains to be made.

      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      I'll agree to stand corrected if and only if I can also require that there is a unique constraint across all columns.

      I don't think that's what you want. Every row (tuple) has to be distinct from every other row. Having a single primary key column makes that so. The problem comes in when you try to query that table. Given this table:

       id |  city  | country 
      ----+--------+---------
        1 | Paris  | France
        2 | London | England
        3 | Paris  | Texas
        4 | Paris  | France
      

      The presence of the primary key (id) makes it a proper relation. However, when we try to query it like this:

       select city, country from test;
        city  | country 
      --------+---------
       Paris  | France
       London | England
       Paris  | Texas
       Paris  | France
      

      That is not a relation, because it has a duplicate in it. The problem isn't that the table isn't a relation, but that the result of our SQL query isn't a relation. So what is it? This gets to dragonchild's other point, to which you said:

      I'm asking for those cities, what's the logical reason for preferring a bag instead of a set? Can you show how this will consistently lead to more correct results?

      As I said the result of the above query is not a relation, but a report on the contents of the columns I queried, with the number of rows replicated. It makes sense in that it reports how many rows they are, and SQL is a row-oriented reporting language. So it is consistently more correct if my question is "What are the contents of the city and country colunns in all rows?" Of course, that's not usually a very useful question, but it is at least consistent with how SQL works.

      So that's how it makes sense. And as you and I agree, that makes it unrelational. It's a report on rows in a table, not a set created by querying a relation.

      —Theory

        Go back to that post and click the "response to Celada" link I had. I made it clear that the presense of a unique key is not enough. The duplicates are an issue in that case because the table is not in 3NF. If the table is is 3NF, then that unique constraint across all keys holds.

        Cheers,
        Ovid

        New address of my CGI Course.

      I'll agree to stand corrected if and only if I can also require that there is a unique constraint across all columns.

      Unique constraints like the sort you mean are normally handled through the creation of unique indexes on the columns involved. A properly designed table should forbid the insertion of bad data, either through column level constraints, row level constraints, unique indexes or triggers.

      Simply think that providing a primary key may satisfy that constraint but they miss that it may violate 3NF

      The tables violated normalization prior to adding the index, all the index does is allow you to deal with the fact, something that you can't do very well without it. If you could explain why that table makes more sense without the id column than with it then you might have a case, but I dont think you can.

      ---
      $world=~s/war/peace/g

Re^4: (OT) Why SQL Sucks (with a little Perl to fix it)
by fergal (Chaplain) on Dec 11, 2005 at 20:51 UTC
    Another critical understanding that you're missing - the "relational" part in an RDBMS isn't for the table - it's for both the row and how rows in different tables relate.

    "Relational" comes from "A Relational Model of Data for Large Shared Data Banks" which says

    The term relation is used here in its accepted mathematical sense. Given sets S1, S1, ···, Sn, (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from S1, its second element from S1, and so on. We shall refer to Sj as the jth domain of R. As defined above, R is said to have degree n. Relations of degree 1 are often called unary, degree 2 binary, degree 3 ternary, and degree n n-ary.

    That is the origin of the term in connection with databases. A relation is not a row, it is a set of rows and although 2 relations may have columns that correspond as primary/foreign keys it is just a linguistic coincidence that you might call this corresponsdence a "relationship".