in reply to Trees in SQL, adjacency lists and sorting.

This is, yet again, another node displaying a misunderstanding of relational theory. These things are sets, which are inherently unordered. Ordering is provided as an additional feature, like triggers. It's not part of the relational theory and, in fact, violates relational theory. That said ...
  1. I'm assuming you're going to have a row that says "A | NULL | 0". You need to represent the root somewhere. Implicit representation is not going to cut it.
  2. This is going to be an absolute nightmare to update. For example, if you remove a node and promote one of the children (a common operation in red-black trees, for example), You're going to have to recalculate in your application every single depth. Furthermore, if your application code is wrong in any way, the database isn't going to help you catch your error, particularly in the depth area.
  3. Trees are ordered with respect to siblings. In your case, you're assuming that B comes before C and D comes before E. How are you planning on capturing that in your schema? You're storing depth (which is easily derived and doesn't really help improving select speed), but you're not storing sibling sort order (which isn't derivable because you cannot depend on the order the RDBMS will store the rows).
  4. Assuming you have all of that, let's see what your sort would look like in Perl:
    my @nodes = map { $_->[0] } sort { # What goes here?? } map { [ $_->{node_id}, $_->{depth}, $_->{sibling_order} ] } $sth->fetchall_hashref();

    If you wanted a level-order traversal, you'd use:

    $a->{depth} <=> $b->{depth} || $a->{sibling_order} <=> $b->{sibling_order}
    Except, that gives you A-B-C-D-E-F-G, which isn't what you want.

    Take a look at Tree, particularly the traversal code I wrote in order to provide traversals as an iterator into the tree versus an array (which is what you're trying to do in an ORDER BY clause). The code is non-trivial, particularly in that it requires maintaining a stack to keep track of what comes next.

In short, you cannot do it with ANSI SQL-1992, SQL-1999, or even SQL-2003. You would have to use a vendor-specific extension that allows user-defined comparators for sorting. I recommend PostgreSQL as MySQL, Oracle, Sybase, and SQL*Server all don't provide this. SQLite definitely doesn't provide this.


My criteria for good software:
  1. Does it work?
  2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

Replies are listed 'Best First'.
Re^2: Trees in SQL, adjacency lists and sorting.
by BUU (Prior) on Jan 06, 2006 at 04:17 UTC
    So what's the Right Way to store a tree? You seem to have fairly strong opinions on the matter and I'm always willing to learn. As for the every single depth bit, if I delete a node I can update all its children in a single query.
      Data structures exist to organize data in ways that are useful for what we're going to do with it. For example, if all you're doing is storing a HoHoHoHoH that you want to walk but never search, then a database is an extremely poor solution and something like DBM::Deep is preferable. If you have a bunch of items that are on the same level and you want to do random-access searches based on arbitrary criteria, then a relational database is much better than DBM::Deep.

      How does this relate to trees? You have a perfectly fine way to store a tree in a database . . . depending on what you're trying to do with it. You're using a variant of the self-referential method which stores every single node-to-node relationship, not just the immediate ones. It looks like it would be extremely fast at finding all descendents and all ancestors of a given node, doing each one in one query. However, it will be no better than the standard self-referential method in finding all siblings of a given node.

      It sounds like you really want to use the nested-sets method. It's where you don't store a reference to your parent node, but instead store a left and right bound which indicates where in the nested sets you stand. To find all your ancestors, you find all nodes where "my_id BETWEEN left AND right". To find all your descendents, you find all nodes where "id BETWEEN my_left AND my_right". It's no better than the standard self-referential in finding all siblings.

      Both methods (yours and self-referential), it sounds, have near-equal benefits when it comes to speeding up searches. Yours is slightly better when doing inserts and updates at the cost of significantly more disk, but worse when doing deletes. That may be an acceptable tradeoff depending on your intended usage. For example, if your spec says you will never delete a node except in rare situations, this is probably a good direction to look at.

      As for the every single depth bit, if I delete a node I can update all its children in a single query.

      I don't think so.

      Personally, I'm not sold on the depth thing. I think that storing every node-to-node relationship is definitely a solid strategy. I don't know what having the depth gets you. Personally, I'd use this as an adjunct table to the standard self-referential. Yes, you'd have duplicated information between the main table and the relationships table (the direct parent-child relationship would be duplicated), but if that's the optimization I need, then that's fine because the performance cost of the loss of normalization is made up by the speed gains I get with the caching.


      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?