Re^3: Trees in SQL, adjacency lists and sorting.

Data structures exist to organize data in ways that are useful for what we're going to do with it. For example, if all you're doing is storing a HoHoHoHoH that you want to walk but never search, then a database is an extremely poor solution and something like DBM::Deep is preferable. If you have a bunch of items that are on the same level and you want to do random-access searches based on arbitrary criteria, then a relational database is much better than DBM::Deep.

How does this relate to trees? You have a perfectly fine way to store a tree in a database . . . depending on what you're trying to do with it. You're using a variant of the self-referential method which stores every single node-to-node relationship, not just the immediate ones. It looks like it would be extremely fast at finding all descendents and all ancestors of a given node, doing each one in one query. However, it will be no better than the standard self-referential method in finding all siblings of a given node.

It sounds like you really want to use the nested-sets method. It's where you don't store a reference to your parent node, but instead store a left and right bound which indicates where in the nested sets you stand. To find all your ancestors, you find all nodes where "my_id BETWEEN left AND right". To find all your descendents, you find all nodes where "id BETWEEN my_left AND my_right". It's no better than the standard self-referential in finding all siblings.

Both methods (yours and self-referential), it sounds, have near-equal benefits when it comes to speeding up searches. Yours is slightly better when doing inserts and updates at the cost of significantly more disk, but worse when doing deletes. That may be an acceptable tradeoff depending on your intended usage. For example, if your spec says you will never delete a node except in rare situations, this is probably a good direction to look at.

As for the every single depth bit, if I delete a node I can update all its children in a single query.

I don't think so.

Let's say you have the tree:

       A
      /
     B
    /
   C
  /
 D
[download]

This translates to the following table:

 id | ancestor | depth
 A  |  NULL    |   0
 B  |  A       |   1
 C  |  A       |   2
 C  |  B       |   1
 D  |  A       |   3
 D  |  B       |   2
 D  |  C       |   1
[download]

Let's say you remove B. That means you have to do the following:

Remove the entry where id == B
Decrement the depth for all nodes which have an ancestor relationship for B, but only for those rows which have a depth greater than the distance between the node and B.
Remove all entries where ancestor == B

Otherwise, you'll have an invalid depth in your table between some descendent of B and at least one of its ancestors.

Personally, I'm not sold on the depth thing. I think that storing every node-to-node relationship is definitely a solid strategy. I don't know what having the depth gets you. Personally, I'd use this as an adjunct table to the standard self-referential. Yes, you'd have duplicated information between the main table and the relationships table (the direct parent-child relationship would be duplicated), but if that's the optimization I need, then that's fine because the performance cost of the loss of normalization is made up by the speed gains I get with the caching.

My criteria for good software:

Does it work?
Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

Comment on Re^3: Trees in SQL, adjacency lists and sorting. Select or Download Code