Re^2: tree structure of database

I am more than a little puzzled by Joe Celko's advocacy here of what he calls the "nested set model". He appears to be advocating a table that stores visit order instead of parent node ids, like this:

OrgChart
emp         lft rgt
======================
'Albert'      1   12
'Bert'        2    3
'Chuck'       4   11
'Donna'       5    6
'Eddie'       7    8
'Fred'        9   10
[download]

In the table above, the "lft" column holds the visit order of the node when descending through the tree. The "rgt" column holds the visit order when ascending. His argument is that it makes queries for "all children" and "all parents" of a node much easier.

I can see some argument for it in a read-only database or data mining software, but for modifiable data it is a disaster. To begin with it is severely denormalized. The value in the "lft" is dependent not only on the key, but on every node before it in the visit order. The value in the "rgt" column" is dependent on every node before it and on every child, grand-child, and great-great-great-grandchild node. Together each record is dependent on nearly every other record in the table. This means that it is impossible to insert, delete, or move a single record in the table without editing nearly the entire table as whole, especially if the node is at the beginning of the visit order. If you insert a record, the "lft" and "rgt" column of every record in the visit order after it must be updated. For every parent, the "rgt" node must be updated as well.

To implement this requires strong transaction support. But even with strong transaction support, editing an entire table for each insertion is a bad idea. For the brief period while the edits are being made the transaction will need an exclusive lock on the table as a whole. Users who only need to read the data or need to insert or delete at another point in the graph will be shut out. This is unacceptable in a high traffic database. For modifiable tables, this implementation is simply not scalable.

The beauty of the traditional foreign key model is that each insertion of a node affects only the node in question. Insertions are safe actions even in a DBMS with weak transaction support. Each repositioning of a child (change in parent node) similarly affects only one record and can be carried out safely even in a database without transaction support.

When tree relationships are represented by storing each node's parent (instead of a pair of lft-rgt columns), even deletions affect only a small subset of nodes, not the entire table. When a node is deleted one of two things happen: all of the descendants will be deleted with the node (known as cascading delete) or the immediate children will be assigned to a new node. One can think of this as either (a) the boss and his entire department being axed or (b) the boss being fired and his team assigned to a new manager. In either case, only a subset of rows (employees) is involved, not the whole table (company). If the DBMS supports row level locking, then editors and readers of different branches of the tree will be able to go about their business, just as employees in other divisions can at least try to ignore the firing going on in the cubicles down the hall and get on with work.

Even in a read-only database, storing tree relationships as lft-rgt values rather than as a parent foreign key can be problematic. "lft"-"rgt" columns are efficient at selecting "all children of" or "all parents of", but they make it very difficult to select "all siblings of" or "all immediate children of". There is no one value one can use to find all of these nodes. Each child only stores the sequence number on ascent and descent and this is different for each child. You can select "all children of X" easily because you can query for all nodes with a "lft" value greater than X and a "rgt" value less than X, but you can't easily tell how deep you are in the tree, i.e. whether you have an immediate child, grandchild, or great-great-great-grandchild.

Even when one does want "all children of", this encoding is problematic. To construct a visit order we have to make an assumption about the order in which siblings are visited. If you sort rows based on the value of "lft" or "rght", you will get that assumed ordering of siblings. If you want to change the ordering, you will need to track down all of the nodes with a common parent, resort them, and renumber them.

Hard-coding sorting logic violates one of the main design benefits that motivated the initial development and popularization of the relational model: the ability to provide a measure of future-proofing. Data tends to stay with an organization, but the ways they use it and want to view it change dramatically over time. This includes the ways in which an organization wishes to sort it.

There are, of course, use cases where speed is much more important than either future proofing or flexibility. In those cases I see nothing wrong with hard coding specific sort orders (data mining of archival data being one common example). But in a live system that needs to support either ad-hoc reporting or real-time editing of a tree, "lft"-"rgt" columns instead of parent foreign keys seem to me a really bad idea.

Best, beth

Comment on Re^2: tree structure of database Download Code