OO Identity Crisis

tlm has asked for the wisdom of the Perl Monks concerning the following question:

This is an "OO Design 101" question. I will cast it in terms of an example using trees, but I could have just as well used linked lists, graphs, etc.

Suppose that I want to design a class Tree to represent (you guessed it) trees, with the additional requirement that the tree must not contain "duplicate nodes" (the reason for the scare-quotes here will become apparent soon). Suppose also that the interface for this class includes a method add_node, whose job is to add a new node to the set N of the tree's nodes.

sub add_node {
  my $self = shift;
  my $node = shift;
  return if is_duplicate( $node ); # do nothing for dups
  # proceed to add node to the set of nodes for this tree
}
[download]

But how can is_duplicate() tell whether a new node is the duplicate of an already added node? Strictly speaking, it can't without some help from the client code, because each application has an idiosyncratic definition of equality.

For example, suppose the tree class is being used to represent processes running in a Unix system, and the tree class has to decide whether two Process objects $p0 and $p1 are the same. It won't do to compare the stringified versions of these objects (i.e. "$p0" eq "$p1") because the objects could be different by this criterion and yet be logically equivalent (e.g. though they are different objects all the fields that together uniquely characterize these objects as Unix processes, such as the start time, PID, PPID, etc., have identically matching values).

OK, to recap, we want to design this Tree class, and in particular, its add_node method to prevent adding "duplicate nodes."

One draconian possibility is this: require client classes to define an id method, returning an "application-unique" string identifying the object (e.g. the PID for the process). Then add_node would could be written like this:

sub add_node {
  my $self = shift;
  my $node = shift;
  return $self->{ N }{ $node->id } ||= $node;
}
[download]

The obvious shortcoming of this approach is the fact that it requires client classes to modify their API. A very unreasonable demand, especially considering that there is no consensus on what the name of this method ought to be. A different data structure class, say Digraph, written by someone else, may use ID or hash_code for the same purpose.

Another possibility would to define a wrapper class called Node, whose interface would consist (minimally) of one constructor and the methods id and object (or payload or content). The constructor for Node could include slots to specify both the payload object and some unique ID. It would be up to the client code to ensure that two objects receive the same ID if and only if they represent the same entity (e.g. the same Unix process).

With this scheme, add_node would be defined as above, and client code would invoke it like this:

$tree->add_node( Node->new( $some_random_object, $id ) );
[download]

A third alternative would be to add one configuration parameter to the Tree class. This optional parameter would be a code ref that the Tree class would apply to client objects to determine their unique ID. It would default to plain stringification:

sub { my $node = shift; return "$node" }
[download]

Then add_node would be defined like this:

sub add_node {
  my $self = shift;
  my $node = shift;
  my $id = get_id( $node );
  return $self->{ N }{ $id } ||= $node;
}

sub get_id {
  my $self = shift;
  my $node = shift;
  my $id_fxn = $self->id_fxn;
  return $id_fxn( $node );
}
[download]

Or, for greater flexibility, maybe the last two approaches could be combined:

sub get_id {
  my $self = shift;
  my $node = shift;  # $node->isa( 'Node' )
  return
    ref $node && $node->can( 'id' ) ? $node->id
                                    : $self->id_fxn->( $node );
}
[download]

...though this may be a case of misguided featurism.

I'd be interested in reading your opinions on these approaches, and/or suggestions for solving this design problem.

P.S. BTW, I realize that some of the solutions above could be implemented by overloading "", but, even though I understand the mechanics of overloading, I've been bitten enough by overloading-related bugs that I must conclude I am not yet at the level that I can use this tool safely in production code.

the lowliest monk

Comment on OO Identity Crisis Select or Download Code

Replies are listed 'Best First'.
Re: OO Identity Crisis by biosysadmin (Deacon) on Jun 20, 2005 at 14:27 UTC
In my Java classes at school this question came up when dealing with Java collections. Very often a data structure will need to enforce some sort of uniqueness, but there's no way for the data structure designer to know how to compare two objects. The way Java deals with this is to make the client class provide a comparison method. So, in your example, the Node class should be able to do the following: `if ( $node1->isEqual( $node2 ) ) { print "The two nodes are the same!\n"; }` [download] That's a simple and easy way to prevent duplicates without making your Tree class do more than it should.	[reply] [d/l]
Re: OO Identity Crisis by Transient (Hermit) on Jun 20, 2005 at 14:20 UTC
Java solves this with the Object class, which all objects derive from. It contains the "equals" method which returns the address of the object (IIRC) by default and can be overloaded by the child. While this is great for Java, obviously there's no such thing in Perl. However, you may want to just pick a method name and stick with it (a la "equals") and wrap it in an eval. If it bombs, just compare the addresses by default.	[reply]
Re^2: OO Identity Crisis by artist (Parson) on Jun 20, 2005 at 21:04 UTC
The 'equals' method also requires to compare the given object with each existing objects. Isn't it the case? --Artist	[reply]
Re^3: OO Identity Crisis by kaif (Friar) on Jun 21, 2005 at 06:44 UTC
Sort of. If we're using tlm's approach of a hashtable anyway (see `return $self->{ N }{ $id } \|\|= $node;`), then --- as far as I know --- Java's Hashtable combines the Object class's overridable methods equals and hashCode to do what we want. Basically, it avoids comparing the given object with each existing object by comparing it only to those with the same hashcode; but if you have a poor hashcode function (like `return 0;`), it will compare with everything.	[reply] [d/l] [select]
Re^2: OO Identity Crisis by BUU (Prior) on Jun 20, 2005 at 17:29 UTC
What do you mean "obviously there's no such thing in Perl."? What about the UNIVERSAL class, which every class implicitly inherits from?	[reply]
Re^3: OO Identity Crisis by Transient (Hermit) on Jun 20, 2005 at 17:35 UTC
You forgot the context: Java solves this with the Object class, which all objects derive from. It contains the "equals" method which returns the address of the object (IIRC) by default and can be overloaded by the child. There's no standard definition for "equals" which is what I was referring to. Yes, you are correct, though, an equals method could be introduced into UNIVERSAL and simply overridden wherever you like, but it's not going to be something you can rely upon unless they're your own objects.	[reply]
Re: OO Identity Crisis by herveus (Prior) on Jun 20, 2005 at 15:34 UTC
Howdy! Another way to look at the problem is "how do I find a given node in the collection?" If the items in the collection are unblessed scalars, you can compare them directly. The more "interesting" case is where the items are blessed scalars (or "objects"). The default comparison is "==". If they come up with the same address, they are the same thing. If ref(x) ne ref(y), the items are probably not equal, but your situation may permit objects of different classes to still come up "equal". Any deeper comparison must (as you note) be a method on the object. It's already been noted that Java provides an "equals" method in the uber-superclass "Object", so you can always invoke .equals on an object. Something along those lines for your node class(es) would be indicated. Pick an appropriate name for the method and go forth. is_duplicate could be a wrapper around a find() method that returns the given object if it is in the collection, or some other useful value. find() could be responsible for iterating over the collection in an appropriate manner, applying the comparison function. yours, Michael	[reply]
Re^2: OO Identity Crisis by snoopy (Curate) on Jun 21, 2005 at 07:00 UTC
>> If the items in the collection are unblessed scalars, you can compare them >> directly. The more "interesting" case is where the items are blessed scalars >> (or "objects"). The default comparison is "==". Class::DBI lets you do a simple "eq" test for sameness of database objects. It uses overload to return an encoded key value in a string context, so that `$a eq $b` works anywhere that primary keys have been declared and both are from the same table. Update: This overloading leads to a trivial Schwartzian Transform: `@t = map { [$_, "$_"] } @u`, potentially making any sorting or duplicate checking lightening fast!!	[reply] [d/l] [select]
Re: OO Identity Crisis by Animator (Hermit) on Jun 20, 2005 at 15:24 UTC
If I would have to implement it then I think I would allow these things: (I leave it up to you to decide wheter they are good or bad) First one, is the one that is suggested: see if the nodes define a compare/equal method of some sort. (check it with the ISA->can method) Second one, allow a coderef (anonyoums or not, that's irrelevant) to be passed to the new method (when you're creating the object). (You could also allow a (different) coderef to be passed to the add_node routine) Third one, when neither the first nor the second is available use '==' or 'eq' or both. (Perhaps there is someone out there that already created a Node class, and has overridden ==/eq and wants to use your Tree class.)	[reply]
Re: OO Identity Crisis by ikegami (Patriarch) on Jun 20, 2005 at 15:25 UTC
I find if odd that the module's caller must create the node. Do you create a hash node before putting data into a hash? No, the node (containing the key and the value) is created by the hash itself. A similar approach could be taken here. It's not like Node would ever need to be subclassed.	[reply]
Re: OO Identity Crisis by kscaldef (Pilgrim) on Jun 22, 2005 at 03:43 UTC
My opinion is that the cleanest way of dealing with this is (more-or-less) your suggestion to let a Tree instance have an optional equality function attribute.	[reply]