How to create a Tree::DAG_Node tree with sorted hash elements

benny_vroom has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, I want to sort a list of ACL and group them in to sets and subsets. I decided to go with creating a Tree and populate the rules as leaves of the tree. If a rule is superseded by another rule, it will be added as a daughter node to the latter. If a rule doesn't fit in to any other rule, it will become daughter of the root. By definition, the ACL "permit ip any any" will become the root. I have a rudimentary code (with 3 statically assigned leaf nodes). It walks down the root node and remembers every location where a leaf can fit. Once it reaches a location where the leaf doesn't fit, it inserts the leaf in the last remembered location. Below is the code:

use Tree::DAG_Node;
use Data::Dumper;
use Net::IP;
# compares two Tree::DAG Nodes and returns positive if right (second) 
+argument is bigger
sub compare_nodes {
    my ($compare_node, $leaf_node) = @_;
    my $source_OK = compare_subnet($compare_node->{'attributes'}{'sour
+ce'},$leaf_node->{'attributes'}{'source'});
    my $destination_OK = compare_subnet($compare_node->{'attributes'}{
+'destination'},$leaf_node->{'attributes'}{'destination'});
    print "$compare_node->{'attributes'}{'original'} is a daughter of 
+$leaf_node->{'attributes'}{'original'}\n\n" if ($source_OK && $destin
+ation_OK);
    return 1 if ($source_OK && $destination_OK);
    return undef;
}

# create nodes for tree construction later on
my $root = Tree::DAG_Node->new;
$root->attributes ({
        original => 'permit ip any any',
        source => ['0.0.0.0/0'], 
        destination => ['0.0.0.0/0']
    });

my $leaf_1 = Tree::DAG_Node->new;
$leaf_1->attributes ({
        original => 'permit ip 10.0.0.0/8 172.16.0.0/16',
        source => ['10.0.0.0/8'], 
        destination => ['172.16.0.0/16']
    });

my $leaf_2 = Tree::DAG_Node->new;
$leaf_2->attributes ({
        original => 'permit ip 10.0.0.0/16 172.16.0.0/24',
        source => ['10.0.0.0/16'], 
        destination => ['172.16.0.0/24']
    });

my $leaf_3 = Tree::DAG_Node->new;
$leaf_3->attributes ({
        original => 'permit ip 192.168.0.0/16 12.16.0.0/24',
        source => ['192.168.0.0/16'], 
        destination => ['12.16.0.0/24']
    });

my @node_list;
push @node_list, $leaf_1, $leaf_2, $leaf_3 ;

# Subnet comparison
# returns true if $x is a subset of $y
sub compare_subnet {
    my ($x, $y) = @_;
    a_in_b ($_, @$y) or return '' foreach @$x;        #Check if each e
+lement of 
    return 1;
};


# check one ip address against array of ip addresses
# first argument is single address, second one is array
sub a_in_b  {
  my $node1 = shift(@_);
  my @ip_list = @_;
  for my $node2 (@ip_list) {
    my $ip1 = new Net::IP ($node1) || die (Net::IP::Error());
    my $ip2 = new Net::IP ($node2) || die (Net::IP::Error());
    if (($ip1->overlaps($ip2) == $IP_A_IN_B_OVERLAP) || ($ip1->overlap
+s($ip2) ==  $IP_IDENTICAL))  {
      return 1;
    }
    
  }
  return '';
}

#build the tree
foreach my $node (@node_list){
    my $location;
    $root->walk_down({
            callback => sub {
                my $this_node = shift;
                if (compare_nodes($node,$this_node)){
                    $location = $this_node->address;
                    1;
                }
                else{
                    0;
                }
            }

        });
    $root->address($location)->add_daughter($node) if defined $root->a
+ddress($location);
}

# print the tree
$root->walk_down({
    callback => sub {
        my $node = shift;
            print "  " x $_[0]->{_depth};
            print $node->{'attributes'}{'original'}, "\n";
        },
    _depth => 0,
});

$root->delete_tree();
[download]

I basically want an output such as this:

permit ip any any
  permit ip 10.0.0.0/8 172.16.0.0/16
    permit ip 10.0.0.0/16 172.16.0.0/24
  permit ip 192.168.0.0/16 12.16.0.0/24
[download]

Obviously this approach is wrong for several reasons. For e.g, it expects the array entries to be sorted. If a smaller ACL enters the tree before a bigger ACL, when the bigger one's turn arrives, it is not inserted between the parent and the leaf, instead it is entered as a sister node.

For e.g, if I have elements such as:

root > A > B > C
[download]

And they are stored in the array in the below order:

[A,C,B]
[download]

I would get the result as:

root
 A
  C
  B
[download]

instead of:

root
 A
  B
   C
[download]

I have run out of ideas to see how to code this. I understand that the input needs to be sorted in some way, but this case is different from sorting numbers or strings in that two elements can be unrelated (neither bigger, smaller nor equal) to each other. And I am not a computer scientist and not so familiar with the theories behind sorting.

How does this problem look to an experienced programmer? Is there something I should do differently?

Comment on How to create a Tree::DAG_Node tree with sorted hash elements Select or Download Code

Replies are listed 'Best First'.
Re: How to create a Tree::DAG_Node tree with sorted hash elements by roboticus (Chancellor) on Jun 19, 2011 at 14:36 UTC
benny_vroom: To answer properly, I'd need more information about how you intend to use the data structure. Usually, the selection of a data structure depends on the types of operations you intend to perform on it. Since you're talking about IP addresses and ranges, I'm going to assume that you want to find the smallest IP range you've entered given an IP address. Other forms of access will suggest other data structures. At first glance, there are several special cases you need to deal with: Inserting data into the root of your tree, handling sets that don't overlap, handling identical ranges, handling sets where one is a proper subset of the other, and handling sets that intersect, where neither is a subset. Before addressing any of these items, I'm going to mention [no such wiki, sentinel values]. When I tried your code, I rearranged the node creation and put the any/any case last. The code crashed, as expected. The problem is a common one with tree/list creation code. Typically when creating lists and trees, you have the problem of treating the root node specially. Inserting an item into a list/tree is normally simple, but you have to worry about the special case of replacing the root node, which can complicate your code. Frequently, you can simplify your code by using a sentinel value. In this case, I selected any/any as the sentinel value because in this case, it encompasses *all* possible values. So by using it as the root node, you'll never have to worry about replacing the root node: `#build the tree my $root = Tree::DAG_Node->new; $root->attributes({ original=>"permit ip any any", source=>['0.0.0.0/0'], destination=>['0.0.0.0/0'] }); foreach my $node (@node_list){ my $location; $root->walk_down({ callback => sub { my $this_node = shift; if (compare_nodes($node,$this_node)){ $location = $this_node->address; 1; } else{ 0; } } }); $root->address($location)->add_daughter($node) if defined $root->address($location); }` [download] Sorting the input Because of case 5, you're not going to solve the problem by sorting alone. Yes, sorting may improve the situation somewhat, but you'll still have to deal with case 5. If you can put that case aside, then you can sort the list such that the largest ranges (i.e. smallest numbers after the '/') occur earlier in the insertion order. The any/any case would be a good sentinel value, so if it's always part of your data set, you can simply insert it first. That way your insertion and query logic will be simpler than if you didn't have that case. For example, when you rearrange your nodes such that the any/any case is last, your program dies because it tries to insert into the previously-remembered place. But since you don't have something prior to the root, you'd have to add special case logic to recognize that and replace the root. If you always sort the nodes then the any/any case would be first. I just made the any/any case the sentinel value, so I wouldn't have to complicate the code by detecting that special case. Handling intersecting sets If all your IP ranges are disjoin, identical or proper subsets, then sorting the list should solve the problem. But the instant you toss mutually intersecting sets into the mix, you have to figure out how you want to handle it. Sorting won't fix the problem, as neither node properly belongs "above" the other in the tree. So insertion will be a bit of a problem (i.e. "Where in the tree does this range go?"), and querying will also be a problem ("OK, I've found the range that my IP address is in. Do I keep looking for other matches?".). There are several ways you can handle those disjoint sets, depending on how you access them, whether you ever delete ranges during a run, whether you ever add ranges between queries, etc. I'm just going to pick my favorite method for sake of discussion. Since there's an easy solution if I can ignore a special case, I'll try to remove the special case: In this case, I'd simply break the special cases into pieces--in this case split intersecting ranges into one part that is a proper subset of the other, and the other part that lies outside the other. So, if your tree looks like: `0-255 (root) / \ 100-200(joe) 220-255(bob)` [download] and you want to insert the range 150-250(sue), I'd break (sue) into pieces as needed. So during the insertion, I'd notice that the beginning of the range (150) falls in range (joe), but the end doesn't. So I'd break (sue) into 150-200(sue.1) and 200-250(sue.2), then push (sue.2) back into the list of things to insert. Then I'd insert (sue.1) into the tree getting: `0-255 (root) / \ 100-200(joe) 220-255(bob) \| 150-200(sue.1)` [download] Then I'd pull (sue.2) off the list of items to insert, and notice that the start of the range (151) is in (root), but the end of the range lies within (bob). So I'd then break (sue.2) into 151-220(sue.3) and 221-250(sue.4). I could then insert (sue.3) into (root), and then insert (sue.4) into (bob), resulting in: `0-255 (root) / \| \ / \| \ / \| \ 100-200(joe) 151-220(sue.3) 220-255(bob) \| \| 150-200(sue.1) 221-250(sue.4)` [download] Of course, this gives you the question of what to report when you query for address 224. Do you report (sue), (bob) or (sue.4)? If you want the most restrictive range (i.e., smallest) the user entered, you'd want to report (bob). So in that case, when creating the ranges, I'd add the original size of the range to the data structure so I can report the proper one: `0-255 (root;256) / \| \ / \| \ / \| \ / \| \ 100-200 151-220 220-255 (joe;100) (sue.3;100) (bob;35) \| \| 150-200 221-250 (sue.1;100) (sue.4;100)` [download] Then, when reporting on 224, I could properly reply (bob) since it would have the smallest range when I traverse the tree. I'd start with (root;256), then see (bob;35), then see (sue.4;100) and notice that it's the leaf node. The smallest range I found during the traversal is (bob), so that's what I'd report. Data structure representation Of course, you'd have to figure out what you want to report for address 175. Do you want to report (joe), (sue) or both? Give some time to think about what your special cases are, and how you want to report them, and then thinking about the data structure will be much easier. If it were me, I think I'd use an array to hold the original range definitions where each entry would be `[min, max, size, <other_info> ]`, and arrays of array references to hold the tree, where each entry would be something like: `[min, max, <ref of original definition>, [ <children> ]`. I'd have the size in the original range definition to simplify querying (i.e., so I don't have to compute it continually). I hope this is helpful... ...roboticus When your only tool is a hammer, all problems look like your thumb.	[reply] [d/l] [select]
Re^2: How to create a Tree::DAG_Node tree with sorted hash elements by benny_vroom (Initiate) on Jun 20, 2011 at 16:20 UTC
@ roboticus: You are right, I should probably explain a bit about the background of my attempt. The Cisco firewalls we maintain at work have become bloated with several duplicate & overlapping rules due to mismanagement. Now a single firewall may have up to 900 ACL's. I want to sort & arrange them to identify rules that can be cleaned up. Most important need is to see which rules are sitting on top of smaller rules, some of them cover such wide access that they supersede quite a few other rules. So the visual representation of this supersets and subsets is very important. Now, an ACL will contain an array of source & destination networks (may be host address X.X.X.X/32 or network address Y.Y.Y.Y/Z) and another array of service ports (tcp/udp ports). Now the examples I have used are just over-simplified ACL, with one source & destination address each, to be used for unit testing the rest of the code. I have managed to write the code to convert config files to array of hash of arrays(each ACL is a hash) and also the code to compare two ACL's. Now all that is left to do is put them in the right order and display the subsets and supersets. Now back to your answer, you have summarized all the possibilities rather elegantly. I will not bother about case 5, I will treat such rules as unequal and yet insert them as peers to other rules (i.e, sue would be sister to bob & joe and I am not interested whether it can be split and inserted under the rest). I can think of no practical purpose to explore case 5 in my problem. I did consider the root node conundrum and that is why I started with "permit ip any any" as the root node. However, I didn't imagine that such a rule would arrive as part of the rule hash. Perhaps I should write a special case for that. Now that, all the obstructions are taken care of, what is the problem? I am not really sure how to go about sorting an array of hashes. I have a procedure that compares two hashes and returns 1 or 0. # compares two Tree::DAG Nodes and returns positive if right (second) +argument is bigger sub compare_nodes { my ($compare_node, $leaf_node) = @_; my $source_OK = compare_subnet($compare_node->{'attributes'}{'sour +ce'},$leaf_node->{'attributes'}{'source'}); my $destination_OK = compare_subnet($compare_node->{'attributes'}{ +'destination'},$leaf_node->{'attributes'}{'destination'}); print "$compare_node->{'attributes'}{'original'} is a daughter of +$leaf_node->{'attributes'}{'original'}\n\n" if ($source_OK && $destin +ation_OK); return 1 if ($source_OK && $destination_OK); return 0; } [download] I feel I have to do something similar to overloading the "cmp" operator, but unsure of how to proceed. Any pointers would be of great help! Also, I realize that if I get the sorting done, I have acheived most of my work, except that of grouping the rules into sets and subsets. At this point, I wonder if it is necessary to build a tree structure. Each comparison is an expensive task. For e.g, comparing hash1{sourcem,destn, portso} against hash2{sourcep,destq, portr} will require (mxp + nXq + oxr) calls to Net::IP->overlaps() procedure. By the time we finish sorting and ready to build tree, each ACL has already been compared every other ACL atleast once and all necessary information to group into sets & subsets have already been computed at least once (perhaps not remembered to be used later). So is there a better way to achieve my goal rather than build a N-ary tree? Should I code my own sort code with facility to group entries in to sets & subsets?Perhaps I am just being overly paranoid about repeating the comparison, deep inside I just want the task to be done. What is your opinion? Apologies if I am confusing or misleading, I often tend to talk in a confusing manner.	[reply] [d/l]
Re^3: How to create a Tree::DAG_Node tree with sorted hash elements by roboticus (Chancellor) on Jun 20, 2011 at 17:49 UTC
benny_vroom: Since you can ignore case 5, and it sounds like you're only interested in the case where some rules are possible subsets of others, then I think you can do it relatively easily. I think you could get away with sorting the list, like this: `$ cat 910595.pl #!/usr/bin/perl use strict; use warnings; use Net::IP; my @IPranges = map { new Net::IP($_) } <DATA>; @IPranges = sort { # We want largest range first ($b->size() <=> $a->size()) # use IP as next key \|\| ($a->ip() cmp $b->ip()) } @IPranges; print join("\n", @IPranges), "\n"; __DATA__ 1.2.3.0/24 1.2.0.0/16 192.168.0.0/16 192.168.4.0/24 192.168.4.192/26 127.0.0.1/32 10.0.0.0/8 $ ./910595.pl 10/8 1.2/16 192.168/16 1.2.3/24 192.168.4/24 192.168.4.192/26 127.0.0.1/32` [download] Then you can probably use the rest of your code as-is to find the related rules. ...roboticus When your only tool is a hammer, all problems look like your thumb.	[reply] [d/l]
Re^4: How to create a Tree::DAG_Node tree with sorted hash elements by benny_vroom (Initiate) on Jun 21, 2011 at 14:41 UTC
Re^5: How to create a Tree::DAG_Node tree with sorted hash elements by roboticus (Chancellor) on Jun 21, 2011 at 16:12 UTC
Some notes below your chosen depth have not been shown here
Re: How to create a Tree::DAG_Node tree with sorted hash elements by benny_vroom (Initiate) on Jun 19, 2011 at 09:18 UTC
I am also wondering whether I should write my own simple tree procedure. I see the term overkill so often alongside Tree::DAG_Node. I am not particularly comfortable in writing something ground up. Still open to ideas and looking forward to some help!	[reply]

Sorting the input

Handling intersecting sets

Data structure representation