in reply to How to create a Tree::DAG_Node tree with sorted hash elements

benny_vroom:

To answer properly, I'd need more information about how you intend to use the data structure. Usually, the selection of a data structure depends on the types of operations you intend to perform on it. Since you're talking about IP addresses and ranges, I'm going to assume that you want to find the smallest IP range you've entered given an IP address. Other forms of access will suggest other data structures.

At first glance, there are several special cases you need to deal with:

  1. Inserting data into the root of your tree,
  2. handling sets that don't overlap,
  3. handling identical ranges,
  4. handling sets where one is a proper subset of the other, and
  5. handling sets that intersect, where neither is a subset.

Before addressing any of these items, I'm going to mention [no such wiki, sentinel values]. When I tried your code, I rearranged the node creation and put the any/any case last. The code crashed, as expected. The problem is a common one with tree/list creation code. Typically when creating lists and trees, you have the problem of treating the root node specially. Inserting an item into a list/tree is normally simple, but you have to worry about the special case of replacing the root node, which can complicate your code.

Frequently, you can simplify your code by using a sentinel value. In this case, I selected any/any as the sentinel value because in this case, it encompasses all possible values. So by using it as the root node, you'll never have to worry about replacing the root node:

#build the tree my $root = Tree::DAG_Node->new; $root->attributes({ original=>"permit ip any any", source=>['0.0.0.0/0'], destination=>['0.0.0.0/0'] }); foreach my $node (@node_list){ my $location; $root->walk_down({ callback => sub { my $this_node = shift; if (compare_nodes($node,$this_node)){ $location = $this_node->address; 1; } else{ 0; } } }); $root->address($location)->add_daughter($node) if defined $root->address($location); }

Sorting the input

Because of case 5, you're not going to solve the problem by sorting alone. Yes, sorting may improve the situation somewhat, but you'll still have to deal with case 5. If you can put that case aside, then you can sort the list such that the largest ranges (i.e. smallest numbers after the '/') occur earlier in the insertion order.

The any/any case would be a good sentinel value, so if it's always part of your data set, you can simply insert it first. That way your insertion and query logic will be simpler than if you didn't have that case.

For example, when you rearrange your nodes such that the any/any case is last, your program dies because it tries to insert into the previously-remembered place. But since you don't have something prior to the root, you'd have to add special case logic to recognize that and replace the root. If you always sort the nodes then the any/any case would be first. I just made the any/any case the sentinel value, so I wouldn't have to complicate the code by detecting that special case.

Handling intersecting sets

If all your IP ranges are disjoin, identical or proper subsets, then sorting the list should solve the problem. But the instant you toss mutually intersecting sets into the mix, you have to figure out how you want to handle it. Sorting won't fix the problem, as neither node properly belongs "above" the other in the tree. So insertion will be a bit of a problem (i.e. "Where in the tree does this range go?"), and querying will also be a problem ("OK, I've found the range that my IP address is in. Do I keep looking for other matches?".).

There are several ways you can handle those disjoint sets, depending on how you access them, whether you ever delete ranges during a run, whether you ever add ranges between queries, etc. I'm just going to pick my favorite method for sake of discussion. Since there's an easy solution if I can ignore a special case, I'll try to remove the special case: In this case, I'd simply break the special cases into pieces--in this case split intersecting ranges into one part that is a proper subset of the other, and the other part that lies outside the other.

So, if your tree looks like:

0-255 (root) / \ 100-200(joe) 220-255(bob)

and you want to insert the range 150-250(sue), I'd break (sue) into pieces as needed. So during the insertion, I'd notice that the beginning of the range (150) falls in range (joe), but the end doesn't. So I'd break (sue) into 150-200(sue.1) and 200-250(sue.2), then push (sue.2) back into the list of things to insert. Then I'd insert (sue.1) into the tree getting:

0-255 (root) / \ 100-200(joe) 220-255(bob) | 150-200(sue.1)

Then I'd pull (sue.2) off the list of items to insert, and notice that the start of the range (151) is in (root), but the end of the range lies within (bob). So I'd then break (sue.2) into 151-220(sue.3) and 221-250(sue.4). I could then insert (sue.3) into (root), and then insert (sue.4) into (bob), resulting in:

0-255 (root) / | \ / | \ / | \ 100-200(joe) 151-220(sue.3) 220-255(bob) | | 150-200(sue.1) 221-250(sue.4)

Of course, this gives you the question of what to report when you query for address 224. Do you report (sue), (bob) or (sue.4)? If you want the most restrictive range (i.e., smallest) the user entered, you'd want to report (bob). So in that case, when creating the ranges, I'd add the original size of the range to the data structure so I can report the proper one:

0-255 (root;256) / | \ / | \ / | \ / | \ 100-200 151-220 220-255 (joe;100) (sue.3;100) (bob;35) | | 150-200 221-250 (sue.1;100) (sue.4;100)

Then, when reporting on 224, I could properly reply (bob) since it would have the smallest range when I traverse the tree. I'd start with (root;256), then see (bob;35), then see (sue.4;100) and notice that it's the leaf node. The smallest range I found during the traversal is (bob), so that's what I'd report.

Data structure representation

Of course, you'd have to figure out what you want to report for address 175. Do you want to report (joe), (sue) or both? Give some time to think about what your special cases are, and how you want to report them, and then thinking about the data structure will be much easier.

If it were me, I think I'd use an array to hold the original range definitions where each entry would be [min, max, size, <other_info> ], and arrays of array references to hold the tree, where each entry would be something like: [min, max, <ref of original definition>, [ <children> ]. I'd have the size in the original range definition to simplify querying (i.e., so I don't have to compute it continually).

I hope this is helpful...

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^2: How to create a Tree::DAG_Node tree with sorted hash elements
by benny_vroom (Initiate) on Jun 20, 2011 at 16:20 UTC
    @ roboticus:

    You are right, I should probably explain a bit about the background of my attempt. The Cisco firewalls we maintain at work have become bloated with several duplicate & overlapping rules due to mismanagement. Now a single firewall may have up to 900 ACL's. I want to sort & arrange them to identify rules that can be cleaned up. Most important need is to see which rules are sitting on top of smaller rules, some of them cover such wide access that they supersede quite a few other rules. So the visual representation of this supersets and subsets is very important.

    Now, an ACL will contain an array of source & destination networks (may be host address X.X.X.X/32 or network address Y.Y.Y.Y/Z) and another array of service ports (tcp/udp ports). Now the examples I have used are just over-simplified ACL, with one source & destination address each, to be used for unit testing the rest of the code. I have managed to write the code to convert config files to array of hash of arrays(each ACL is a hash) and also the code to compare two ACL's. Now all that is left to do is put them in the right order and display the subsets and supersets.

    Now back to your answer, you have summarized all the possibilities rather elegantly. I will not bother about case 5, I will treat such rules as unequal and yet insert them as peers to other rules (i.e, sue would be sister to bob & joe and I am not interested whether it can be split and inserted under the rest). I can think of no practical purpose to explore case 5 in my problem.

    I did consider the root node conundrum and that is why I started with "permit ip any any" as the root node. However, I didn't imagine that such a rule would arrive as part of the rule hash. Perhaps I should write a special case for that.

    Now that, all the obstructions are taken care of, what is the problem?

    I am not really sure how to go about sorting an array of hashes. I have a procedure that compares two hashes and returns 1 or 0.

    # compares two Tree::DAG Nodes and returns positive if right (second) +argument is bigger sub compare_nodes { my ($compare_node, $leaf_node) = @_; my $source_OK = compare_subnet($compare_node->{'attributes'}{'sour +ce'},$leaf_node->{'attributes'}{'source'}); my $destination_OK = compare_subnet($compare_node->{'attributes'}{ +'destination'},$leaf_node->{'attributes'}{'destination'}); print "$compare_node->{'attributes'}{'original'} is a daughter of +$leaf_node->{'attributes'}{'original'}\n\n" if ($source_OK && $destin +ation_OK); return 1 if ($source_OK && $destination_OK); return 0; }

    I feel I have to do something similar to overloading the "cmp" operator, but unsure of how to proceed. Any pointers would be of great help!

    Also, I realize that if I get the sorting done, I have acheived most of my work, except that of grouping the rules into sets and subsets. At this point, I wonder if it is necessary to build a tree structure. Each comparison is an expensive task. For e.g, comparing hash1{sourcem,destn, portso} against hash2{sourcep,destq, portr} will require (mxp + nXq + oxr) calls to Net::IP->overlaps() procedure. By the time we finish sorting and ready to build tree, each ACL has already been compared every other ACL atleast once and all necessary information to group into sets & subsets have already been computed at least once (perhaps not remembered to be used later). So is there a better way to achieve my goal rather than build a N-ary tree? Should I code my own sort code with facility to group entries in to sets & subsets?Perhaps I am just being overly paranoid about repeating the comparison, deep inside I just want the task to be done. What is your opinion?

    Apologies if I am confusing or misleading, I often tend to talk in a confusing manner.

      benny_vroom:

      Since you can ignore case 5, and it sounds like you're only interested in the case where some rules are possible subsets of others, then I think you can do it relatively easily. I think you could get away with sorting the list, like this:

      $ cat 910595.pl #!/usr/bin/perl use strict; use warnings; use Net::IP; my @IPranges = map { new Net::IP($_) } <DATA>; @IPranges = sort { # We want largest range first ($b->size() <=> $a->size()) # use IP as next key || ($a->ip() cmp $b->ip()) } @IPranges; print join("\n", @IPranges), "\n"; __DATA__ 1.2.3.0/24 1.2.0.0/16 192.168.0.0/16 192.168.4.0/24 192.168.4.192/26 127.0.0.1/32 10.0.0.0/8 $ ./910595.pl 10/8 1.2/16 192.168/16 1.2.3/24 192.168.4/24 192.168.4.192/26 127.0.0.1/32

      Then you can probably use the rest of your code as-is to find the related rules.

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

        Roboticus,

        Note that the output is still not in the preferred order. The preferred order is:

        10/8 1.2/16 1.2.3/24 192.168/16 192.168.4/24 192.168.4.192/26 127.0.0.1/32

        And this is happening because you are sorting based on network size first and if equal, going on to compare the IP address (which would simply say which one of either has a smaller decimal equivalent).

        10/8 may have a big network size, but it is not a supernet for 172.16/16.

        Also, my comparison is not that of single network address. I need to compare two hashes whose elements are network addresses. Anyways, I learnt that I can use custom expressions in place of cmp or <=> to perform sort operation.

        my $acl = {}; $acl->{source} = []; # array of Network address e.g, [10/8 +,172.16/16] $acl->{dest} = []; # same as above $acl->{original}= 'unknown'; # A string for original ACL e.g, permit + ip any 10/8 # After I create a hash I will store the hash reference in an array push @acl_grp, \%acl; # After populating the array of hash refs I want to sort them # I would do something similar to this foreach $acl ( sort compare @acl_grp ){ print $acl->{original},"\n"; } # compare function is coded as # return -1 if $b is a supernet of $a # return 1 if $a is a supernet of $b # return 0 for all other cases, mimicking the behaviour of <=> sub compare($$){some code};
        Assuming that above code works, I expect to get a list of ACL's sorted in order. Now my problem is, how can I indent them depending on whether above node is a super ACL or an unrelated ACL.