Algorithom to find overlaping subnets (Internet IPv4)

chrestomanci has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Algorithom to find overlaping subnets (Internet IPv4) by rg0now (Chaplain) on Sep 19, 2011 at 12:29 UTC
50 thousand or so prefixes is not that many, so a quick and dirty O(n^2) compare-each-subnet-pair would do it based on, say, Net::IP `overlap`. If speed is a concern, then I would go with your "build a tree of up to 32 levels" idea. Such a tree is called a prefix trie, and it is often used in IP routing tables. You can code your own or choose something from CPAN (say, Net::IPTrie or Tree::Trie). Note, however, that checking overlaps is more difficult than your "inserting a network I find the space taken, I would have found an overlap" idea, because not just the node itself, but all its parents must be checked as well. In particular, you have an overlap either if the node corresponding to your subnet is taken or any of its parents is taken. rg0now	[reply] [d/l]
Re^2: Algorithom to find overlaping subnets (Internet IPv4) by chrestomanci (Priest) on Sep 20, 2011 at 13:47 UTC
Thanks for your informative reply. If I had know that mod::Net::IPTrie or mod::Tree::Trie existed I would have use them, unfortunately I did not know they existed and did not know the terminology to search for them on CPAN, and I thought it would take to long to implement something myself. Instead I wrote some code using string representations of the binary bits in a database using DBIx::Class. My DBIC table defintion looks like this: `__PACKAGE__->table('tblSubNet'); __PACKAGE__->add_columns( 'id' => { data_type=>'int', is_auto_increment=>1 + }, 'ip' => { data_type=>'varchar', size=>15 + }, 'mask' => { data_type=>'int' + }, 'name' => { data_type=>'varchar', size=>255, is_nullable=>1 + }, 'start' => { data_type=>'int' + }, # the IP address as A 32 bit number 'bitpattern' => { data_type=>'varchar', size=>32 + }, ); __PACKAGE__->set_primary_key('id'); __PACKAGE__->add_unique_constraint(['bitpattern']);` [download] Once I have populated the table of subnets, I then search it like this: my $order_by_size_rs = $all_subnets_rs->search({'source'=>$source},{'o +rder_by'=>{'-asc','mask'}}); while( my $network = $order_by_size_rs->next ) { my $overlaping_nets_rs = $net_rs->search({ 'mask' => { '>', $network->mask() }, 'bitpattern' => { 'like', $network->bitpattern().'%' }, }, { 'order_by'=>{'-asc',['mask','start']} }); if( $overlaping_nets_rs->count ) { printf "Subnet %s/%d (%s) has %d overlaps:\n", $network->ip, $network->mask, $network->name, $overlaping_nets_rs->count; ... # Code to add the overlaps to the report. } } [download] Using this algorithm I was able to search through the 50_000 subnets searching for overlaps in about 10 minutes. (On a 3GHz Linux box).	[reply] [d/l] [select]
Re: Algorithom to find overlaping subnets (Internet IPv4) by Khen1950fx (Canon) on Sep 18, 2011 at 21:58 UTC
Maybe this would help: Re: Subnet Overlap(fixes)	[reply]
Re: Algorithom to find overlaping subnets (Internet IPv4) by BrowserUk (Patriarch) on Sep 19, 2011 at 14:40 UTC
An interesting question is what exactly do you mean by overlap? For example: for some applications, a subnet that is entirely contained by another: `subnet 1: S..........E subnet 2: S.....E` [download] Can easily be done away with entirely. But subnets that overlaps but not completely: `subnet 1: S..........E subnet 2: S.....E subnet 3: S.............E subnet 4: S......................E` [download] Will rarely be able to be coalesced directly into a single subnet (#3), as the 'nearest' subnet that would contain both (#4) will usually also contain addresses not contained in the original set. And given 50_000 inputs, the likely scenario -- in the absence of more specificity regarding the distribution of the subnets -- is that they will form a tree with a few large, 'root' level subnets each containing a hierarchy of smaller subnets: `s...................................e s......e s........... +....e s...............e s........e s.e s...........e s.e s.. +...e s......e s....e s......e` [download] That suggests a strategy whereby instead of sorting the subnets by start/end address, you should sort them by subnet size. The first (largest) therefore will not be contained by any of the others, so can be removed from the list, and used as the root of a tree. It may of course, overlap with one or more of the next few largest, but except for the rare event where the two can be combined into a single, unextended subnet, they will still be roots of their own subtrees. So my suggestion would be to pick off the biggest ones and remove them from the list very quickly. You can then distribute the rest as subordinate to one (or more) of the roots you picked out. You can then (recursively) process each of those lists, to further divide their lists into smaller third level lists below a few second-level subroots. Rinse and repeat. Subnets entirely contained within a higher level can be easily discarded. The initial sorting by subnet size is very fast. And the first level of recursion very quickly splits the dataset into several or many small subsets that are quickly processed at each new level of recursion. I might have posted code, but I found that testing such is very hard in the absence of a real dataset. Randomly generated datasets are just too random to give meaningful results. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^2: Algorithom to find overlaping subnets (Internet IPv4) by chrestomanci (Priest) on Sep 20, 2011 at 08:17 UTC
Thank you for your input, however I think you misunderstand the nature of the problem. By 'Overlapping subnets' I actually mean one subnet that is entirely within another. It is not possible for a subnet to partially overlap another because we are expressing them in CDIR notation, rather than arbitrary ranges. This makes the problem simpler than you allowed for in your analysis. For example, consider the subnet x.y.0.0/16 There are exactly 2 /17 subnets that might fit inside it (x.y.0.0/17 and x.y.128.0/17) and a greater number of smaller subnets. There might also be larger subnets that contain it, but it is impossible to define a subnet (in CDIR notation) that includes some of the address space covered by x.y.0.0/16, and some address space that is not covered. You also talk about merging overlapping subnets. This is not what I am trying to do. The end purpose is to produce a reports of all the overlaps which will be used by the network infrastructure people to reconfigure the routers and DNS/DHCP servers so that the subnets no longer overlap.	[reply]
Re^3: Algorithom to find overlaping subnets (Internet IPv4) by BrowserUk (Patriarch) on Sep 20, 2011 at 14:09 UTC
There might also be larger subnets that contain it, but it is impossible to define a subnet (in CDIR notation) that includes some of the address space covered by x.y.0.0/16, and some address space that is not covered. Indeed. I wasn't aware of that property of CIDRs, though I now see it is obvious. You also talk about merging overlapping subnets. This is not what I am trying to do. The end purpose is to produce a reports of all the overlaps which will be used by the network infrastructure people to reconfigure the routers and DNS/DHCP servers so that the subnets no longer overlap. When the infrastructure people get their hands on your report, won't one of the things they might do be to consolidate (say) 0.0.0.0/30 & 0.0.0.4/30 into 0.0.0.0/29 thus reducing router table sizes? Or dropping this lot: `0.40.0.0/15 0.40.0.0/16 0.47.0.0/16 0.42.0.0/18 0.44.128.0/18 0.47.192.0/18 0.43.0.0/19 0.47.64.0/21 0.47.192.0/21` [download] Because they are all already covered by: `0.40.0.0/13`? (ie."coalescing" them.) Anyway, thanks for posting an interesting question. I guess if you don't find Re: Algorithom to find overlaping subnets (Internet IPv4) useful, someone else might :) Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re: Algorithom to find overlaping subnets (Internet IPv4) by BrowserUk (Patriarch) on Sep 19, 2011 at 20:11 UTC
FWIW: This processes a list of 50e3 randomly generated CIDRs in < 3 minutes: #! perl -slw use strict; use Time::HiRes qw[ time ]; use Data::Dump qw[ pp ]; use enum qw[ CIDR NETWORK SIZE RANGE ]; use enum qw[ FIRST LAST ]; sub dd2n { unpack 'N', pack 'C4', split '\.', $_[0] } sub n2dd { join '.', unpack 'C4', pack 'N', $_[0] } sub CIDR2Range { my( $network, $bits ) = $_[ 0 ] =~ m[([^/]+)/(\d+)] or die "Bad CIDR: '$_[0]'"; my $start = dd2n( $network ); my $hostMask = ( 1 << ( 32 - $bits ) ) -1; die "Bad CIDR: '$_[0]'" if $start & $hostMask; my $end = $start \| $hostMask; return [ $start, $end ]; } sub rangesOverlap { my( $thisFirst, $thisLast ) = @{ $_[0] }; my( $thatFirst, $thatLast ) = @{ $_[1] }; return 1 unless $thisFirst > $thatLast or $thatFirst > $thisLast; } sub isContainedBy { my( $thisFirst, $thisLast ) = @{ $_[0] }; my( $thatFirst, $thatLast ) = @{ $_[1] }; return 1 if $thisFirst <= $thatFirst and $thisLast >= $thatLast; } my $start = time; chomp( my @CIDRs = <> ); my $count = @CIDRs; @CIDRs = sort { $b->[SIZE] <=> $a->[SIZE] \|\| $b->[RANGE][FIRST] <=> $a->[RANGE][FI +RST] } map[ $_, m[([^/]+)/(\d+)$], CIDR2Range( $_ ) ], @CIDRs; #pp \@CIDRs; my @forest; OUTER: while( @CIDRs ) { my $next = pop @CIDRs; my $that = $next->[RANGE]; for my $tree ( @forest ) { my $this = $tree->{root}[RANGE]; if( isContainedBy( $this, $that ) ) { push @{ $tree->{contains} }, $next; next OUTER; } } push @forest, { root => $next }; } @forest = sort{ $a->{root}[RANGE][FIRST] <=> $b->{root}[RANGE][FIRST] } @forest; for my $tree ( @forest ) { print $tree->{root}[CIDR]; print "\t", join ' ', map $_->[CIDR], @{ $tree->{contains} }; } printf STDERR "Took %.3f seconds for $count CIDRs\n", time() - $start; [download] The (truncated) output produced is: 0.0.0.0/12 0.10.0.0/15 0.1.0.0/16 0.7.0.0/16 0.12.0.0/16 0.11.0.0/17 0.4.128. +0/19 0.11.160.0/19 0.12.72.0/21 0.2.132.0/22 0.3.184.0/22 0.15.254.0/ +24 0.14.150.128/26 0.16.128.0/17 0.17.0.0/16 0.18.163.128/26 0.19.237.0/24 0.24.16.0/20 0.24.104.0/21 0.25.0.0/19 0.25.144.0/22 0.27.0.0/16 0.28.190.0/23 0.28.203.128/25 0.29.0.0/16 0.29.64.0/20 0.30.240.0/20 0.31.157.160/27 0.34.0.0/16 0.35.0.0/18 0.37.252.0/22 0.38.64.0/22 0.40.0.0/13 0.40.0.0/15 0.40.0.0/16 0.47.0.0/16 0.42.0.0/18 0.44.128.0/18 0.47 +.192.0/18 0.43.0.0/19 0.47.64.0/21 0.47.192.0/21 0.48.0.0/13 0.54.0.0/15 0.55.128.0/18 0.48.160.0/19 0.52.224.0/19 0.53.96.0/19 + 0.55.32.0/20 0.49.124.0/22 0.51.241.128/26 0.56.168.0/23 0.58.0.0/21 ... [download] I produced the datasets using this generator script: `#! perl -slw use strict; use Math::Random::MT qw[ rand srand ]; sub n2dd { join '.', unpack 'C4', pack 'N', $_[0] } our $N //= 50e3; our $S //= 1; srand( $S ); my %uniq; for( 1 .. $N ) { my $bits = 8 + int( rand( 8 ) + rand( 8 ) + rand( 8 ) ); my $ip = int( rand( 2 **32 ) ); $ip &= ~( ( 1 << ( 32 - $bits ) ) -1 ); $ip = n2dd( $ip ); my $cidr = "$ip/$bits"; redo if exists $uniq{ $cidr }; $uniq{ $cidr } = undef; print $cidr; }` [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^2: Algorithom to find overlaping subnets (Internet IPv4) by jimpudar (Pilgrim) on Apr 11, 2018 at 05:33 UTC
I recently had to work on this problem, and needed a fast implementation. I found the easiest and fastest way was to use a Trie as mentioned by rg0now. I used the random generator which BrowserUk supplied to generate the random dataset. My version processed the dataset in less than two seconds on my 3.8GHz Linux box. I don't have nice sorted output though. It should be trivial to build up a data structure as you go if you need sorted output... Here's the code: #! /usr/bin/env perl use strict; use warnings; use autodie; use 5.10.0; use Tree::Trie; use Net::IP qw(ip_splitprefix ip_iptobin ip_get_mask); my $trie = Tree::Trie->new(); while (<>) { chomp; my ($ip, $len) = ip_splitprefix($_); my $binip = ip_iptobin($_,4); my $trie_entry = substr($binip, 0, $len); $trie->deepsearch('prefix'); my $data = $trie->lookup_data($trie_entry); if ($data) { say "'$_' is contained by or equals '$data'"; } else { $trie->deepsearch('choose'); my $data = $trie->lookup_data($trie_entry); if ($data) { say "'$_' contains '$data'"; } } $trie->add_data($trie_entry, $_); } [download] Some truncated sample output: time ./trie_overlap.pl <50k_ips ... '116.15.84.0/23' is contained by or equals '116.14.0.0/15' '108.245.172.0/24' is contained by or equals '108.244.0.0/14' '13.238.173.128/25' is contained by or equals '13.224.0.0/11' '37.179.224.0/19' is contained by or equals '37.179.192.0/18' '73.147.0.0/17' is contained by or equals '73.144.0.0/12' '150.152.0.0/13' contains '150.154.0.0/18' '171.93.16.0/22' is contained by or equals '171.0.0.0/9' '80.246.128.0/21' is contained by or equals '80.240.0.0/13' real 0m1.957s user 0m1.832s sys 0m0.124s [download] Anyone see any issues with this? It hasn't been completely battle tested yet :) Best, Jim	[reply] [d/l] [select]