in reply to Algorithom to find overlaping subnets (Internet IPv4)

FWIW: This processes a list of 50e3 randomly generated CIDRs in < 3 minutes:

#! perl -slw use strict; use Time::HiRes qw[ time ]; use Data::Dump qw[ pp ]; use enum qw[ CIDR NETWORK SIZE RANGE ]; use enum qw[ FIRST LAST ]; sub dd2n { unpack 'N', pack 'C4', split '\.', $_[0] } sub n2dd { join '.', unpack 'C4', pack 'N', $_[0] } sub CIDR2Range { my( $network, $bits ) = $_[ 0 ] =~ m[([^/]+)/(\d+)] or die "Bad CIDR: '$_[0]'"; my $start = dd2n( $network ); my $hostMask = ( 1 << ( 32 - $bits ) ) -1; die "Bad CIDR: '$_[0]'" if $start & $hostMask; my $end = $start | $hostMask; return [ $start, $end ]; } sub rangesOverlap { my( $thisFirst, $thisLast ) = @{ $_[0] }; my( $thatFirst, $thatLast ) = @{ $_[1] }; return 1 unless $thisFirst > $thatLast or $thatFirst > $thisLast; } sub isContainedBy { my( $thisFirst, $thisLast ) = @{ $_[0] }; my( $thatFirst, $thatLast ) = @{ $_[1] }; return 1 if $thisFirst <= $thatFirst and $thisLast >= $thatLast; } my $start = time; chomp( my @CIDRs = <> ); my $count = @CIDRs; @CIDRs = sort { $b->[SIZE] <=> $a->[SIZE] || $b->[RANGE][FIRST] <=> $a->[RANGE][FI +RST] } map[ $_, m[([^/]+)/(\d+)$], CIDR2Range( $_ ) ], @CIDRs; #pp \@CIDRs; my @forest; OUTER: while( @CIDRs ) { my $next = pop @CIDRs; my $that = $next->[RANGE]; for my $tree ( @forest ) { my $this = $tree->{root}[RANGE]; if( isContainedBy( $this, $that ) ) { push @{ $tree->{contains} }, $next; next OUTER; } } push @forest, { root => $next }; } @forest = sort{ $a->{root}[RANGE][FIRST] <=> $b->{root}[RANGE][FIRST] } @forest; for my $tree ( @forest ) { print $tree->{root}[CIDR]; print "\t", join ' ', map $_->[CIDR], @{ $tree->{contains} }; } printf STDERR "Took %.3f seconds for $count CIDRs\n", time() - $start;

The (truncated) output produced is:

0.0.0.0/12 0.10.0.0/15 0.1.0.0/16 0.7.0.0/16 0.12.0.0/16 0.11.0.0/17 0.4.128. +0/19 0.11.160.0/19 0.12.72.0/21 0.2.132.0/22 0.3.184.0/22 0.15.254.0/ +24 0.14.150.128/26 0.16.128.0/17 0.17.0.0/16 0.18.163.128/26 0.19.237.0/24 0.24.16.0/20 0.24.104.0/21 0.25.0.0/19 0.25.144.0/22 0.27.0.0/16 0.28.190.0/23 0.28.203.128/25 0.29.0.0/16 0.29.64.0/20 0.30.240.0/20 0.31.157.160/27 0.34.0.0/16 0.35.0.0/18 0.37.252.0/22 0.38.64.0/22 0.40.0.0/13 0.40.0.0/15 0.40.0.0/16 0.47.0.0/16 0.42.0.0/18 0.44.128.0/18 0.47 +.192.0/18 0.43.0.0/19 0.47.64.0/21 0.47.192.0/21 0.48.0.0/13 0.54.0.0/15 0.55.128.0/18 0.48.160.0/19 0.52.224.0/19 0.53.96.0/19 + 0.55.32.0/20 0.49.124.0/22 0.51.241.128/26 0.56.168.0/23 0.58.0.0/21 ...

I produced the datasets using this generator script:

#! perl -slw use strict; use Math::Random::MT qw[ rand srand ]; sub n2dd { join '.', unpack 'C4', pack 'N', $_[0] } our $N //= 50e3; our $S //= 1; srand( $S ); my %uniq; for( 1 .. $N ) { my $bits = 8 + int( rand( 8 ) + rand( 8 ) + rand( 8 ) ); my $ip = int( rand( 2 **32 ) ); $ip &= ~( ( 1 << ( 32 - $bits ) ) -1 ); $ip = n2dd( $ip ); my $cidr = "$ip/$bits"; redo if exists $uniq{ $cidr }; $uniq{ $cidr } = undef; print $cidr; }

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Algorithom to find overlaping subnets (Internet IPv4)
by jimpudar (Pilgrim) on Apr 11, 2018 at 05:33 UTC

    I recently had to work on this problem, and needed a fast implementation.

    I found the easiest and fastest way was to use a Trie as mentioned by rg0now.

    I used the random generator which BrowserUk supplied to generate the random dataset.

    My version processed the dataset in less than two seconds on my 3.8GHz Linux box.

    I don't have nice sorted output though. It should be trivial to build up a data structure as you go if you need sorted output...

    Here's the code:

    #! /usr/bin/env perl use strict; use warnings; use autodie; use 5.10.0; use Tree::Trie; use Net::IP qw(ip_splitprefix ip_iptobin ip_get_mask); my $trie = Tree::Trie->new(); while (<>) { chomp; my ($ip, $len) = ip_splitprefix($_); my $binip = ip_iptobin($_,4); my $trie_entry = substr($binip, 0, $len); $trie->deepsearch('prefix'); my $data = $trie->lookup_data($trie_entry); if ($data) { say "'$_' is contained by or equals '$data'"; } else { $trie->deepsearch('choose'); my $data = $trie->lookup_data($trie_entry); if ($data) { say "'$_' contains '$data'"; } } $trie->add_data($trie_entry, $_); }

    Some truncated sample output:

    time ./trie_overlap.pl <50k_ips ... '116.15.84.0/23' is contained by or equals '116.14.0.0/15' '108.245.172.0/24' is contained by or equals '108.244.0.0/14' '13.238.173.128/25' is contained by or equals '13.224.0.0/11' '37.179.224.0/19' is contained by or equals '37.179.192.0/18' '73.147.0.0/17' is contained by or equals '73.144.0.0/12' '150.152.0.0/13' contains '150.154.0.0/18' '171.93.16.0/22' is contained by or equals '171.0.0.0/9' '80.246.128.0/21' is contained by or equals '80.240.0.0/13' real 0m1.957s user 0m1.832s sys 0m0.124s

    Anyone see any issues with this? It hasn't been completely battle tested yet :)

    Best,

    Jim