An interesting question is: what exactly do you mean by overlap?

For example: for some applications, a subnet that is entirely contained by another:

subnet 1: S..........E
subnet 2:    S.....E

Can easily be done away with entirely.

But subnets that overlap, though not completely:

subnet 1: S..........E
subnet 2:         S.....E
subnet 3: S.............E
subnet 4: S......................E

Can rarely be coalesced directly into a single subnet (#3), because the 'nearest' subnet that would contain both (#4) will usually also contain addresses that were not in the original set.
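To put rough numbers on that, here is a minimal Python sketch using the standard ipaddress module. The two address ranges are made up for illustration, and smallest_enclosing_network is a hypothetical helper name of mine, not something from the original question.

```python
import ipaddress

def smallest_enclosing_network(first, last):
    """Smallest single CIDR block that contains both first and last."""
    net = ipaddress.ip_network(f"{first}/32")
    while last not in net:
        net = net.supernet()  # shorten the prefix by one bit
    return net

# Hypothetical overlapping ranges (assumed for illustration):
#   192.168.0.0   - 192.168.0.191
#   192.168.0.128 - 192.168.1.63
# Their union is the contiguous range first..last below.
first = ipaddress.ip_address("192.168.0.0")
last = ipaddress.ip_address("192.168.1.63")

union_size = int(last) - int(first) + 1

# Exact cover of the union: needs more than one CIDR block.
exact_cover = list(ipaddress.summarize_address_range(first, last))

# Nearest single subnet containing both: covers extra addresses.
enclosing = smallest_enclosing_network(first, last)
extra = enclosing.num_addresses - union_size

print(exact_cover, enclosing, extra)
```

For these two ranges the exact cover takes two blocks (a /24 and a /26), while the nearest single enclosing block is 192.168.0.0/23, which pulls in 192 addresses that neither original range contained.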

And given 50_000 inputs, the likely scenario -- in the absence of more specificity regarding the distribution of the subnets -- is that they will form a tree with a few large, 'root' level subnets each containing a hierarchy of smaller subnets:

s...................................e  s......e  s...............e
  s...............e  s........e  s.e  s...........e  s.e  s.....e
    s......e  s....e  s......e

That suggests a strategy whereby instead of sorting the subnets by start/end address, you should sort them by subnet size. The first (largest) therefore will not be contained by any of the others, so it can be removed from the list and used as the root of a tree. It may, of course, overlap with one or more of the next few largest, but except for the rare event where the two can be combined into a single, unextended subnet, they will still be roots of their own subtrees.

So my suggestion would be to pick off the biggest ones and remove them from the list very quickly. You can then distribute the rest as subordinate to one (or more) of the roots you picked out. You can then (recursively) process each of those lists, to further divide their lists into smaller third level lists below a few second-level subroots. Rinse and repeat.

Subnets entirely contained within a higher level can be easily discarded.
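A minimal Python sketch of that idea, under some assumptions of mine: subnets are given as inclusive (start, end) integer pairs, the names Node, insert and build_tree are hypothetical, and a subnet contained in more than one root is filed under the first one found only. It is untested against real data and does a simple linear scan at each level, so treat it as an outline rather than a tuned implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    start: int                      # first address, as an integer
    end: int                        # last address, inclusive
    children: list = field(default_factory=list)

    def contains(self, other):
        return self.start <= other.start and other.end <= self.end

def insert(siblings, node):
    """File node under the first sibling that contains it, recursively."""
    for sib in siblings:
        if sib.contains(node):
            insert(sib.children, node)
            return
    # Nothing at this level contains it (it may still partially overlap
    # a sibling), so it becomes a new root or subroot here.
    siblings.append(node)

def build_tree(ranges):
    roots = []
    # Largest first: a subnet can only be contained by one that was
    # already placed. set() drops exact duplicates.
    by_size = sorted(set(ranges), key=lambda r: r[1] - r[0], reverse=True)
    for start, end in by_size:
        insert(roots, Node(start, end))
    return roots

# Made-up example input.
roots = build_tree([(0, 100), (10, 20), (12, 15), (90, 120), (200, 210), (10, 20)])
for r in roots:
    print((r.start, r.end), [(c.start, c.end) for c in r.children])
```

In the made-up input, (90, 120) only partially overlaps (0, 100), so it stays a root of its own, while (10, 20) and (12, 15) end up nested below (0, 100). If all you want is the reduced set, the roots list alone is the input with every fully contained subnet discarded.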

The initial sorting by subnet size is very fast. And the first level of recursion very quickly splits the dataset into several or many small subsets that are quickly processed at each new level of recursion.

I might have posted code, but I found that testing such code is very hard in the absence of a real dataset. Randomly generated datasets are just too random to give meaningful results.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: Algorithom to find overlaping subnets (Internet IPv4) by BrowserUk
in thread Algorithom to find overlaping subnets (Internet IPv4) by chrestomanci
