http://qs1969.pair.com?node_id=182771


in reply to Sorting by geographical proximity / clumping groups of items based on X and Y

I'd say that the problem is a bit underspecified. Two trivial solution spring into mind, both more or less fitting the specification. Obviously, neither extreme is what you want, but what do you want? Are there any limits on the number of clumps, the number of complaints per clumb, the maximum distance between any pair of complaints per clumb? Do you have an expressions in terms of complaints, clumbs, distances that you want to minimize?

Even if you say something like that you want sqrt(N) of clumps, there will be solutions that you probably do not wish - you could make a scanline parallel to the Y-axis and make a sweep. Each time your scanline hits a complaint, you increment a counter. If the counter hits k * sqrt(N), the last sqrt(N) encountered complaints are put together into a clump.

The hardest part is to come up with the specifictation: what do you want to minimize/maximize/minmax/maxmin? That will determine whether this problem is trivial to solve, or very hard. I've already shown trivial solutions, but it isn't hard to come up with specification that will quickly reduce problems like the travelling salesman, or 0-1 knapsack to an instance of the described problem.

Abigail

  • Comment on Re: Sorting by geographical proximity / clumping groups of items based on X and Y

Replies are listed 'Best First'.
Re: Re: Sorting by geographical proximity / clumping groups of items based on X and Y
by vroom (His Eminence) on Jul 18, 2002 at 12:37 UTC
    I would say the most likely definition would be that all items within a clump are at most X feet from any other item in the clump. X would probably have a default of 300 feet or so but could be user-definable.
      Despite what I just said about NP-completeness, the following algorithm might give a reasonable solution (certainly not optimal).
      1. Let C be the set of all complaints.
      2. For each complaint c in C find the set S(c) of all complaints d in C such that the distance between c and d is less than X (X is user defined).
      3. Find complaint e in C such that |S(e)| <= |S(f)| for all f in C; that is, find the complaint who has the most other complaints nearby. Pick a random one in case of a tie.
      4. Make a clump out of e and S(e).
      5. Remove all complaints g in S(e) from C. For all h remaining in C, remove from S(h) all g in S(e).
      6. If C is empty, we're done. Else, goto 3.
      Some pseudo code:
      # Get set of all complaints. my @C = get_all_complaints; # Find all the associated sets. my %D = map {my $c = $_; $c => {map {$_ => 1} grep {$c ne $_ && distance ($c, $_) < $X} @C}} @C; while (%D) { # Find complaint with the most nearby. my ($complaint, $size) = (undef, 0); while (my ($c, $set) = each %D) { ($complaint = $c, $size = keys %$set) if keys %$set > $size; } # Found largest, make a clump. make_clump ($complaint, @{$D {$complaint}}); # Delete largest from set. my $set = delete $D {$complaint}; # Delete associated set from set. delete @D {keys %$set}; # Delete associated set from associated sets. delete @{$_} {keys %$set} for values %D; }
      The performance will be quadratic, I'm afraid.

      Abigail

        Shouldn't step 3 in your algorithm description:
        |S(e)| <= |S(f)|
        read
        |S(e)| > |S(f)|
        ? (Not having been a Math major, of course, could lead me to misinterpret "|S(t)|", which here I'm interpreting as "the size of set t".)
        Anyway, I had intended to suggest an algorithm dealing with sorting based on the lengths of line segments, but I now realize that's probably not much use in this case (since vroom wants to find the clumps, not necessarily a point). And reading further down the page, I like Ntromda's idea, but I, like him/her, wouldn't know how to begin coding it (without thinking about it for a long while...).
      But you need more than that, otherwise the "Put each complaint in its own clump" is still a solution. It's tempting to add the restriction that you want to minimize the number of clumps, but that smells very much to an NP-complete covering problem.

      Abigail