Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This is not precisely a Perl question, but I beg forgiveness and hope that I might find wisdom here regardless.

The problem is as follows:

Let's say we were dealing with a group of people in possession of two distinct and (for our purposes) unrelated attributes. Assuming we can consider them to be independent of one another, I'll offer height and weight as an example. I'll store these values in arrays @height and @weight, with the indices of the arrays corresponding to the individual people in the group. (So @height(0) and @weight(0) are the height and weight of the first person, and so on.) The mission: to select the two people with the greatest combined heights and weights. Given that height and weight are measured in different units, I want the solution to reflect the greatest heights and weights comparative to the rest of the group. (So person(1) with a height 20% above average and a weight 12% above average would be selected over person(2) with a height 28% above average and a weight only 2% above average.)

Also.. how could I go about scaling the solution if I wanted to select three, four, or more people from the group? What would happen if I added a third or fourth attribute to deal with?

Dealing with this efficiently has been a nightmare for me. If anyone has any references to solutions to this type of problem, I'd be greatly appreciative. Thanks in advance for the help.

-Terwin

Replies are listed 'Best First'.
Re: Algorithm concerns
by ferrency (Deacon) on May 03, 2002 at 13:46 UTC
    It might be easier if you broke your problem into several simpler parts, and then to combine the parts. For example, you could break down your problem and generalize it like this:

    • Normalize the values of each attribute into a common set of units.
    • Combine the values of all of one person's atributes to get a total for each group member.
    • Sort the members of the group by this attribute total.
    • Choose the top X members of the group.
    Algorighmically, the only thing here that isn't O(N) is the sorting. That's not so bad.

    On normalization: It really depends on the nature of your attributes, as to what kind of normalization makes sense. One easy way would be to arbitrarily assign the maximum value of any one attribute the value "1", and scale the rest of the values of that attribute from 0 to 1. But again, this probably wouldn't make sense in all cases- certainly not with heights and weights.

    As for the perlish aspects: I'd recommend storing each person as a hash, with each attribute name and value as a key-value pair in that hash. That way you can add more attributes, or data about the attributes (normalized values, sums, etc) into each person without adding more and more arrays into your code. Other than that, once you decide on a solution post some code and I'm sure someone will be happy to help :)

    Alan

      As a useful tip to doing the first of these (normalise units), take a look at Data::Dimensions
      Shortly after I posted, I had the same inclination.. normalizing, and sorting.

      I don't have a great deal of experience with normalizing data - would it be correct to say that the normalized data sets would have to have a similar distribution in order for the sum of the two values to represent a maximum?

      Or perhaps if they did not have similar distributions (For example, if height were one attribute and number of teeth were the other.. unless we're dealing with a hockey team height is going to have a much broader distribution of values), the normalized scale could somehow reflect that fact?

      Any suggestions on handling that task?

      Thanks,

      Terwin
        I'm no statistician, but my gut reaction is that you're probably comparing apples and oranges, so any comparison you come up with is not likely to be very useful. As you suggest, data distribution will play a large factor here.

        In my earlier post, I meant to use the word "normalization" in a more generic sense: instead of simply scaling the numbers so they fit into a predictable range of values, you need to convert all values to "universal units" which can be compared directly. Depending on your attributes, this could be quite complex: comparing a weight versus a height doesn't necessarily make sense, as you observed. Those "universal units" might not be inches or pounds; they might be "deviation from average" or something else which is unit-independant.

        But the problem will still be easier if you can separate this data conversion step from the other parts of the process.

        As far as getting "the right answer," I think you'd have better luck asking someone with a high level of Statistics skill.

        Alan.

Re: Algorithm concerns
by I0 (Priest) on May 03, 2002 at 23:14 UTC
    @height = (10,20,30,31,32,33); @weight = (19,18,17,16,15,14); for( (map{$_->[1]} sort{$b->[0] <=> $a->[0]} map{[$height[$_]*$weight[$_],$_]} (0..$#height))[0,1] ){ print "$_:$height[$_],$weight[$_]\n"; }