Re: Help thinking about an alternate algorithm
by BrowserUk (Patriarch) on Jan 15, 2014 at 16:19 UTC
Essentially, you've described a string comparison problem, where each 'letter' of the string is a category (so each letter has a different range of possibilities) and the number of categories is the length of the string.
> I need to determine the furthest distance two selections are from a list of possible selections.

That bit -- "two selections from a list of possible selections" -- is muddy. But if we ignore that and only consider this bit:
> In other words, comparing every selection in the list against every other selection in the list.
What you are describing is a full cross-comparison using a variation on the 'edit distance'.
The bottom line, I think, is that because you are seeking a relative measure, there is no way to avoid a full cross comparison.
This is because a difference in the first letters is just as significant as one in the last; thus any attempt to sort the strings, or to reduce them to numerical values so that they may be ordered with a view to reducing the number of comparisons, fails.
Maybe if you encode each category as a set of bits and combine those bits into bit strings, the individual comparisons reduce to xoring the pairs of strings and counting the set bits in the result.
That should prove substantially faster than 5 to 10 individual subtractions and summing the results.
But I think the best you can hope for is to improve the per-pair comparison cost; rather than there being any way to reduce the number of per-pair comparisons that must be done. (I think :)
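For what it's worth, a minimal sketch of that bit-string idea. The encoding is my own invention for illustration: one bit per possible choice within each category, so two outfits differ in exactly two bits for every category that differs:

```perl
use strict;
use warnings;

# Hypothetical one-hot encoding (my illustration): each category gets a
# block of bits, one bit per possible choice. Two encoded outfits then
# differ in exactly 2*k bits, where k is the number of differing categories.
my %choices = (
    hat   => [qw(red white blue)],
    shirt => [qw(red white blue green yellow orange)],
);
my @cats = sort keys %choices;

sub encode {
    my ($outfit) = @_;                   # hashref: category => choice
    my $bits = '';
    for my $cat (@cats) {
        my @opts  = @{ $choices{$cat} };
        my $field = '0' x @opts;
        my ($idx) = grep { $opts[$_] eq $outfit->{$cat} } 0 .. $#opts;
        substr( $field, $idx, 1 ) = '1';
        $bits .= $field;
    }
    return pack 'b*', $bits;             # packed bit string
}

sub distance {                           # number of differing categories
    my ( $x, $y ) = @_;
    return unpack( '%32b*', $x ^ $y ) / 2;
}

my $kid1 = encode( { hat => 'red',  shirt => 'green' } );
my $kid2 = encode( { hat => 'blue', shirt => 'green' } );
print distance( $kid1, $kid2 ), "\n";    # 1: only the hats differ
```

One xor plus one bit count then replaces the 5 to 10 per-category comparisons.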
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
BrowserUk,
I have already maximized the efficiency of determining the distance between two selections using bitmasks. As stated, I know that a full cross comparison is necessary in some cases. I was just hoping that I could prune that full cross comparison in cases where I knew a compare couldn't lead to a higher distance than I already have. There is one trivial case where this short-circuiting works (when 2 items are the maximum distance possible). I was hoping there might be others, assuming the data will be mostly similar.
> There is one trivial case where this short circuiting works (when 2 items are the maximum distance possible).
"two items"? Are items people? Or categories?
What is the "maximum distance"? No categories the same. How can you know that until you have compared all categories for the given pair? What are you short-circuiting?
Re: Help thinking about an alternate algorithm
by LanX (Saint) on Jan 15, 2014 at 15:07 UTC
Not sure I get the problem:
> I need to determine the furthest distance two selections are from a list of possible selections.
You have an AoA (set of selections) and need to know a pair of selections with maximum possible distance?
And this set doesn't include all possible selections?
I.e. the elements ['jeans','shirt'] and ['skirt', 'sweater'] don't imply the existence of ['jeans','sweater'] in the set ?
And the distance is the number of different items in a selection?
> What ideas do you have?
Self-correcting codes (see Hamming distance) come to mind, but I don't know how to apply them here.
Cheers Rolf
( addicted to the Perl Programming Language)
LanX,
> Not sure I get the problem
I really like the outfit analogy because I can't discuss the real data (work). Imagine you have a classroom of kids. Each kid is told that the next day they must all wear a hat, a shirt, a pair of pants, a belt and a pair of shoes. They may choose to wear either a red, white or blue hat. The choices for shirt are larger (red, white, blue, green, yellow and orange), etc.
The number of students will always be small in comparison to the number of distinct possible outfits. What you need to do is identify the two students who had the largest number of differences between them. In my example, the greatest distance is 5 (hat, shirt, pants, belt and shoes). It doesn't matter how many items are in a particular category - if they chose a different color hat then the distance is 1, if they also chose a different color belt the distance goes up to 2.
As I have said, I know that comparing all students against all other students is necessary in the worst case. I was hoping there would be a way to say I currently have found two students with at least 3 things different so I shouldn't consider students who have at least 2 things the same because they can't possibly exceed my current high water mark.
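To make that concrete, a brute-force sketch with invented outfit data (the colours and students are mine, not the real data):

```perl
use strict;
use warnings;

# Invented outfits, one row per student: hat, shirt, pants, belt, shoes.
my @students = (
    [qw(red  red    blue  brown black)],
    [qw(blue green  blue  brown white)],
    [qw(red  yellow khaki black black)],
);

my ( $best, @pair ) = (-1);
for my $i ( 0 .. $#students - 1 ) {
    for my $j ( $i + 1 .. $#students ) {
        my ( $s1, $s2 ) = @students[ $i, $j ];
        # distance = number of categories where the two students differ
        my $d = grep { $s1->[$_] ne $s2->[$_] } 0 .. $#$s1;
        ( $best, @pair ) = ( $d, $i, $j ) if $d > $best;
    }
}
print "students $pair[0] and $pair[1], distance $best\n";
```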
> Each kid is told that the next day they must all wear a hat, a shirt, a pair of pants, a belt and a pair of shoes.
So each kid has a coordinate in a 5-dimensional space:
hat X shirt X pants X belt X shoes
You want to know the maximum Hamming distance between two kids.
> I was hoping there would be a way to say I currently have found two students with at least 3 things different so I shouldn't consider students who have at least 2 thing ...
One naive approach is to partition all students according to each dimension such that
union ( @hat{ qw/red white blue/ } ) == @all_students and so on.
Those partitions are the layers in a 5-dimensional cube.
For one student A=(a_1,a_2,a_3,a_4,a_5), the maximum-distance students are those in the intersection of @Not(a_n) for the smallest n.
> current high water mark.
Well, n has to be greater than your current maximum.
I'm not very keen to code this; I just want to give you pointers and keywords to find appropriate algorithms.
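One possible reading of that partition idea, sketched in Perl (the data and variable names are my own illustration):

```perl
use strict;
use warnings;

# Invented data: three students, three categories each.
my @students = (
    [qw(red  red   jeans)],
    [qw(blue green shorts)],
    [qw(red  green shorts)],
);

# Partition step: index students by their choice in every category.
my @index;    # $index[$cat]{$choice} = { student_id => 1, ... }
for my $s ( 0 .. $#students ) {
    $index[$_]{ $students[$s][$_] }{$s} = 1 for 0 .. $#{ $students[$s] };
}

# Students at maximum distance from student A are those left after
# removing, for every category, everyone who shares A's choice --
# i.e. the intersection of the complements @Not(a_n).
my $a_id       = 0;
my %candidates = map { $_ => 1 } grep { $_ != $a_id } 0 .. $#students;
for my $cat ( 0 .. $#{ $students[$a_id] } ) {
    my $same = $index[$cat]{ $students[$a_id][$cat] };
    delete @candidates{ grep { $same->{$_} } keys %candidates };
}
print join( ',', sort keys %candidates ), "\n";    # 1: differs everywhere
```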
Another thing which comes to mind is implication bases.
Something like "all kids with red hat and yellow shirt wear green pants" is an implication.
The minimal set of implications which can be used to derive all other possible implications is called an "implication base".
If all other students are within distance 3 of A, this means that any 3 features A doesn't have already imply a feature A has.
There are many algorithms for efficiently calculating such implications for a given point set; I wouldn't be surprised if they already lead to the maximum Hamming distance as a side effect.
AFAIK "Implication bases" are used for DB optimization.
I'd start looking there.¹
HTH! =)
¹) I wouldn't be surprised if a modification of Bernhard Ganter's next-closure algorithm could already be used for this.
> I was hoping there would be a way to say I currently have found two students with at least 3 things different so I shouldn't consider students who have at least 2 things the same because they can't possibly exceed my current high water mark.
To know that two students have at least 2 things the same, you have to have already started comparing them.
And you won't know whether they have more than 2 things different until you have compared all the categories.
Re: Help thinking about an alternate algorithm
by Laurent_R (Canon) on Jan 15, 2014 at 18:42 UTC
Well, I think you can do some short-circuiting. Assume we have five items and that, at some point in your process, you have found that the maximum distance is at least 4 (i.e. you've found at least one pair of students with only one item in common). Then, when you are comparing some other student pair, if you find two items in common, you can stop checking the other items: you know that you're not going to find a better distance. In fact, you can stop even earlier: as soon as you have found one common item, you're not going to find a larger distance. In other words, you would still have to go through every possible pair, but you can often cut down the number of items that you need to compare. Not sure this helps much, though; it depends on how you implemented your algorithm. And maybe you already figured that out and I am just stating the obvious (please forgive me if such is the case), but I still wanted to say it just in case it might help.
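A sketch of that per-pair cutoff (the sub name and data are my own); it gives up on a pair as soon as the pair can no longer beat the current best:

```perl
use strict;
use warnings;

# Per-pair cutoff (names invented): stop comparing items as soon as the
# pair can no longer beat the current best distance $bound.
sub distance_capped {
    my ( $s1, $s2, $bound ) = @_;
    my $remaining = @$s1;    # items not yet compared
    my $d         = 0;       # differences found so far
    for my $i ( 0 .. $#$s1 ) {
        $d++ if $s1->[$i] ne $s2->[$i];
        $remaining--;
        # even if every remaining item differs, we cannot beat $bound
        return undef if $d + $remaining <= $bound;
    }
    return $d;
}

# This pair reaches distance 2, which beats a current best of 1:
print distance_capped( [qw(red red blue)], [qw(red green khaki)], 1 ), "\n";    # 2
```

A pair that shares too many items returns undef early instead of finishing the scan.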
This is similar to the "alpha-beta cutoff" used in, say, computer game play: as soon as you determine that a player would be worse off than you already know he could be, you stop searching, because you don't care how much worse off he would be. But I am not sure that this optimization would really apply here, because "the score of any particular kid" depends entirely upon which other kid you are comparing him to. It is a Hamming-distance problem.
Re: Help thinking about an alternate algorithm
by VinsWorldcom (Prior) on Jan 15, 2014 at 21:51 UTC
I normally never chime in on Limbic~Region questions as they're usually way over my head, but this one was intriguing.
Assuming arrays of students (from your classroom example) describing their attire selection as such:
my @students = (
    [qw(cap shortsleeve jeans leather flipflop)],
    [qw(beenie longsleeve shorts cloth highheel)],
    [qw(cap tanktop shorts rope flat)],
    [qw(beenie longsleeve shorts cloth flipflop)],
);
You can transpose the @students array and compare each individual category (hat, shirt, ...). The combinations that show up:
Hat: 0,2 and 1,3
Shirt: 1,3
...
add to a count for that student pair. The higher the count for the student pair, the more similar they are, and the shorter the "distance".
You should only need to look at the combinations, in this case 0,1; 0,2; 0,3; 1,2; 1,3; 2,3. The combination with the lowest count is your "winner".
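If I read the idea right, a hedged sketch (the grouping code is my own): within each category, group students by their choice, and bump a match count for every pair inside a group. The pair with the fewest matches is the most distant.

```perl
use strict;
use warnings;

my @students = (
    [qw(cap    shortsleeve jeans  leather flipflop)],
    [qw(beenie longsleeve  shorts cloth   highheel)],
    [qw(cap    tanktop     shorts rope    flat)],
    [qw(beenie longsleeve  shorts cloth   flipflop)],
);

# Per category, group students by choice; every pair within a group
# shares that choice, so bump its match count.
my %same;    # "i,j" => number of matching categories
for my $cat ( 0 .. $#{ $students[0] } ) {
    my %who;
    push @{ $who{ $students[$_][$cat] } }, $_ for 0 .. $#students;
    for my $group ( values %who ) {
        for my $i ( 0 .. $#$group - 1 ) {
            for my $j ( $i + 1 .. $#$group ) {
                $same{"$group->[$i],$group->[$j]"}++;
            }
        }
    }
}

# The pair with the fewest matches is the most distant pair.
my ($pair) = sort { ( $same{$a} // 0 ) <=> ( $same{$b} // 0 ) }
             map  { my $i = $_; map {"$i,$_"} $i + 1 .. $#students } 0 .. $#students - 1;
my $dist = @{ $students[0] } - ( $same{$pair} // 0 );
print "$pair => distance $dist\n";    # 0,1 => distance 5
```

Note this still visits every pair, so it is the same O(N^2) overall; only the bookkeeping changes.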
Again, this is a bit over my head, so maybe what I'm proposing is the wrong approach, or just as "bad" as (if not worse than) your worst-case solution.
UPDATE: I have code but am reluctant to post if it's a suboptimal approach and will lead others astray.
UPDATE: Code added below.
Re: Help thinking about an alternate algorithm
by hdb (Monsignor) on Jan 17, 2014 at 08:35 UTC
So far all proposals seem to be O(N^2), and I cannot think of anything that does not have a worst-case performance of that order. Stealing VinsWorldcom's example, one implementation would be:
use strict;
use warnings;

my @students = (
    [qw(cap shortsleeve jeans leather flipflop)],
    [qw(beenie longsleeve shorts cloth highheel)],
    [qw(cap tanktop shorts rope flat)],
    [qw(beenie longsleeve shorts cloth flipflop)],
);

my @best = ( -1 );
while ( my $x = shift @students ) {
    Y: for my $y (@students) {
        my $diff = @$x;
        for ( my $i = 0; $i < @$x; $i++ ) {
            $diff-- if $x->[$i] eq $y->[$i];
            next Y if $diff <= $best[0];
        }
        @best = ( $diff, $x, $y );
    }
}
print "@$_\n" for @best[ 1, 2 ];
Not sure this is an improvement over what you have already.
I am still thinking about whether any kind of sorting will improve performance. The aim would be to trigger the shortcut from the loop as often as possible. It exits early if two arrays are similar, but ONLY if one has already encountered two very different ones. My intuition says one should put the columns with the least variation first, and also the rows with the most frequent items. But I cannot get to grips with why this should be so.
Re: Help thinking about an alternate algorithm
by SuicideJunkie (Vicar) on Jan 16, 2014 at 15:16 UTC
What if you spend extra memory to partially process each pair in the order of how likely they are to be the best pair?
Each pair will need to track where it is in the sequence of categories to check next.
- For each $pair in @contenders:
- If $pair->{nextCategory} >= @categories, then you have found the best.
- Otherwise, test the category indicated by $pair->{nextCategory}.
- Increment the pair's nextCategory.
- If the pair didn't match, redo.
- If they do match, push the pair onto a @reserve and try the next contender.
- Once the @contenders are exhausted, repopulate them with the @reserve of pairs that had one more match than the previous contenders.
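My reading of those steps as a sketch (the sub name and data are invented); pairs are processed in rounds, where round k holds the pairs with exactly k matches found so far, and the first pair to run out of categories has the fewest matches, i.e. the largest distance:

```perl
use strict;
use warnings;

# Sketch of the rounds idea (my reading): pairs race through mismatching
# categories; a pair that matches is shelved until the next round.
sub most_different {
    my @students   = @_;
    my $ncat       = @{ $students[0] };
    my @contenders = map {
        my $i = $_;
        map { { a => $i, b => $_, next => 0 } } $i + 1 .. $#students;
    } 0 .. $#students - 1;

    for my $matches ( 0 .. $ncat ) {    # round k holds pairs with k matches
        my @reserve;
      PAIR: for my $pair (@contenders) {
            while ( $pair->{next} < $ncat ) {
                my $cat = $pair->{next}++;
                if ( $students[ $pair->{a} ][$cat] eq $students[ $pair->{b} ][$cat] ) {
                    push @reserve, $pair;    # one more match: defer
                    next PAIR;
                }
            }
            # first pair to exhaust its categories has the fewest matches
            return ( $pair->{a}, $pair->{b}, $ncat - $matches );
        }
        @contenders = @reserve;
    }
}

my @winner = most_different(
    [qw(cap    shortsleeve jeans  leather flipflop)],
    [qw(beenie longsleeve  shorts cloth   highheel)],
    [qw(cap    tanktop     shorts rope    flat)],
);
print "@winner\n";    # 0 1 5
```

The extra memory is one small hash per pair, and very similar pairs get deferred instead of fully compared.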
Re: Help thinking about an alternate algorithm
by GrandFather (Saint) on Jan 16, 2014 at 22:41 UTC
Can you provide a small sample data set that reflects the distribution of combinations in your typical real data? Of particular interest: are combinations likely to be fairly unique (hardly anyone wears the same hat for example), or does everyone follow the fashion leaders so there are a few fairly unique combinations and a lot of very similar "me too" combinations, or maybe fashions are regimented so pretty much everyone wears the same uniform with only small variation?
True laziness is hard work
Re: Help thinking about an alternate algorithm
by LanX (Saint) on Jan 16, 2014 at 13:57 UTC
Now to something completely different.
Did you think about the inverted problem: constructing sets with a maximal Hamming distance?
Those sets are much smaller, i.e. any n-set either has a span of n or is reasonably small.
For instance the biggest 3-set of span 2 is
AAA
ABB
BAB
AAB
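A quick brute-force check (my own sketch) that the four strings above indeed have span 2:

```perl
use strict;
use warnings;

my @set = qw(AAA ABB BAB AAB);

# Hamming distance between two equal-length strings.
sub dist {
    my ( $x, $y ) = @_;
    return scalar grep { substr( $x, $_, 1 ) ne substr( $y, $_, 1 ) } 0 .. length($x) - 1;
}

my $span = 0;
for my $i ( 0 .. $#set - 1 ) {
    for my $j ( $i + 1 .. $#set ) {
        my $d = dist( $set[$i], $set[$j] );
        $span = $d if $d > $span;
    }
}
print "span $span\n";    # span 2
```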
And it doesn't matter how variable your selection is: another C or D at one position doesn't change the fact that any set with more than 4 selections must have at least 2 selections with distance 3¹ (see update).
Look for "hamming codes" for formulas to estimate the maximum bound.
So no need to test all pairs if the number gets bigger.
This should shorten calculations considerably...
HTH! :)
¹) Hmm, sorry, there is one edge case I forgot...
AAA
ABB
ACC
ADD
AEE
...
But you can easily see that the position with the smallest variety already limits the maximal set. And it's very easy to check.
Re: Help thinking about an alternate algorithm
by VinsWorldcom (Prior) on Jan 16, 2014 at 13:11 UTC