Re^4: algorithm for 'best subsets'

The first set of bit vectors was not used in the code I posted, but they are used in another version of the code I was working on. If I operate on them in the same way I do the second set of bit vectors, I should get a complete set of the items that were combined into each partition. It's useful for testing and for using the partitions later.

I suppose you could use Graph::UnionFind by itself, but it would be horrendously inefficient. You would have to compare each pair of items to see if they intersected -- O(N^2). With my method (once I get it working right) I only have to test against the combined bit-set of each partition -- O(N*P) where P is the average number of partitions.

If you really want to scale to tens of thousands of items, I think this is the right approach. However, trying to work from the outside of Graph::UnionFind was probably a mistake. I need to create a modified version that stores the bit-vectors as part of the internal structure. Then, as it re-organizes itself, it would be able to keep the information accurately.

UnionFind uses a forest of n-ary trees which have only parent pointers. You can easily go from a child to the root, but not vice-versa. It can also "flatten" the tree by making more of the pointers go directly to the top. I believe the flattening is what is causing me to lose information.
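Here is a minimal sketch of the modified structure I have in mind, written in Python rather than Perl for brevity. It is not taken from Graph::UnionFind's source; the class name `BitUnionFind` and its methods are hypothetical, and it assumes each item's keywords are packed into an integer bit mask. The combined mask is kept only at each root, so the flattening (path compression) moves parent pointers without touching the stored bits.

```python
class BitUnionFind:
    """Union-find forest with parent pointers only. Each root also carries
    the OR of the keyword bits of every item in its tree (hypothetical design)."""

    def __init__(self):
        self.parent = {}   # item -> parent item (a root points to itself)
        self.bits = {}     # root item -> combined keyword bit mask

    def add(self, item, kbits):
        # A new item starts out as its own one-element tree.
        self.parent[item] = item
        self.bits[item] = kbits

    def find(self, item):
        # Walk the parent pointers up to the root.
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Flatten: point every node on the path directly at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Hang rb's tree under ra and fold rb's bits into ra's mask.
        self.parent[rb] = ra
        self.bits[ra] |= self.bits.pop(rb)
        return ra


uf = BitUnionFind()
uf.add("a", 0b0011)
uf.add("b", 0b0110)
uf.add("c", 0b1000)
uf.union("a", "b")
print(bin(uf.bits[uf.find("b")]))  # prints 0b111
```

Because only roots own a mask, a lookup such as `uf.bits[uf.find(x)]` gives the same answer before and after flattening.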

There are many references to the "union find" algorithm, some with animated Java applets. Here is one.

Re^5: algorithm for 'best subsets' by halley (Prior) on Mar 05, 2005 at 15:57 UTC
Pseudocode for my version of the algorithm (without using G::UF or any graph concept at all), if I understand it correctly. By "kbits" I mean a keyword bit vector. By "parts" I mean partitions. Am I missing something?

    # bag of parts
    # for each item
    #   for each existing part
    #     if this item's kbits intersect this part's kbits,
    #       union up the keywords
    #       for all other parts
    #         if this part's kbits intersect that part's kbits,
    #           merge that part into this part
    #       prune parts emptied by merger
    #   create a new part if no intersections found

It seems to work, and scans my whole current database of 5810 keywords in 6628 items in about three seconds. Unfortunately, it grows to about 5 partitions maximum, and by the time it's done, it has merged back everything into one partition. I think that's the fault of my keywords pruning, though. Even though I filter out the 100 most boring prepositions and articles, I need to find out the remaining words that cause the most mergers...

Update: How depressing. Not only is `'war'` the most common keyword in modern history, but it appears to be the common thread amongst all of the events as well; removing that one keyword broke the historical context into five separate partitions.

-- `[ e d @ h a l l e y . c c ]`
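For readers who want something runnable, the pseudocode above could be sketched roughly as follows. This is Python, not the Perl halley actually ran, and the function name `partition_items` and the `(name, kbits)` input shape are assumptions made for illustration, with each kbits held as an integer bit mask. It also takes one shortcut: since existing parts never share a keyword with each other, every part that needs merging must intersect the new item itself, so one pass over the parts is enough.

```python
def partition_items(items):
    """items: iterable of (name, kbits) pairs, where kbits is an int bit mask.
    Returns a list of (kbits, names) partitions with pairwise disjoint kbits."""
    parts = []  # the "bag of parts"
    for name, kbits in items:
        merged_bits, merged_names = kbits, [name]
        untouched = []
        for part_bits, part_names in parts:
            if part_bits & kbits:
                # This part shares a keyword with the item: fold it in.
                merged_bits |= part_bits
                merged_names.extend(part_names)
            else:
                untouched.append((part_bits, part_names))
        # With no intersections this is simply a new one-item part.
        untouched.append((merged_bits, merged_names))
        parts = untouched
    return parts


demo = [("a", 0b0001), ("b", 0b0010), ("c", 0b0011), ("d", 0b0100)]
for bits, names in partition_items(demo):
    print(bin(bits), sorted(names))
```

In the demo, item "c" shares a keyword with both "a" and "b", so those three end up in one part and "d" stays alone. Dropping a troublesome keyword amounts to clearing its bit in every item's mask before calling the function.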
Re^6: algorithm for 'best subsets' by tall_man (Parson) on Mar 06, 2005 at 13:43 UTC
Ok, halley, it looks like you're right. Since we can "or" the bit sets, UnionFind is redundant in this case. UnionFind is meant for the case when one must build up the partitions from a list of edges alone. Since we don't know the edges and we have an easy set partition membership test, your algorithm is preferable. I'm glad this was useful to break down your problem, even if it turned up a depressing word.