It's definitely a very valuable pass to make. It cuts down the combinatorics immensely, especially (as you said) if you can remove the most common words from the picture.
I implemented it myself, because I couldn't work out how to accumulate the partition items on the fly using Graph::UnionFind (for good reason, it seems :).
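For anyone following along, the on-the-fly accumulation can be sketched with a small hand-rolled union-find (here in Python for illustration rather than Perl; the keyword-to-items mapping and the toy data are assumptions, not the actual dataset):

```python
# Sketch: partitioning items with a homebrew union-find, assuming each
# keyword maps to the set of items it matches. Any two items that share
# a keyword are merged, so partitions accumulate as keywords stream in.

def find(parent, x):
    """Find the root of x with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the sets containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def partition(keyword_to_items):
    """Return the item partitions induced by shared keywords."""
    items = {i for s in keyword_to_items.values() for i in s}
    parent = {i: i for i in items}
    for item_set in keyword_to_items.values():
        it = iter(item_set)
        first = next(it, None)
        for other in it:          # link every item under this keyword
            union(parent, first, other)
    groups = {}
    for i in items:
        groups.setdefault(find(parent, i), set()).add(i)
    return list(groups.values())

# Hypothetical toy data: 'foo' links A-B, 'bar' links B-C, 'baz' is alone.
kw = {'foo': {'A', 'B'}, 'bar': {'B', 'C'}, 'baz': {'D'}}
parts = partition(kw)
print(len(parts))  # 2 partitions: {A, B, C} and {D}
```

Using plain hashes keyed by item (rather than Graph::UnionFind's per-edge bookkeeping) keeps the memory footprint roughly linear in the number of items.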
My homebrew implementation finds:

    [22:56:47.23] P:\test>436050-2 -W=3 -I=2 -NODETAILS
    Keywords: 16873 Items: 676
    141 partitons found.

    [22:56:47.81] P:\test>436050-2 -W=4 -I=2 -NODETAILS
    Keywords: 438697 Items: 676
    292 partitons found.

    [22:57:10.23] P:\test>436050-2 -W=3 -I=3 -NODETAILS
    Keywords: 16873 Items: 17576
    745 partitons found.

    [22:57:28.48] P:\test>
Which is good news because it means that it scales well in both directions.
I tried the 438897K / 438697I combination, but even after avoiding the memory requirement of Graph::UnionFind's hashes, it still needs more memory than I have, and my disk is badly fragmented, so I am defragging it before trying again.
Once you have partitioned, would you agree that it makes sense to take pairwise combinations within each partition and isolate those pairs that have nothing in common?
In reply to Re^7: algorithm for 'best subsets'
by BrowserUk
in thread algorithm for 'best subsets'
by halley