Re^7: algorithm for 'best subsets'

It's definitely a very valuable pass to make. It cuts down the combinatorics immensely, especially (as you said) if you can remove the most common words from the picture.

I implemented it myself, because I couldn't work out how to accumulate the partiton items on the fly using Graph::UnionFind--for good reason it seems:).

My homebrew implementation finds

141 partitons in the 17576 keywords / 676 items set in .5 seconds.

292 partitons in the 436697 keywords / 676 items set in 23 seconds.

745 partitions in the 17676 keywords / 17676 items set in 18 seconds.

[22:56:47.23] P:\test>436050-2 -W=3 -I=2 -NODETAILS
Keywords: 16873
Items: 676
141 partitons found.

[22:56:47.81] P:\test>436050-2 -W=4 -I=2 -NODETAILS
Keywords: 438697
Items: 676
292 partitons found.

[22:57:10.23] P:\test>436050-2 -W=3 -I=3 -NODETAILS
Keywords: 16873
Items: 17576
745 partitons found.

[22:57:28.48] P:\test>
[download]

Which is good news because it means that it scales well in both directions.

I tried the 438897K / 438697I combination, but even with avoiding the memory requirement of Gr::UnF's hashes, it still requires more memory than I have, and my disk is badly fragged so I am defragging it before trying again.

Once you have partitioned, would you agree that it makes sense to do a pairwise combinations of the partitions and isolating those pairs within the partition that have nothing in common?

Examine what is said, not who speaks.

Silence betokens consent.

Love the truth but pardon error.

Comment on Re^7: algorithm for 'best subsets' Download Code

Replies are listed 'Best First'.
Re^8: algorithm for 'best subsets' by tall_man (Parson) on Mar 04, 2005 at 23:35 UTC
Thanks for your suggestions and independent work. Please post your home-brew solution, since my use of Graph::UnionFind is still not working.	[reply]
Re^9: algorithm for 'best subsets' by BrowserUk (Patriarch) on Mar 05, 2005 at 00:09 UTC
Here's my code--it's not particularly well doc'd. The partitioning is done in the last set of nested for loops. I'm using the combined (&'d) bitmaps as the keys to the `%partns` hash and accumulating the item numbers as an array in the value. That means deleting the previous key, after transfering the value to the new key, each time a new item is added to a partitition. Is also requires a `next LABEL:` which is the first time I've ever needed to use that, but it seems to work okay. Omitting the `-NODETAILS` switch from the command line will cause `%partns` to be dumped to the screen. The `-W=w` and `-I=n` switches control the powers of 26 used for the keywords/items generation. The defaults are 2 & 2 which (I belive) is the dataset that halley offered for benchmarking, though I am building a different datstructire and (I suspect) that `rand()` may be producing different sequences on different builds/platforms? #! perl -slw use strict; use List::Util qw[ max sum ]; our $W \|\|= 2; my( $W1, $W2, $S2 ) = ( 'a'x$W, 'z'x$W, 'b'x$W ); our $I \|\|= 2; my( $I1, $I2 ) = ( 'a'x$I, 'z'x$I ); our $NODETAILS; our $SRAND \|\|= 12345; srand( $SRAND ) if $SRAND; my @allWords = 'k'.$W1 ... 'k'.$W2; ## All the +words my @stopWords = 'k'.$W1 ..'k'.$S2; ## Those rej +ected as 'noise' my @keyWords = @allWords[ @stopWords .. @allWords ]; ## Only the on +es we are interested in print "Keywords: " . @keyWords; ## The item names my @items = 'i'.$I1 .. 'i'.$I2; print "Items: " . @items; ## Builds the same dataset as the example datasets generator ## Just represents it in a more compact and malleable form. my @itemMap; for my $item ( 0 .. $#items ) { my $count = 4 + int rand( 8 * (1+ (($W-2) * 26)) ); my $vector = ''; vec( $vector, int( rand() * rand() * @keyWords ), 1 ) = 1 while $c +ount--; $itemMap[ $item ] = $vector; } ## Partition items into nonoverlapping sets my %partns; OUTER: for my $itemno ( 0 .. $#items ) { for my $partn ( keys %partns ) { next unless $partns{ $partn }; ## Skip old ones my $common = unpack '%b*', ( $itemMap[ $itemno ] & $partn ); next unless $common; my $newPartn = $partn & $itemMap[ $itemno ];## Form a new part +ition key $partns{ $newPartn } = $partns{ $partn }; ## move over the +old item numbers push @{ $partns{ $newPartn } }, $itemno; ## Add the new one delete $partns{ $partn } unless $partn eq $newPartn; ## del +ete the old partiton next OUTER; } ## If we got here, this item had no keywords in common with any ex +isting partition ## So start a new one $partns{ $itemMap[ $itemno ] } = [ $itemno ]; } unless( $NODETAILS ) { print "[ @$_ ]" for values %partns; } print scalar keys %partns, " partitons found."; exit; [download] Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.	[reply] [d/l] [select]
Re^9: algorithm for 'best subsets' by BrowserUk (Patriarch) on Mar 08, 2005 at 01:19 UTC
I assume you noticed that I was ANDing when I should have been ORing? Examine what is said, not who speaks. Silence betokens consent. Love the truth but pardon error.	[reply] [d/l]