in reply to Re^5: algorithm for 'best subsets'
in thread algorithm for 'best subsets'

It wouldn't be easy to accumulate the item numbers, because the data structure for UnionFind keeps changing as I go. At any given point, I can find out to which partition a vertex belongs to by looking it up with "find". So the easiest way to get the partitions is to gather them up at the end.

However, I am still seeing a bug. Some of my "one-item" partitions seem to have shared keys with other partitions. Cases near the beginning of the item set (like "iaac") tend to have this problem. Trying to gather the items into partitions as I go along might help me find the bug.

The purpose of this pass is to find all the completely distinct partitions. There's no point in looking for n-ary unions among things that have no bits in common. So I would apply the original algorithm to each subset.

This pass will also help halley decide if the keywords are too generic. If it's all one big clump, there may be too many common words in the set.

Replies are listed 'Best First'.
Re^7: algorithm for 'best subsets'
by BrowserUk (Patriarch) on Mar 04, 2005 at 23:27 UTC

    It's definitely a very valuable pass to make. It cuts down the combinatorics immensely, especially (as you said) if you can remove the most common words from the picture.

    I implemented it myself, because I couldn't work out how to accumulate the partiton items on the fly using Graph::UnionFind--for good reason it seems:).

    My homebrew implementation finds

  • 141 partitons in the 17576 keywords / 676 items set in .5 seconds.
  • 292 partitons in the 436697 keywords / 676 items set in 23 seconds.
  • 745 partitions in the 17676 keywords / 17676 items set in 18 seconds.
    [22:56:47.23] P:\test>436050-2 -W=3 -I=2 -NODETAILS Keywords: 16873 Items: 676 141 partitons found. [22:56:47.81] P:\test>436050-2 -W=4 -I=2 -NODETAILS Keywords: 438697 Items: 676 292 partitons found. [22:57:10.23] P:\test>436050-2 -W=3 -I=3 -NODETAILS Keywords: 16873 Items: 17576 745 partitons found. [22:57:28.48] P:\test>

    Which is good news because it means that it scales well in both directions.

    I tried the 438897K / 438697I combination, but even with avoiding the memory requirement of Gr::UnF's hashes, it still requires more memory than I have, and my disk is badly fragged so I am defragging it before trying again.

    Once you have partitioned, would you agree that it makes sense to do a pairwise combinations of the partitions and isolating those pairs within the partition that have nothing in common?


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
      Thanks for your suggestions and independent work. Please post your home-brew solution, since my use of Graph::UnionFind is still not working.

        Here's my code--it's not particularly well doc'd.

        The partitioning is done in the last set of nested for loops. I'm using the combined (&'d) bitmaps as the keys to the %partns hash and accumulating the item numbers as an array in the value.

        That means deleting the previous key, after transfering the value to the new key, each time a new item is added to a partitition. Is also requires a next LABEL: which is the first time I've ever needed to use that, but it seems to work okay.

        Omitting the -NODETAILS switch from the command line will cause %partns to be dumped to the screen.

        The -W=w and -I=n switches control the powers of 26 used for the keywords/items generation.

        The defaults are 2 & 2 which (I belive) is the dataset that halley offered for benchmarking, though I am building a different datstructire and (I suspect) that rand() may be producing different sequences on different builds/platforms?

        #! perl -slw use strict; use List::Util qw[ max sum ]; our $W ||= 2; my( $W1, $W2, $S2 ) = ( 'a'x$W, 'z'x$W, 'b'x$W ); our $I ||= 2; my( $I1, $I2 ) = ( 'a'x$I, 'z'x$I ); our $NODETAILS; our $SRAND ||= 12345; srand( $SRAND ) if $SRAND; my @allWords = 'k'.$W1 ... 'k'.$W2; ## All the +words my @stopWords = 'k'.$W1 ..'k'.$S2; ## Those rej +ected as 'noise' my @keyWords = @allWords[ @stopWords .. @allWords ]; ## Only the on +es we are interested in print "Keywords: " . @keyWords; ## The item names my @items = 'i'.$I1 .. 'i'.$I2; print "Items: " . @items; ## Builds the same dataset as the example datasets generator ## Just represents it in a more compact and malleable form. my @itemMap; for my $item ( 0 .. $#items ) { my $count = 4 + int rand( 8 * (1+ (($W-2) * 26)) ); my $vector = ''; vec( $vector, int( rand() * rand() * @keyWords ), 1 ) = 1 while $c +ount--; $itemMap[ $item ] = $vector; } ## Partition items into nonoverlapping sets my %partns; OUTER: for my $itemno ( 0 .. $#items ) { for my $partn ( keys %partns ) { next unless $partns{ $partn }; ## Skip old ones my $common = unpack '%b*', ( $itemMap[ $itemno ] & $partn ); next unless $common; my $newPartn = $partn & $itemMap[ $itemno ];## Form a new part +ition key $partns{ $newPartn } = $partns{ $partn }; ## move over the +old item numbers push @{ $partns{ $newPartn } }, $itemno; ## Add the new one delete $partns{ $partn } unless $partn eq $newPartn; ## del +ete the old partiton next OUTER; } ## If we got here, this item had no keywords in common with any ex +isting partition ## So start a new one $partns{ $itemMap[ $itemno ] } = [ $itemno ]; } unless( $NODETAILS ) { print "[ @$_ ]" for values %partns; } print scalar keys %partns, " partitons found."; exit;

        Examine what is said, not who speaks.
        Silence betokens consent.
        Love the truth but pardon error.

        I assume you noticed that I was ANDing when I should have been ORing?


        Examine what is said, not who speaks.
        Silence betokens consent.
        Love the truth but pardon error.