in reply to algorithm for 'best subsets'

Here's an artificial test-data generator, for those of you who are interested in doing your own benchmarking before publishing your methods.
    use Data::Dumper;   # needed for the dump at the end

    my %Items;
    sub build_test_data {
        # reproducible case
        srand(12345);

        # Sorted by prevalence. Keyword 'kaa' is way more common than 'kzz'.
        my @Keywords = 'kaa' .. 'kzz';

        # Each node is associated with an asciibetical list of unique keywords.
        # We groom out the top keywords, which are basically noise.
        for my $xx ('iaa' .. 'izz') {
            my $count = int(rand(8)) + 4;
            $Items{$xx}{ $Keywords[ int(rand() * rand() * @Keywords) ] }++
                while $count--;
            delete $Items{$xx}{$_} for 'kaa' .. 'kab';
            $Items{$xx} = [ sort keys %{ $Items{$xx} } ];
        }
        return unless @_;
        print Dumper \%Items;   # lots of raw data!
    }
    build_test_data();
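As an aside, the `int(rand() * rand() * @Keywords)` pick is what creates the prevalence skew: the product of two uniform variates piles the chosen indices up near the front of @Keywords. Here's a quick standalone check of that (my own demo, not part of the generator); the bucket counts should fall off sharply from the first quartile to the last.

```perl
use strict;
use warnings;

# Demonstrate that rand()*rand() skews index picks toward low values,
# so 'kaa' ends up far more common than 'kzz'.
srand(12345);
my $n      = 26 * 26;   # size of the 'kaa' .. 'kzz' range
my @bucket = (0) x 4;   # picks landing in each quartile of the index range
for (1 .. 10_000) {
    my $idx = int(rand() * rand() * $n);
    $bucket[ int(4 * $idx / $n) ]++;
}
printf "quartile %d: %d picks\n", $_, $bucket[$_] for 0 .. 3;
```

Analytically, P(U1*U2 < c) = c - c*ln(c), so roughly 60% of picks land in the first quartile and only about 3% in the last.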
Update: Here's a useful results format:
    tuples of 3:
        6 kaa kdf kea
        6 kab kaf kka
        4 kad kfa kfg
        ...
    tuples of 2:
        9 kad kfa
        8 kaj kda
        8 kaj kda
        ...
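For the tuples-of-2 case, one naive way to produce counts in that format is to enumerate every keyword pair per item and tally how many items contain each pair. This is my own sketch, not halley's method; `count_pairs` and the small `%demo` dataset are names I've made up for illustration.

```perl
use strict;
use warnings;

# Given a hashref of item => sorted arrayref of keywords (the shape %Items
# ends up in above), count how many items contain each keyword pair.
sub count_pairs {
    my ($items) = @_;
    my %pair_count;
    for my $item (keys %$items) {
        my @kw = @{ $items->{$item} };
        for my $i (0 .. $#kw - 1) {
            for my $j ($i + 1 .. $#kw) {
                $pair_count{"$kw[$i] $kw[$j]"}++;   # keywords already sorted
            }
        }
    }
    return \%pair_count;
}

# Tiny hand-made dataset, just to show the output format.
my %demo = (
    ia1 => [qw(kaa kab kad)],
    ia2 => [qw(kaa kad kfa)],
    ia3 => [qw(kad kfa)],
);
my $pairs = count_pairs(\%demo);
for my $pair (sort { $pairs->{$b} <=> $pairs->{$a} || $a cmp $b } keys %$pairs) {
    printf "%d %s\n", $pairs->{$pair}, $pair;
}
```

The catch, of course, is that this is O(items x pairs-per-item) in time and keeps every pair in memory, which is exactly what blows up once you move to triples and beyond on 'kaaa'..'kzzz'-scale data.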

--
[ e d @ h a l l e y . c c ]

Re^2: algorithm for 'best subsets'
by fizbin (Chaplain) on Mar 03, 2005 at 16:36 UTC
    Of course, to satisfy the "tens of thousands of keywords", you really should be using 'kaaa'..'kzzz' and 'iaaa'..'izzz'.

    Of course, when I did that, my method blew through all available memory while trying to compute 5-at-a-time. It did manage 4-at-a-time, though, and in under five minutes.

    -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
      Exactly. The test case should show that the functionality works, and that different methods find the same subsets. It shouldn't try to break your machine.

      In practice, I can slice the %Items and the %Keywords up a bit, and do smaller overlapping datasets. I can also farm the work out to multiple machines on these slices.

      More on what that is tonight; I've been stuck in meetings today, so I haven't had the chance to really explain what all this is about.

      --
      [ e d @ h a l l e y . c c ]

        halley,
        For the record, my solution has a very small memory footprint even for really large test cases, but it can be really slow to finish: about half an hour with this test case for a target size of 3.

        Cheers - L~R

        A clear indication of what output/format/datastructure you are expecting would be cool.


        Examine what is said, not who speaks.
        Silence betokens consent.
        Love the truth but pardon error.