in reply to algorithm for 'best subsets'

Here's an artificial test-data generator, for those of you who are interested in doing your own benchmarking before publishing your methods.
    use Data::Dumper;   # needed for the dump at the end

    my %Items;
    sub build_test_data {
        # reproducible case
        srand(12345);

        # Sorted by prevalence. Keyword 'kaa' is way more common than 'kzz'.
        my @Keywords = 'kaa' .. 'kzz';

        # Each node is associated with an asciibetical list of unique keywords.
        # We groom out the top keywords, which are basically noise.
        for my $xx ('iaa' .. 'izz') {
            my $count = int(rand(8)) + 4;
            $Items{$xx}{ $Keywords[ int(rand() * rand() * @Keywords) ] }++
                while $count--;
            delete $Items{$xx}{$_} for 'kaa' .. 'kab';
            $Items{$xx} = [ sort keys %{ $Items{$xx} } ];
        }
        return unless @_;
        print Dumper \%Items;   # lots of raw data!
    }
    build_test_data();
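As an aside, the `int(rand() * rand() * @Keywords)` pick is what creates the prevalence skew: the product of two uniform variates piles the chosen indices up near the front of @Keywords. Here's a quick standalone check of that (my own demo, not part of the generator); the bucket counts should fall off sharply from the first quartile to the last.

```perl
use strict;
use warnings;

# Demonstrate that rand()*rand() skews index picks toward low values,
# so 'kaa' ends up far more common than 'kzz'.
srand(12345);
my $n      = 26 * 26;   # size of the 'kaa' .. 'kzz' range
my @bucket = (0) x 4;   # picks landing in each quartile of the index range
for (1 .. 10_000) {
    my $idx = int(rand() * rand() * $n);
    $bucket[ int(4 * $idx / $n) ]++;
}
printf "quartile %d: %d picks\n", $_, $bucket[$_] for 0 .. 3;
```

Analytically, P(U1*U2 < c) = c - c*ln(c), so roughly 60% of picks land in the first quartile and only about 3% in the last.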
Update: Here's a useful results format:
    tuples of 3:
        6 kaa kdf kea
        6 kab kaf kka
        4 kad kfa kfg
        ...
    tuples of 2:
        9 kad kfa
        8 kaj kda
        8 kaj kda
        ...
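For the tuples-of-2 case, one naive way to produce counts in that format is to enumerate every keyword pair per item and tally how many items contain each pair. This is my own sketch, not halley's method; `count_pairs` and the small `%demo` dataset are names I've made up for illustration.

```perl
use strict;
use warnings;

# Given a hashref of item => sorted arrayref of keywords (the shape %Items
# ends up in above), count how many items contain each keyword pair.
sub count_pairs {
    my ($items) = @_;
    my %pair_count;
    for my $item (keys %$items) {
        my @kw = @{ $items->{$item} };
        for my $i (0 .. $#kw - 1) {
            for my $j ($i + 1 .. $#kw) {
                $pair_count{"$kw[$i] $kw[$j]"}++;   # keywords already sorted
            }
        }
    }
    return \%pair_count;
}

# Tiny hand-made dataset, just to show the output format.
my %demo = (
    ia1 => [qw(kaa kab kad)],
    ia2 => [qw(kaa kad kfa)],
    ia3 => [qw(kad kfa)],
);
my $pairs = count_pairs(\%demo);
for my $pair (sort { $pairs->{$b} <=> $pairs->{$a} || $a cmp $b } keys %$pairs) {
    printf "%d %s\n", $pairs->{$pair}, $pair;
}
```

The catch, of course, is that this is O(items x pairs-per-item) in time and keeps every pair in memory, which is exactly what blows up once you move to triples and beyond on 'kaaa'..'kzzz'-scale data.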

--
[ e d @ h a l l e y . c c ]

Re^2: algorithm for 'best subsets'
by fizbin (Chaplain) on Mar 03, 2005 at 16:36 UTC
    Of course, to satisfy the "tens of thousands of keywords", you really should be using 'kaaa'..'kzzz' and 'iaaa'..'izzz'.

    Of course, when I did that, my method blew through all available memory while trying to compute 5-at-a-time. It did manage 4-at-a-time, though, and in under five minutes.

    -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
      Exactly. The test case should show that the functionality works, and that different methods find the same subsets. It shouldn't try to break your machine.

      In practice, I can slice the %Items and the %Keywords up a bit, and do smaller overlapping datasets. I can also farm the work out to multiple machines on these slices.

      More on what that is tonight; I've been stuck in meetings today, so I haven't had the chance to really explain what all this is about.

      --
      [ e d @ h a l l e y . c c ]

        halley,
        For the record, my solution has a very small memory footprint even for really large test cases, but it can be really slow to finish: about half an hour with this test case for a target size of 3.

        Cheers - L~R

        A clear indication of what output/format/datastructure you are expecting would be cool.


        Examine what is said, not who speaks.
        Silence betokens consent.
        Love the truth but pardon error.