Re: algorithm for 'best subsets'

As for the algorithm, it seems like you are just looking for large correlations among keywords. Sometimes people look for large correlations among keywords as a marker for eliminating redundant variables. If A and B always appear together, you don't need both. Such redundancy reduction is often used in statistical modeling to arrive at an independent set of variables upon which to base a model.

As for code to implement this measure, it is simple with a hash of hashes (HoH):

my %items = ( z => [ qw/one six/ ],
              'y' => [ qw/two three five/ ],
              x => [ qw/one two five/ ],
            );

my %corr;

foreach my $item (keys %items) {
   my $max = scalar @{$items{$item}} - 1;
   for my $first (0..$max) {
      for my $second ($first+1..$max) {
         $corr{ $items{$item}[$first] }{ $items{$item}[$second] }++;
      }
   }
}

for my $first (keys %corr) {
   for my $second (keys %{$corr{$first}}) {
      print "$first $second: $corr{$first}{$second}\n";
   }
}

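Update: Note that this isn't a full correlation calculation, but it implements the OP's desired numbers for a 2-way measure. As tall_man pointed out, I blew it :) The code above misses a pair when its keywords appear in a different order in different lists. Here is the corrected code, which sorts each list first: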

foreach my $item (keys %items) {
   my @set = sort @{$items{$item}};
   for my $first (0..$#set) {
      for my $second ($first+1..$#set) {
         $corr{ $set[$first] }{ $set[$second] }++;
      }
   }
}

for my $first (sort keys %corr) {
   for my $second (sort keys %{$corr{$first}}) {
      print "$first $second: $corr{$first}{$second}\n";
   }
}
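For anyone who wants to run the corrected version on its own, here is a minimal self-contained sketch. The subroutine name pair_counts is my own label, not something from the thread, and the sketch assumes that no keyword is repeated within a single list.

```perl
use strict;
use warnings;

# pair_counts is a hypothetical helper name (not from the original post).
# Takes a hash ref of item => [keywords] and returns a hash ref of
# co-occurrence counts stored as {$a}{$b} with $a sorting before $b.
# Assumes no keyword is repeated within one list.
sub pair_counts {
    my ($items) = @_;
    my %corr;
    for my $item (keys %$items) {
        # Sort so a given pair always lands under the same key order.
        my @set = sort @{ $items->{$item} };
        for my $first (0 .. $#set) {
            for my $second ($first + 1 .. $#set) {
                $corr{ $set[$first] }{ $set[$second] }++;
            }
        }
    }
    return \%corr;
}

my %items = (
    z => [qw/one six/],
    y => [qw/two three five/],
    x => [qw/one two five/],
);

my $corr = pair_counts(\%items);
for my $first (sort keys %$corr) {
    for my $second (sort keys %{ $corr->{$first} }) {
        print "$first $second: $corr->{$first}{$second}\n";
    }
}
```

With the sample data, "five two" is the only pair counted twice (it occurs in both x and y); every other pair has a count of 1.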

-Mark

Re^2: algorithm for 'best subsets' by tall_man (Parson) on Mar 03, 2005 at 01:19 UTC
You need to make sure the keys appear in the same order in all the arrays, or you could miss a correlation. For example, if x had "five one two" instead, no count in the output would be above 1.
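The effect is easy to reproduce. The sketch below is only an illustration: the hash names %unsorted and %sorted are mine. It counts the same reordered data both ways, once in list order and once after sorting each list.

```perl
use strict;
use warnings;

# Same data as the parent post, but x lists its keywords in a different order.
my %items = (
    z => [qw/one six/],
    y => [qw/two three five/],
    x => [qw/five one two/],
);

# %unsorted and %sorted are illustrative names, not from the thread.
my (%unsorted, %sorted);
for my $item (keys %items) {
    my @raw = @{ $items{$item} };
    my @set = sort @raw;
    for my $first (0 .. $#raw) {
        for my $second ($first + 1 .. $#raw) {
            $unsorted{ $raw[$first] }{ $raw[$second] }++;   # list order
            $sorted{ $set[$first] }{ $set[$second] }++;     # sorted order
        }
    }
}

# In list order the two/five pair is split across two cells.
printf "unsorted: two/five=%d five/two=%d\n",
    $unsorted{two}{five} || 0, $unsorted{five}{two} || 0;
# After sorting, both occurrences land in the same cell.
printf "sorted:   five/two=%d\n", $sorted{five}{two} || 0;
```

In list order the pair shows up once as two/five (from y) and once as five/two (from x), so neither cell reaches 2. After sorting, both occurrences are counted under five/two.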