in reply to algorithm for 'best subsets'

As for the algorithm, it seems like you are just looking for large correlations among keywords. Sometimes people look for large correlations among keywords as a marker for eliminating redundant variables: if A and B always appear together, you don't need both. Such redundancy reduction is often used in statistical modeling to arrive at an independent set of variables upon which to base a model.
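
To make the "A and B always appear together" idea concrete, here is a minimal sketch, not a real correlation measure. It flags a pair as redundant when the number of items containing both keywords equals the number of items containing each one. The sub name `redundant_pairs` and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns keyword pairs that always occur together,
# i.e. every item containing one keyword also contains the other.
sub redundant_pairs {
    my ($items) = @_;
    my (%count, %pair);
    for my $keywords (values %$items) {
        my %seen;
        my @set = grep { !$seen{$_}++ } @$keywords;   # drop duplicates
        @set = sort @set;                             # fixed order per pair
        $count{$_}++ for @set;                        # items per keyword
        for my $i (0 .. $#set) {
            for my $j ($i + 1 .. $#set) {
                $pair{ $set[$i] }{ $set[$j] }++;      # items per pair
            }
        }
    }
    my @redundant;
    for my $first (sort keys %pair) {
        for my $second (sort keys %{ $pair{$first} }) {
            my $n = $pair{$first}{$second};
            push @redundant, "$first $second"
                if $n == $count{$first} && $n == $count{$second};
        }
    }
    return @redundant;
}

# Made-up sample data, not from the thread
my %sample = (
    a => [qw/red round/],
    b => [qw/red round big/],
    c => [qw/big square/],
);
print "$_\n" for redundant_pairs(\%sample);
```

With this sample only "red round" is printed: "big" also shows up in an item without either of them, so it is not redundant with anything.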

As for code to implement this measure, a hash of hashes (HoH) makes it simple:

    my %items = (
        z   => [ qw/one six/ ],
        'y' => [ qw/two three five/ ],
        x   => [ qw/one two five/ ],
    );

    my %corr;

    foreach my $item (keys %items) {
        my $max = scalar @{$items{$item}} - 1;
        for my $first (0..$max) {
            for my $second ($first+1..$max) {
                $corr{ $items{$item}[$first] }{ $items{$item}[$second] }++;
            }
        }
    }

    for my $first (keys %corr) {
        for my $second (keys %{$corr{$first}}) {
            print "$first $second: $corr{$first}{$second}\n";
        }
    }

Update: Note that this isn't a full correlation calculation, but it implements the OP's desired numbers for a 2-way measure. As tall_man pointed out, I blew it :) The code above files each pair under whatever order the two keywords have in a given array, so the same pair can end up split across two entries. Here is the corrected code, which sorts each array first:

    foreach my $item (keys %items) {
        # sort so that a given pair is always stored as (smaller, larger)
        my @set = sort @{$items{$item}};
        for my $first (0..$#set) {
            for my $second ($first+1..$#set) {
                $corr{ $set[$first] }{ $set[$second] }++;
            }
        }
    }

    for my $first (sort keys %corr) {
        for my $second (sort keys %{$corr{$first}}) {
            print "$first $second: $corr{$first}{$second}\n";
        }
    }

-Mark

Re^2: algorithm for 'best subsets'
by tall_man (Parson) on Mar 03, 2005 at 01:19 UTC
    You need to make sure the keys appear in the same order in all the arrays, or you could miss a correlation. For example, if x had "five one two" instead, no count in the answers would be above one.
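
The ordering problem is easy to reproduce. The sketch below runs the same pair counting twice on the data from the example above, with x changed to "five one two": once in array order and once with each array sorted first. The sub `pair_counts` and its `$sorted` flag are hypothetical names used only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count co-occurring keyword pairs in a hash of arrays.
# When $sorted is true, each array is sorted before pairing.
sub pair_counts {
    my ($items, $sorted) = @_;
    my %corr;
    for my $keywords (values %$items) {
        my @set = @$keywords;
        @set = sort @set if $sorted;
        for my $first (0 .. $#set) {
            for my $second ($first + 1 .. $#set) {
                $corr{ $set[$first] }{ $set[$second] }++;
            }
        }
    }
    return \%corr;
}

my %items = (
    z => [ qw/one six/ ],
    y => [ qw/two three five/ ],
    x => [ qw/five one two/ ],    # same keywords as before, different order
);

my $raw   = pair_counts(\%items, 0);
my $fixed = pair_counts(\%items, 1);

# In array order the pair is split across two entries, each counted once
print "unsorted: five/two = ", ($raw->{five}{two} // 0),
      ", two/five = ", ($raw->{two}{five} // 0), "\n";

# After sorting, both occurrences land in the same entry
print "sorted:   five/two = ", ($fixed->{five}{two} // 0), "\n";
```

In array order the two sightings of "two" and "five" are stored as separate entries with a count of 1 each; with sorting they collapse into a single entry with a count of 2.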