Re: Best Pairs

You may want a compromise between the two extremes of caching the frequencies of all the pairs (which for large data sets will use too much RAM) and caching nothing (which for large data sets will use too much CPU). What about caching just a list of which sets each number occurs in?

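#!/usr/bin/perl
use strict;
use warnings;

my @set = map {[split /\s+/]} <DATA>;

# We want for each number a list of which sets it's in:
my %sets;        # number => list of indices into @set
my $setnum = 0;  # index of the set currently being scanned
for (@set) {
  for (@$_) {
    push @{$sets{$_}}, $setnum;
  }
  $setnum++;
}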

my ($element, $k) = (1, 3); # or =@ARGV or whatever.
{ my %count;
  # Note that this block could be a loop with different
  # values of $element and $k each time.

  for (@{$sets{$element}}) {
    for (uniq(grep { $_ != $element } @{$set[$_]})) {
      $count{$_}++; }
  }

  my @result = (sort { $count{$b} <=> $count{$a} } keys %count)[0..($k-1)];

  print "Results:  ", (join ", ", map {"($element, $_)"} @result), $/;
}

sub uniq {
  my %used;
  return grep { !$used{$_}++ } @_
}

__DATA__
2 4 5 7 8 10
1 2 5 6 7 9
2 6 7 8 9 10
1 3 5 10
1 3 4 5 6 8 9
1 2 4 6
1 2 4 5 7 10
1 3 4 6 7 8 9

$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/

Re: Re: Best Pairs by Anonymous Monk on Nov 08, 2003 at 04:22 UTC
Try printing out the values of your so-called intermediate-sized cache `%sets` in your example. The cache holds one entry for every number in every set, so its expected size (for randomly distributed datasets) equals the size of the original dataset. The AM solution above is more efficient in both space and time (although for large N and sparse data it should count frequencies in a hash rather than an array).
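For anyone who wants to check the size claim, here is a minimal sketch that builds the same kind of index from the sample data in the parent post and compares entry counts. The helper names `build_index` and `total_entries` are made up for this illustration and appear in neither post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map each number to the list of set indices
# that contain it (the same shape as %sets in the parent post).
sub build_index {
    my @sets = @_;
    my %index;
    for my $i (0 .. $#sets) {
        push @{ $index{$_} }, $i for @{ $sets[$i] };
    }
    return \%index;
}

# Hypothetical helper: total number of elements stored across
# all the array references passed in.
sub total_entries {
    my $n = 0;
    $n += @$_ for @_;
    return $n;
}

# The sample data from the parent post.
my @set = map { [ split ' ', $_ ] } (
    '2 4 5 7 8 10',
    '1 2 5 6 7 9',
    '2 6 7 8 9 10',
    '1 3 5 10',
    '1 3 4 5 6 8 9',
    '1 2 4 6',
    '1 2 4 5 7 10',
    '1 3 4 6 7 8 9',
);

my $index = build_index(@set);
printf "dataset entries: %d\n", total_entries(@set);
printf "index entries:   %d\n", total_entries(values %$index);
```

Both lines report the same count, since every occurrence of a number in a set contributes exactly one set index to the cache (assuming no number is repeated within a set).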