in reply to Best Pairs

The following is relatively simple and efficient. The result is the top K pairs, with preference given to higher numbers in the event of a tie. Integer data from 1 to N allow us to use an array in place of a hash (the indices of the array form the required "set"):

# preconditions: my $elem = 1; my $k = 3; my $n_arrays = [map [split], <DATA>]; # algorithm: my @pairs = (0) x (@$n_arrays + 1); for (@$n_arrays) { next unless grep $_==$elem, @$_[0..$elem]; $_ != $elem and $pairs[$_]++ for @$_; } my @top_k = (sort{$pairs[$a]<=>$pairs[$b]} 1..$#pairs)[reverse -$k..-1 +]; print "@top_k\n"; __END__ 2 4 5 7 8 10 1 2 5 6 7 9 2 6 7 8 9 10 1 3 5 10 1 3 4 5 6 8 9 1 2 4 6 1 2 4 5 7 10 1 3 4 6 7 8 9

Replies are listed 'Best First'.
Re: Re: Best Pairs
by Roger (Parson) on Nov 08, 2003 at 02:11 UTC
    I think this approach is ok for small set of data, and gets inefficient when the input data is in the millions of lines, your algorithm requires the entire file to be read in memory (what if 200million lines?)

    To be more efficient, you should be looking for an algorithm that has a smaller/predictable memory footprint, reading the entire data file into memory is not an ideal option.

      I was coding to the original spec which clearly stated that he had N arrays, not a text file or a database or some other datastore from which to draw the data in less memory consumptive chunks! I only used reading in from DATA to create the arrays and satisfy the precondition of the problem statement.