Re: Best Pairs (O(n) 15mins/million if ...)
by BrowserUk (Patriarch) on Nov 08, 2003 at 01:50 UTC
|
If the number of discrete values that can appear in your small arrays is less than 32 -- they do not have to be contiguous -- then you can represent each array by a 32-bit number, with each bit representing one of the numbers. This reduces the space required for each million arrays of up to 32 numbers from around 900MB to 4MB.
...101001010010001
|||||
||||+- 1
|||+-- 16
|||+-- 75
||+--- 204
|+---- 205
+----- 206
etc.
This also allows the 528 possible pairs to each be represented by a single, 32-bit number with 2-bits set.
Counting the pairs becomes a process of iterating the 'arrays' and masking them against the pairs, stored as keys in a hash,and incrementing the values.
Finding the top three counts in the resultant hash is easy. This process reduces the problem to O(number of arrays), and on my machine allows them to count the pairs at a rate of a little over 1 second per thousand, or around fifteen minutes/1 million with a storage requirement of 4MB per million + plus startup.
#! perl -slw
use strict;
use List::Util qw[ max ];
use Benchmark::Timer;
my $T = new Benchmark::Timer;
our $ARRAYS ||= 10;
our $K ||= 3;
# Generate all the possible combinations of pairs
# And create a hash with them as the keys.
my %pairs;
for my $n ( 0 .. 19 ){ $pairs{ 1<<$n | 2<<$_ } = 0 for $n .. 19; };
# Generate an 'array' of 1 million 'arrays' with up to 20
# numbers in each.
my $arrays = '';
vec( $arrays, $_, 32 ) = int rand( 1 << 19 ) for 1 .. $ARRAYS;
# Count the pairs
$T->start( 'Count the pairs' );
for my $pair ( keys %pairs ) {
for ( 0 .. $ARRAYS-1 ) {
$pairs{ $pair }++ if ( vec( $arrays, $_, 32 ) & $pair ) == $pa
+ir;
}
}
$T->stop( 'Count the pairs' );
# Find the top K pairs
$T->start( 'Find the top K' );
my @topK = ( keys %pairs )[ 1 .. 3 ];
for my $pair ( keys %pairs ) {
if( $pairs{ $pair } > max( @pairs{ @topK } ) ) {
push @topK, $pair;
shift @topK if @topK > 3;
}
}
$T->stop( 'Find the top K' );
# Display the results
for my $k ( @topK ) {
printf 'Pair [ %d, %d ] ', grep{ $k & (1<<$_) ? $_ : () } 0 .. 19;
printf 'appeared %d times'. $/, $pairs{ $k };
}
$T->report;
print 'Record memory usage';<stdin>;
__END__
P:\test>305470 -ARRAYS=10000 -K=3
Pair [ 12, 14 ] appeared 2521 times
Pair [ 4, 17 ] appeared 2579 times
Pair [ 7, 17 ] appeared 2582 times
1 trial of Count the pairs (10.315s total)
1 trial of Find the top K (10.014ms total)
Record memory usage
2772k
P:\test>305470 -ARRAYS=100000 -K=3
Pair [ 4, 17 ] appeared 25165 times
Pair [ 7, 16 ] appeared 25190 times
Pair [ 7, 11 ] appeared 25348 times
1 trial of Count the pairs (106.954s total)
1 trial of Find the top K (10.015ms total)
Record memory usage
3124k
P:\test>305470 -ARRAYS=200000 -K=3
Pair [ 9, 17 ] appeared 50261 times
Pair [ 7, 13 ] appeared 50356 times
Pair [ 9, 13 ] appeared 50386 times
1 trial of Count the pairs (213.307s total)
1 trial of Find the top K (10.014ms total)
Record memory usage
3524k
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Hooray!
Wanted!
| [reply] [d/l] [select] |
Re: Best Pairs
by Anonymous Monk on Nov 08, 2003 at 01:46 UTC
|
The following is relatively simple and efficient. The result is the
top K pairs, with preference given to higher numbers in the event of
a tie. Integer data from 1 to N allow us to use an array in place of
a hash (the indices of the array form the required "set"):
# preconditions:
my $elem = 1;
my $k = 3;
my $n_arrays = [map [split], <DATA>];
# algorithm:
my @pairs = (0) x (@$n_arrays + 1);
for (@$n_arrays) {
next unless grep $_==$elem, @$_[0..$elem];
$_ != $elem and $pairs[$_]++ for @$_;
}
my @top_k = (sort{$pairs[$a]<=>$pairs[$b]} 1..$#pairs)[reverse -$k..-1
+];
print "@top_k\n";
__END__
2 4 5 7 8 10
1 2 5 6 7 9
2 6 7 8 9 10
1 3 5 10
1 3 4 5 6 8 9
1 2 4 6
1 2 4 5 7 10
1 3 4 6 7 8 9
| [reply] [d/l] |
|
|
I think this approach is ok for small set of data, and gets inefficient when the input data is in the millions of lines, your algorithm requires the entire file to be read in memory (what if 200million lines?)
To be more efficient, you should be looking for an algorithm that has a smaller/predictable memory footprint, reading the entire data file into memory is not an ideal option.
| [reply] |
|
|
I was coding to the original spec which clearly stated that he had N
arrays, not a text file or a database or some other datastore from which to
draw the data in less memory consumptive chunks! I only used reading in
from DATA to create the arrays and satisfy the precondition of the problem
statement.
| [reply] |
Re: Best Pairs
by Jaap (Curate) on Nov 07, 2003 at 22:49 UTC
|
I don't understand what you mean with maximum pair. How is it calculated? | [reply] |
|
|
I believe that the idea is this:
There is a set of (sorted?) arrays, which contain numbers. Given a number (say, 1) and the number of pairs (say, 3), find the 3 combinations of two numbers, one of which is one, occurring in each array. Note that the 4 in the first array in the data set above does not form a pair with the 1 in the second array because they're not part of the same array. But the 1 and 4 in the last four arrays DO count.
To me, the solution could come from extensive pre-processing (i.e. creation of an extensive hash data structure, which would allow really fast lookup but which would make it hard to preprocess) or to do a lookup for each entered query. I would go for the first option.
I do have a few questions, however.
- Do numbers ever repeat w/i an array, like
1 1 2 3 4 4 5 6
- You note that (1,5), (1,6), and (1,4) all appear four times, so your query for the 3 highest makes sense. What happens, though, if (1,3) also had 4 entries? If you ask for the top 3, should it just discard one pair randomly.
| [reply] |
Re: Best Pairs
by jweed (Chaplain) on Nov 08, 2003 at 00:47 UTC
|
Here is my try. It should be better commented, but the only tricky line is commented, thankfully. :)
I must warn you that it is extremely ugly. I'm sure a more elegant solution is possible.
It is instantaneous on your given data set (and returns the right values, joy!). Comments welcome. (I'm a noob so I need all the comments I can get...)
EDIT: The code doesn't exactly work as instructed. It counts combinations when the two numbers are in different lines. Working on a fix...
| [reply] [d/l] [select] |
Re: Best Pairs
by Enlil (Parson) on Nov 08, 2003 at 01:59 UTC
|
Here is my take on this, but you did not mention what to do if the element did not exist at all, nor what to do if you ask for the top 3 elements, but there are 4 elements with the same max count (or 2 elements with the max count, and 3 elements with the next highest count), so the following code just returns the highest amounts found in that case (would return 4 elements, and 5 elements respectively).
-enlil | [reply] [d/l] |
Re: Best Pairs
by CountZero (Bishop) on Nov 08, 2003 at 10:16 UTC
|
A general optimisation remark: to reduce the memory foot-print, one can immediately reject all arrays in which the "anchor"-element ("1" in the given example) is not contained. In the given example this already reduces the number of arrays to process by 25%.
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
| [reply] |
Re: Best Pairs
by CountZero (Bishop) on Nov 08, 2003 at 11:14 UTC
|
My take on a solution:
use strict;
use warnings;
my $anchor = 1; # This is the basic element of the pairs we are lookin
+g for
my %results; # The final results will be found in this hash
foreach (<DATA>) {
my %hash;
foreach (split) { # Put the array into a hash
$hash{$_}=1;
}
if ($hash{$anchor}) { # Only process the hash if the anchor is pre
+sent
foreach (keys %hash) { # increment the frequencies in the resu
+lts-hash
++$results{$_} unless $_==$anchor; # but omit the anchor o
+r the sort will not work!
}
}
}
for my $x(sort {return $results{$b} <=> $results{$a}} keys %results) {
print "$x = $results{$x}\n";
}
__DATA__
2 4 5 7 8 10
1 2 5 6 7 9
2 6 7 8 9 10
1 3 5 10
1 3 4 5 6 8 9
1 2 4 6
1 2 4 5 7 10
1 3 4 6 7 8 9
This solution only reads in each array once one at a time (low memory use and linear time for processing) and only processes the arrays where the anchor element of the pairs we are looking for is present (should be reasonably fast: on my system it processes 10000 arrays with max. 30 elements each in less than 2 seconds). It outputs a sorted list of the number of times each element is found together with the array-element in the same array. Getting the top K-elements out of the %results-hash is left as an exercise for the reader.
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
| [reply] [d/l] |
|
|
Here's another version. I was about to post it, but then I saw CountZero's solution above, which uses pretty much the same procedure. However, as mine looks more concise, I thought, why not? FWIW:
my $elem = 1;
my $k = 3;
my %out;
while ( <DATA> ) {
next unless /\b$elem\b/;
$out{$_}++ for split;
}
delete $out{$elem};
print "$_ ($out{$_} times)\n" for
(sort { $out{$b} <=> $out{$a} } keys %out)[ 0 .. ( $k - 1 ) ];
dave | [reply] [d/l] |
|
|
TIMTOWTDI++ for Not a Number
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
| [reply] |
Re: Best Pairs
by Roger (Parson) on Nov 07, 2003 at 23:35 UTC
|
One 'efficient' algorithm I can think of -
Presteps:
Step 1 - build a hash that holds the combination pairs of 1..N (math notation N C 2) as search keys, values initialized to 0.
Step 2 - while iterating through the input lines/arrays, for each combination of the elements on the current line, increment value of $hash{$comb_idx} by 1.
The aim of the prestep is to reduce the data set you have to work with (N=20, number of pair-combinations is just 190). The result hash will contain a count of the combinations in the data set. The hash key can be, simply, $n1 . '-' . $n2. Eg., '1-6', '3-18', etc.
Poststeps:
Step 1. For the given element, build a new hash by grepping for the given element in the keys of the original hash, and assigning the key=>count in the hash.
Step 2. Sort this hash to find the hash key for the top N counts, and decode the hash keys to get the other numbers in the pair.
The beauty of this solution is that once the hash of pair-combination-count is built (in the initialization phase), looking for 'maximum pairs' would be lightening fast.
I can implement the above algorithm in Perl later, but I think it should be piece of cake for you, besides, you are only looking for an algorithm afterall.
:-)
| [reply] [d/l] [select] |
Re: Best Pairs
by jonadab (Parson) on Nov 08, 2003 at 03:08 UTC
|
You may want a compromise between the two extremes
of caching the frequencies of all the pairs (which
for large data sets will use too much RAM) and
caching nothing (which for large data sets will
use too much CPU). What about caching just a list
of which sets each number occurs in?
#!/usr/bin/perl
my @set = map {[split /\s+/]} <DATA>;
# We want for each number a list of which sets it's in:
for (@set) {
for (@$_) {
push @{$sets{$_}}, $setnum+0; # +0 numifies undef.
} $setnum++
}
my ($element, $k) = (1, 3); # or =@ARGV or whatever.
{ my %count;
# Note that this block could be a loop with different
# values of $element and $k each time.
for (@{$sets{$element}}) {
for (uniq(grep { $_ != $element } @{$set[$_]})) {
$count{$_}++; }
}
my @result = (sort { $count{$b} <=> $count{$a} } keys %count)[0..($k
+-1)];
print "Results: ", (join ", ", map {"($element, $_)"} @result), $/;
}
sub uniq {
my %used;
return grep { !$used{$_}++ } @_
}
__DATA__
2 4 5 7 8 10
1 2 5 6 7 9
2 6 7 8 9 10
1 3 5 10
1 3 4 5 6 8 9
1 2 4 6
1 2 4 5 7 10
1 3 4 6 7 8 9
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
| [reply] [d/l] [select] |
|
|
Try printing out the values of your so-called intermediate sized cache
%sets in your example. The expected size of your cache (for
randomly distributed datasets) equals the size of the original dataset.
The AM solution above is more efficient in both space and time (but
for large N and sparse data, it should use a hash rather than an array
to count frequencies).
| [reply] [d/l] |