Best Pairs

artist has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Best Pairs (O(n) 15mins/million if ...) by BrowserUk (Patriarch) on Nov 08, 2003 at 01:50 UTC
If the number of discrete values that can appear in your small arrays is less than 32 -- they do not have to be contiguous -- then you can represent each array by a 32-bit number, with each bit representing one of the numbers. This reduces the space required for each million arrays of up to 32 numbers from around 900MB to 4MB. `...101001010010001 \|\|\|\|\| \|\|\|\|+- 1 \|\|\|+-- 16 \|\|\|+-- 75 \|\|+--- 204 \|+---- 205 +----- 206 etc.` [download] This also allows the 528 possible pairs to each be represented by a single, 32-bit number with 2-bits set. Counting the pairs becomes a process of iterating the 'arrays' and masking them against the pairs, stored as keys in a hash,and incrementing the values. Finding the top three counts in the resultant hash is easy. This process reduces the problem to O(number of arrays), and on my machine allows them to count the pairs at a rate of a little over 1 second per thousand, or around fifteen minutes/1 million with a storage requirement of 4MB per million + plus startup. #! perl -slw use strict; use List::Util qw[ max ]; use Benchmark::Timer; my $T = new Benchmark::Timer; our $ARRAYS \|\|= 10; our $K \|\|= 3; # Generate all the possible combinations of pairs # And create a hash with them as the keys. my %pairs; for my $n ( 0 .. 19 ){ $pairs{ 1<<$n \| 2<<$_ } = 0 for $n .. 19; }; # Generate an 'array' of 1 million 'arrays' with up to 20 # numbers in each. my $arrays = ''; vec( $arrays, $_, 32 ) = int rand( 1 << 19 ) for 1 .. $ARRAYS; # Count the pairs $T->start( 'Count the pairs' ); for my $pair ( keys %pairs ) { for ( 0 .. $ARRAYS-1 ) { $pairs{ $pair }++ if ( vec( $arrays, $_, 32 ) & $pair ) == $pa +ir; } } $T->stop( 'Count the pairs' ); # Find the top K pairs $T->start( 'Find the top K' ); my @topK = ( keys %pairs )[ 1 .. 3 ]; for my $pair ( keys %pairs ) { if( $pairs{ $pair } > max( @pairs{ @topK } ) ) { push @topK, $pair; shift @topK if @topK > 3; } } $T->stop( 'Find the top K' ); # Display the results for my $k ( @topK ) { printf 'Pair [ %d, %d ] ', grep{ $k & (1<<$_) ? $_ : () } 0 .. 19; printf 'appeared %d times'. $/, $pairs{ $k }; } $T->report; print 'Record memory usage';<stdin>; __END__ P:\test>305470 -ARRAYS=10000 -K=3 Pair [ 12, 14 ] appeared 2521 times Pair [ 4, 17 ] appeared 2579 times Pair [ 7, 17 ] appeared 2582 times 1 trial of Count the pairs (10.315s total) 1 trial of Find the top K (10.014ms total) Record memory usage 2772k P:\test>305470 -ARRAYS=100000 -K=3 Pair [ 4, 17 ] appeared 25165 times Pair [ 7, 16 ] appeared 25190 times Pair [ 7, 11 ] appeared 25348 times 1 trial of Count the pairs (106.954s total) 1 trial of Find the top K (10.015ms total) Record memory usage 3124k P:\test>305470 -ARRAYS=200000 -K=3 Pair [ 9, 17 ] appeared 50261 times Pair [ 7, 13 ] appeared 50356 times Pair [ 9, 13 ] appeared 50386 times 1 trial of Count the pairs (213.307s total) 1 trial of Find the top K (10.014ms total) Record memory usage 3524k [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail Hooray! Wanted!	[reply] [d/l] [select]
Re: Best Pairs by Anonymous Monk on Nov 08, 2003 at 01:46 UTC
The following is relatively simple and efficient. The result is the top K pairs, with preference given to higher numbers in the event of a tie. Integer data from 1 to N allow us to use an array in place of a hash (the indices of the array form the required "set"): `# preconditions: my $elem = 1; my $k = 3; my $n_arrays = [map [split], <DATA>]; # algorithm: my @pairs = (0) x (@$n_arrays + 1); for (@$n_arrays) { next unless grep $_==$elem, @$_[0..$elem]; $_ != $elem and $pairs[$_]++ for @$_; } my @top_k = (sort{$pairs[$a]<=>$pairs[$b]} 1..$#pairs)[reverse -$k..-1 +]; print "@top_k\n"; __END__ 2 4 5 7 8 10 1 2 5 6 7 9 2 6 7 8 9 10 1 3 5 10 1 3 4 5 6 8 9 1 2 4 6 1 2 4 5 7 10 1 3 4 6 7 8 9` [download]	[reply] [d/l]
Re: Re: Best Pairs by Roger (Parson) on Nov 08, 2003 at 02:11 UTC
I think this approach is ok for small set of data, and gets inefficient when the input data is in the millions of lines, your algorithm requires the entire file to be read in memory (what if 200million lines?) To be more efficient, you should be looking for an algorithm that has a smaller/predictable memory footprint, reading the entire data file into memory is not an ideal option.	[reply]
Re: Re: Re: Best Pairs by Anonymous Monk on Nov 08, 2003 at 02:19 UTC
I was coding to the original spec which clearly stated that he had N arrays, not a text file or a database or some other datastore from which to draw the data in less memory consumptive chunks! I only used reading in from DATA to create the arrays and satisfy the precondition of the problem statement.	[reply]
Re: Best Pairs by Jaap (Curate) on Nov 07, 2003 at 22:49 UTC
I don't understand what you mean with maximum pair. How is it calculated?	[reply]
Re: Re: Best Pairs by jweed (Chaplain) on Nov 07, 2003 at 23:27 UTC
I believe that the idea is this: There is a set of (sorted?) arrays, which contain numbers. Given a number (say, 1) and the number of pairs (say, 3), find the 3 combinations of two numbers, one of which is one, occurring in each array. Note that the 4 in the first array in the data set above does not form a pair with the 1 in the second array because they're not part of the same array. But the 1 and 4 in the last four arrays DO count. To me, the solution could come from extensive pre-processing (i.e. creation of an extensive hash data structure, which would allow really fast lookup but which would make it hard to preprocess) or to do a lookup for each entered query. I would go for the first option. I do have a few questions, however. Do numbers ever repeat w/i an array, like 1 1 2 3 4 4 5 6 You note that (1,5), (1,6), and (1,4) all appear four times, so your query for the 3 highest makes sense. What happens, though, if (1,3) also had 4 entries? If you ask for the top 3, should it just discard one pair randomly. Who is Kayser Söze?	[reply]
Re: Best Pairs by jweed (Chaplain) on Nov 08, 2003 at 00:47 UTC
Here is my try. It should be better commented, but the only tricky line is commented, thankfully. :) I must warn you that it is extremely ugly. I'm sure a more elegant solution is possible. It is instantaneous on your given data set (and returns the right values, joy!). Comments welcome. (I'm a noob so I need all the comments I can get...) EDIT: The code doesn't exactly work as instructed. It counts combinations when the two numbers are in different lines. Working on a fix... Read more... (2 kB) Who is Kayser Söze?	[reply] [d/l] [select]
Re: Best Pairs by Enlil (Parson) on Nov 08, 2003 at 01:59 UTC
Here is my take on this, but you did not mention what to do if the element did not exist at all, nor what to do if you ask for the top 3 elements, but there are 4 elements with the same max count (or 2 elements with the max count, and 3 elements with the next highest count), so the following code just returns the highest amounts found in that case (would return 4 elements, and 5 elements respectively). Read more... (1356 Bytes) -enlil	[reply] [d/l]
Re: Best Pairs by CountZero (Bishop) on Nov 08, 2003 at 10:16 UTC
A general optimisation remark: to reduce the memory foot-print, one can immediately reject all arrays in which the "anchor"-element ("1" in the given example) is not contained. In the given example this already reduces the number of arrays to process by 25%. CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law	[reply]
Re: Best Pairs by CountZero (Bishop) on Nov 08, 2003 at 11:14 UTC
My take on a solution: use strict; use warnings; my $anchor = 1; # This is the basic element of the pairs we are lookin +g for my %results; # The final results will be found in this hash foreach (<DATA>) { my %hash; foreach (split) { # Put the array into a hash $hash{$_}=1; } if ($hash{$anchor}) { # Only process the hash if the anchor is pre +sent foreach (keys %hash) { # increment the frequencies in the resu +lts-hash ++$results{$_} unless $_==$anchor; # but omit the anchor o +r the sort will not work! } } } for my $x(sort {return $results{$b} <=> $results{$a}} keys %results) { print "$x = $results{$x}\n"; } __DATA__ 2 4 5 7 8 10 1 2 5 6 7 9 2 6 7 8 9 10 1 3 5 10 1 3 4 5 6 8 9 1 2 4 6 1 2 4 5 7 10 1 3 4 6 7 8 9 [download] This solution only reads in each array once one at a time (low memory use and linear time for processing) and only processes the arrays where the anchor element of the pairs we are looking for is present (should be reasonably fast: on my system it processes 10000 arrays with max. 30 elements each in less than 2 seconds). It outputs a sorted list of the number of times each element is found together with the array-element in the same array. Getting the top K-elements out of the %results-hash is left as an exercise for the reader. CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law	[reply] [d/l]
Re: Re: Best Pairs by Not_a_Number (Prior) on Nov 09, 2003 at 23:00 UTC
Here's another version. I was about to post it, but then I saw CountZero's solution above, which uses pretty much the same procedure. However, as mine looks more concise, I thought, why not? FWIW: `my $elem = 1; my $k = 3; my %out; while ( <DATA> ) { next unless /\b$elem\b/; $out{$_}++ for split; } delete $out{$elem}; print "$_ ($out{$_} times)\n" for (sort { $out{$b} <=> $out{$a} } keys %out)[ 0 .. ( $k - 1 ) ];` [download] dave	[reply] [d/l]
Re: Re: Re: Best Pairs by CountZero (Bishop) on Nov 10, 2003 at 15:12 UTC
TIMTOWTDI ++ for Not a Number CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law	[reply]
Re: Best Pairs by Roger (Parson) on Nov 07, 2003 at 23:35 UTC
One 'efficient' algorithm I can think of - Presteps: Step 1 - build a hash that holds the combination pairs of 1..N (math notation N C 2) as search keys, values initialized to 0. Step 2 - while iterating through the input lines/arrays, for each combination of the elements on the current line, increment value of `$hash{$comb_idx}` by 1. The aim of the prestep is to reduce the data set you have to work with (N=20, number of pair-combinations is just 190). The result hash will contain a count of the combinations in the data set. The hash key can be, simply, $n1 . '-' . $n2. Eg., '1-6', '3-18', etc. Poststeps: Step 1. For the given element, build a new hash by grepping for the given element in the keys of the original hash, and assigning the `key=>count` in the hash. Step 2. Sort this hash to find the hash key for the top N counts, and decode the hash keys to get the other numbers in the pair. The beauty of this solution is that once the hash of pair-combination-count is built (in the initialization phase), looking for 'maximum pairs' would be lightening fast. I can implement the above algorithm in Perl later, but I think it should be piece of cake for you, besides, you are only looking for an algorithm afterall. :-)	[reply] [d/l] [select]
Re: Best Pairs by jonadab (Parson) on Nov 08, 2003 at 03:08 UTC
You may want a compromise between the two extremes of caching the frequencies of all the pairs (which for large data sets will use too much RAM) and caching nothing (which for large data sets will use too much CPU). What about caching just a list of which sets each number occurs in? #!/usr/bin/perl my @set = map {[split /\s+/]} <DATA>; # We want for each number a list of which sets it's in: for (@set) { for (@$_) { push @{$sets{$_}}, $setnum+0; # +0 numifies undef. } $setnum++ } my ($element, $k) = (1, 3); # or =@ARGV or whatever. { my %count; # Note that this block could be a loop with different # values of $element and $k each time. for (@{$sets{$element}}) { for (uniq(grep { $_ != $element } @{$set[$_]})) { $count{$_}++; } } my @result = (sort { $count{$b} <=> $count{$a} } keys %count)[0..($k +-1)]; print "Results: ", (join ", ", map {"($element, $_)"} @result), $/; } sub uniq { my %used; return grep { !$used{$_}++ } @_ } __DATA__ 2 4 5 7 8 10 1 2 5 6 7 9 2 6 7 8 9 10 1 3 5 10 1 3 4 5 6 8 9 1 2 4 6 1 2 4 5 7 10 1 3 4 6 7 8 9 [download] `$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/` [download]	[reply] [d/l] [select]
Re: Re: Best Pairs by Anonymous Monk on Nov 08, 2003 at 04:22 UTC
Try printing out the values of your so-called intermediate sized cache `%sets` in your example. The expected size of your cache (for randomly distributed datasets) equals the size of the original dataset. The AM solution above is more efficient in both space and time (but for large N and sparse data, it should use a hash rather than an array to count frequencies).	[reply] [d/l]