baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

does anyone know an algorithm for solving the following problem faster than the solution proposed below?

Problem :

Let D be a 2d array of integers.

    [0] -> 1 4 6 8 9 ...
    [1] -> 1 3 5 6 20 ...
    [2] -> 2 3 4 5 6 ...
    [3] -> 5 7 8 9 12 ...
    [4] -> 3 5 8 11 13 ...
    [5] -> 4 5 7 8 9 ...
    [6] -> 1 4 5 7 8 ...
    ...
Integers range from 1 to 3000000. There are approximately 50000 ints in each [x] array; the number is not fixed and can be anywhere between 1 and 3000000. In each [x] array the numbers are always ordered from smallest to largest. Given n array indices (example: 1,2,5,6), find the top x integers in arrays [i_1]..[i_n] ([1],[2],[5],[6]) that are shared between them. In my case, if x is 2 then my top 2 int values would be:
1. 5 -> shared between all four arrays ([1], [2], [5], [6])
2. 4 -> shared between 3 arrays ([2], [5], [6])
My solution (for the above small example): build a hash table using the numbers in each selected [d] array as keys, and simply increment the count each time a number is encountered. Afterwards, sort the hash by count (biggest to smallest) and pick the first two.
    for (1,2,5,6) {
        foreach my $ar (@{ $D[$_] }) {
            $hash{$ar}++;
        }
    }
    # sort %hash
    # pick top two
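
A minimal runnable version of that counting approach, with the sort-and-pick step filled in, might look like the following (the sample data, variable names and the top-x selection are my own illustration, not code from the post):

    use strict;
    use warnings;

    # Hypothetical small data set in the shape described above.
    my @D = (
        [ 1, 4, 6,  8,  9 ],    # [0]
        [ 1, 3, 5,  6, 20 ],    # [1]
        [ 2, 3, 4,  5,  6 ],    # [2]
        [ 5, 7, 8,  9, 12 ],    # [3]
        [ 3, 5, 8, 11, 13 ],    # [4]
        [ 4, 5, 7,  8,  9 ],    # [5]
        [ 1, 4, 5,  7,  8 ],    # [6]
    );

    my @query = (1, 2, 5, 6);   # indices of the inner arrays to intersect
    my $x     = 2;              # how many top values to report

    # Count how many of the queried arrays each integer appears in.
    my %count;
    for my $i (@query) {
        $count{$_}++ for @{ $D[$i] };
    }

    # Sort by count, descending, and keep the first $x values.
    my @top = (sort { $count{$b} <=> $count{$a} } keys %count)[0 .. $x - 1];
    printf "%d (in %d of %d arrays)\n", $_, $count{$_}, scalar @query for @top;

This prints 5 (in 4 of 4 arrays) and 4 (in 3 of 4 arrays), matching the expected result above; the memory footprint of %count is exactly the concern raised just below.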

However, such a solution, if repeated a large number of times or if the numbers are large, tends not to be practical. Does anyone have any suggestion on how to pre-process the 2d array in order to speed up the computation and save memory (hashes are expensive when it comes to memory)?

thnx

PS

I was thinking of a 2D-RMQ solution but I haven't looked into it yet, hoping that there might be a slicker solution.

Replies are listed 'Best First'.
Re: Fast algorithm for 2d array queries
by BrowserUk (Patriarch) on Feb 07, 2014 at 10:01 UTC

    Problem: You've spent over half your post describing your solution for a minimal version of your problem, one that doesn't scale. But you haven't actually stated what the problem is.

    For example: you've told us that you have a 2d array of integers; that those integers range between 1 & 3e6 (twice); and that there are ~50000 in each inner array.

    But you fail to say how big the main array is?

    Or what you need to look up. Ie. what do you start with; and what information do you need to end up with?

    Or how many lookups you need to do?

    And will you do these lookups once? Or once a week? Or once an hour?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Fair enough

      "But you fail to say how big the main array is? "

      Not that big: the size of the main array is 3000. n is 500-1000 (the number of inner arrays from which I need to pick).

      "Or what you need to look up. Ie. what do you start with; and what information do you need to end up with?"

      I start with my query array, which is 500-1000 different ints between 0 and 2999, and I need to end up with the int value from the inner arrays that is the most prevalent across those picked 500-1000 arrays (the intersect that has the highest number of elements).

      "Or how many lookups you need to do? "

      About 15000000 per hour.

        "Or how many lookups you need to do? " about 15000000 per hour.

        That's over 4000 per second, or one every quarter of a millisecond.

        In that time you want to intersect 500 to 1000 (from 3,000) sets of 50,000 integers and extract the single most populous integer across them all.

        Do you have a 3000-machine cluster available to throw at this problem?

Re: Fast algorithm for 2d array queries
by oiskuu (Hermit) on Feb 07, 2014 at 23:57 UTC

    A problem rather similar to Comparing two arrays, don't you think? Sparse matrix again, this time 3k by 3M booleans, density 1/60. Exact same solution is viable, too: pack your int-vectors as "l/l", merge-sort n query vectors and scan. Perhaps an 8-core machine to achieve 4k queries per sec.
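
    A rough sketch of how I read the pack / merge-and-scan idea (a minimal sketch, under assumptions: the 'l*' template, the plain numeric sort standing in for a true k-way merge of the already-sorted vectors, and the sub/array names are all my own; a real merge of the sorted vectors is what buys the speed):

        use strict;
        use warnings;

        # Assume each inner array has been packed once, up front, e.g.
        #   $packed[$i] = pack 'l*', @{ $D[$i] };   # sorted 32-bit ints
        # (@packed, @query and most_shared are illustrative names.)

        sub most_shared {
            my ($packed, @query) = @_;

            # Merge the queried vectors into one sorted stream. A plain sort
            # keeps the sketch short; merging the already-sorted vectors is
            # what a tuned (or SSE) version would do instead.
            my @all = sort { $a <=> $b } map { unpack 'l*', $packed->[$_] } @query;

            # Scan for the longest run of equal values. Values are unique
            # within a vector, so the run length is the number of queried
            # vectors that contain the value.
            my ($best, $best_count, $run) = (undef, 0, 0);
            for my $i (0 .. $#all) {
                $run = ($i && $all[$i] == $all[$i - 1]) ? $run + 1 : 1;
                ($best, $best_count) = ($all[$i], $run) if $run > $best_count;
            }
            return ($best, $best_count);    # most shared value and its count
        }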

    A speed vs. memory trade-off is also possible. Pairwise intersections of your 3000 vectors amount to 4.5M vectors of size ~833. A lookup with small n==4 is 6 combinations, i.e. merging 6*833 == 5k elements instead of 4*50k == 200k elements. About a 30-fold speed-up at the cost of 14 GB of memory.
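
    If I read the trade-off right, it works because a value present in k of the n queried arrays shows up in k*(k-1)/2 of the pairwise intersections, so the most frequent value across the merged pairwise lists is still the most shared one. A minimal sketch under those assumptions (helper and variable names are my own, and a hash is used for counting purely for brevity; the packed-vector merge above would replace it in a tuned version):

        use strict;
        use warnings;

        # Pre-processing, done once: intersect every pair of sorted inner arrays.
        sub intersect_sorted {
            my ($left, $right) = @_;
            my @out;
            my ($i, $j) = (0, 0);
            while ($i < @$left && $j < @$right) {
                if    ($left->[$i] < $right->[$j]) { $i++ }
                elsif ($left->[$i] > $right->[$j]) { $j++ }
                else  { push @out, $left->[$i]; $i++; $j++ }
            }
            return \@out;   # sorted intersection
        }

        # $pair->{"$i,$j"} (i < j) holds the precomputed intersection of
        # $D[$i] and $D[$j]. At query time, merge the C(n,2) pairwise lists
        # and count: the highest count identifies the most shared value.
        sub most_shared_from_pairs {
            my ($pair, @query) = @_;
            my %count;
            for my $p (0 .. $#query - 1) {
                for my $q ($p + 1 .. $#query) {
                    my ($i, $j) = sort { $a <=> $b } @query[$p, $q];
                    $count{$_}++ for @{ $pair->{"$i,$j"} };
                }
            }
            my ($best) = sort { $count{$b} <=> $count{$a} } keys %count;
            return $best;
        }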

    GPU-based solution would be quite interesting, but for that you really ought to ask another forum.

      Perhaps an 8-core machine to achieve 4k queries per sec.

      Prove it :)


        See the referenced thread. I later implemented a partial SSE version as well (merge using SSE, scan not optimized). Result:

        Total 301068 elements in 30 vectors
        timethis for 5:  5 wallclock secs ( 5.31 usr +  0.00 sys =  5.31 CPU) @ 439.36/s (n=2333)
        Update:
        Total 200752 elements in 4 vectors
        timethis for 5:  6 wallclock secs ( 5.28 usr +  0.00 sys =  5.28 CPU) @ 1326.33/s (n=7003)
        Update2: Right you are, BrowserUk, I was considering the small-n case only.
        Total 50063728 elements in 1000 vectors
        timethis for 5:  6 wallclock secs ( 6.32 usr +  0.00 sys =  6.32 CPU) @ 0.63/s (n=4)