baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

does anyone know an algorithm for solving the following problem faster than the solution proposed below?

Problem :

Let D be a 2d array of integers.

    [0] -> 1 4 6 8 9 ...
    [1] -> 1 3 5 6 20 ...
    [2] -> 2 3 4 5 6 ...
    [3] -> 5 7 8 9 12 ...
    [4] -> 3 5 8 11 13 ...
    [5] -> 4 5 7 8 9 ...
    [6] -> 1 4 5 7 8 ...
    ...
Integers range from 1 to 3000000. There are approximately 50000 ints in each [x] array; the number is not fixed and can be anywhere between 1 and 3000000. In each [x] array the numbers are always ordered from smallest to largest. Given n array indices (example: 1,2,5,6), find the top x integers in arrays [i_1]..[i_n] ([1],[2],[5],[6]) that are shared between them. In my case, if x is 2 then my top 2 int values would be:
1. 5 -> shared between all four arrays ([1], [2], [5], [6])
2. 4 -> shared between 3 arrays ([2], [5], [6])
My solution (for the above small example): build a hash table using the numbers in each selected [d] array as keys, and simply increment the count each time a number is encountered. Afterwards, sort the hash by count (biggest to smallest) and pick the first two.
    for (1,2,5,6) {
        foreach my $ar (@{ $D[$_] }) {
            $hash{$ar}++;
        }
    }
    # sort %hash
    # pick top two
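
A minimal runnable version of that counting approach, with the sort-and-pick step filled in, might look like the following (the sample data, variable names and the top-x selection are my own illustration, not code from the post):

    use strict;
    use warnings;

    # Hypothetical small data set in the shape described above.
    my @D = (
        [ 1, 4, 6,  8,  9 ],    # [0]
        [ 1, 3, 5,  6, 20 ],    # [1]
        [ 2, 3, 4,  5,  6 ],    # [2]
        [ 5, 7, 8,  9, 12 ],    # [3]
        [ 3, 5, 8, 11, 13 ],    # [4]
        [ 4, 5, 7,  8,  9 ],    # [5]
        [ 1, 4, 5,  7,  8 ],    # [6]
    );

    my @query = (1, 2, 5, 6);   # indices of the inner arrays to intersect
    my $x     = 2;              # how many top values to report

    # Count how many of the queried arrays each integer appears in.
    my %count;
    for my $i (@query) {
        $count{$_}++ for @{ $D[$i] };
    }

    # Sort by count, descending, and keep the first $x values.
    my @top = (sort { $count{$b} <=> $count{$a} } keys %count)[0 .. $x - 1];
    printf "%d (in %d of %d arrays)\n", $_, $count{$_}, scalar @query for @top;

This prints 5 (in 4 of 4 arrays) and 4 (in 3 of 4 arrays), matching the expected result above; the memory footprint of %count is exactly the concern raised just below.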

However, such a solution, if repeated a large number of times or if the numbers are large, tends not to be practical. Does anyone have any suggestion on how to pre-process the 2d array in order to speed up the computation and save memory (hashes are expensive when it comes to memory)?

thnx

PS

I was thinking of a 2D-RMQ solution but I haven't looked into it yet, hoping that there might be a slicker solution.

Replies are listed 'Best First'.
Re: Fast algorithm for 2d array queries
by BrowserUk (Patriarch) on Feb 07, 2014 at 10:01 UTC

    Problem: You've spent over half your post describing your solution for a minimal version of your problem, one that doesn't scale. But you haven't actually stated what the problem is.

    For example: you've told us that you have a 2d array of integers; that those integers range between 1 & 3e6 (twice); and that there are ~50000 in each inner array.

    But you fail to say how big the main array is?

    Or what you need to look up. Ie. what do you start with; and what information do you need to end up with?

    Or how many lookups you need to do?

    And will you do these lookups once? Or once a week? Or once an hour?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Fair enough

      "But you fail to say how big the main array is? "

      Not that big: the size of the main array is 3000. n is 500-1000 (the number of inner arrays from which I need to pick).

      "Or what you need to look up. Ie. what do you start with; and what information do you need to end up with?"

      I start with my query array, which is 500-1000 different ints between 0 and 2999, and I need to end up with the int value from the inner arrays that is the most prevalent across those picked 500-1000 arrays (the intersect that has the highest number of elements).

      "Or how many lookups you need to do? "

      About 15000000 per hour.

        "Or how many lookups you need to do? " about 15000000 per hour.

        That's over 4000 per second, or one every quarter of a millisecond.

        In that time you want to intersect 500 to 1000 (from 3,000) sets of 50,000 integers and extract the single most populous integer across them all.

        Do you have a 3000-machine cluster available to throw at this problem?

Re: Fast algorithm for 2d array queries
by oiskuu (Hermit) on Feb 07, 2014 at 23:57 UTC

    A problem rather similar to Comparing two arrays, don't you think? Sparse matrix again, this time 3k by 3M booleans, density 1/60. Exact same solution is viable, too: pack your int-vectors as "l/l", merge-sort n query vectors and scan. Perhaps an 8-core machine to achieve 4k queries per sec.
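
    A rough sketch of how I read the pack / merge-and-scan idea (a minimal sketch, under assumptions: the 'l*' template, the plain numeric sort standing in for a true k-way merge of the already-sorted vectors, and the sub/array names are all my own; a real merge of the sorted vectors is what buys the speed):

        use strict;
        use warnings;

        # Assume each inner array has been packed once, up front, e.g.
        #   $packed[$i] = pack 'l*', @{ $D[$i] };   # sorted 32-bit ints
        # (@packed, @query and most_shared are illustrative names.)

        sub most_shared {
            my ($packed, @query) = @_;

            # Merge the queried vectors into one sorted stream. A plain sort
            # keeps the sketch short; merging the already-sorted vectors is
            # what a tuned (or SSE) version would do instead.
            my @all = sort { $a <=> $b } map { unpack 'l*', $packed->[$_] } @query;

            # Scan for the longest run of equal values. Values are unique
            # within a vector, so the run length is the number of queried
            # vectors that contain the value.
            my ($best, $best_count, $run) = (undef, 0, 0);
            for my $i (0 .. $#all) {
                $run = ($i && $all[$i] == $all[$i - 1]) ? $run + 1 : 1;
                ($best, $best_count) = ($all[$i], $run) if $run > $best_count;
            }
            return ($best, $best_count);    # most shared value and its count
        }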

    A speed vs. memory trade-off is also possible. Pairwise intersections of your 3000 vectors amount to 4.5M vectors of size ~833. A lookup with small n==4 is 6 combinations, i.e. merging 6*833 == 5k elements instead of 4*50k == 200k elements. About a 30-fold speed-up at the cost of 14 GB of memory.
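
    If I read the trade-off right, it works because a value present in k of the n queried arrays shows up in k*(k-1)/2 of the pairwise intersections, so the most frequent value across the merged pairwise lists is still the most shared one. A minimal sketch under those assumptions (helper and variable names are my own, and a hash is used for counting purely for brevity; the packed-vector merge above would replace it in a tuned version):

        use strict;
        use warnings;

        # Pre-processing, done once: intersect every pair of sorted inner arrays.
        sub intersect_sorted {
            my ($left, $right) = @_;
            my @out;
            my ($i, $j) = (0, 0);
            while ($i < @$left && $j < @$right) {
                if    ($left->[$i] < $right->[$j]) { $i++ }
                elsif ($left->[$i] > $right->[$j]) { $j++ }
                else  { push @out, $left->[$i]; $i++; $j++ }
            }
            return \@out;   # sorted intersection
        }

        # $pair->{"$i,$j"} (i < j) holds the precomputed intersection of
        # $D[$i] and $D[$j]. At query time, merge the C(n,2) pairwise lists
        # and count: the highest count identifies the most shared value.
        sub most_shared_from_pairs {
            my ($pair, @query) = @_;
            my %count;
            for my $p (0 .. $#query - 1) {
                for my $q ($p + 1 .. $#query) {
                    my ($i, $j) = sort { $a <=> $b } @query[$p, $q];
                    $count{$_}++ for @{ $pair->{"$i,$j"} };
                }
            }
            my ($best) = sort { $count{$b} <=> $count{$a} } keys %count;
            return $best;
        }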

    GPU-based solution would be quite interesting, but for that you really ought to ask another forum.

      Perhaps an 8-core machine to achieve 4k queries per sec.

      Prove it :)


        See the referenced thread. I later implemented a partial SSE version as well (merge using SSE, scan not optimized). Result:

        Total 301068 elements in 30 vectors
        timethis for 5:  5 wallclock secs ( 5.31 usr +  0.00 sys =  5.31 CPU) @ 439.36/s (n=2333)
        Update:
        Total 200752 elements in 4 vectors
        timethis for 5:  6 wallclock secs ( 5.28 usr +  0.00 sys =  5.28 CPU) @ 1326.33/s (n=7003)
        Update2: Right you are, BrowserUk, I was considering the small-n case only.
        Total 50063728 elements in 1000 vectors
        timethis for 5:  6 wallclock secs ( 6.32 usr +  0.00 sys =  6.32 CPU) @ 0.63/s (n=4)