Hi

I realize this is a long shoot but maybe , just maybe someone willing to share can help me. The problem that I am trying to solve involves two integer arrays. One is static and one changes. Let x[] be the array whoes content does not change and y[] the one that changes. both arrays are 15000 int's long and they only contain values of 0 and 1. Example:

x[0 0 0 0 1 1 0 0 0 0 1 0 ...] y[1 0 0 0 1 0 1 0 1 0 0 0 ...]
there are about 5000000 y-arrays that i need to compare with 100000 x arrays. What I could do is to start from 0 and pairwise compare each entry(x1<=>y1, x1<=>y2 ... x2<=>y1, ...xn<=>yn). Though fast this is just not fast enough (by my calculations that could last several days on a cluster machine i have at my disposal and i need to repeat this app. 1000 times each month).
the other thing i could do is to preprocess y arrays so that i only store positions where 1's are. The for each y just check if x[y[i]] == 1 since there are about 500 times less 1's then 0's this is going to run much faster (and will save a lot of memory). And this is where i stopped. However, with this approach it takes me 1 sec to compare one x with all y arrays. When extrapolated to 100000 x arrays this turns out to be a lot of time, especially when considered that i need to repeat this 1000 times. Aditional information. My end result is not to know, for each (x,y) array how many 1's they share just to know what are the top 10 y arrays that share the most 1' with each x array. Distribution od 1's in both y and x arrays is random (by the looks of it they maybe uniformly distributed) My goal is to speed this comparison up at least 3 times because then i can do something with it. Obviously to speed this up i need to do as less comparisons (checkups) as possible. I was thinking about computing a hash key (a checksum of come sort) but then i would be able to identify only identical (x,y) pairs. I was also thinking of converting x array to a bit array but am not sure how will this help me speed-wise. I did rewrite this in c and i gained a lot of speed but still not enough. As you can see I am stuck. So if anyone has any input for me please do not hesitate to post, no matter how crazy or slow the idea is. And i apologize if this is not strictly speaking a Perl (syntax, code -wise) related issue.

Maybe there should be a subforum for this type of questions/discussions as Perl is being used more and more for theoretical/conceptual problem solving tasks , because you don't have to worry about technicalities and since a lot of people that do this type of work are coming from c there is no syntax leap.

Anyway, thank you

baxy


In reply to Comparing two arrays by baxy77bax

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.