Hi people, I need a clever suggestion for the following problem. I have a set of coordinates, each with a tag (let's say a user name) associated with it:

    start,stop
    22,25 >uname1
    344,360 >uname2
    433,540 >uname3
    432,532 >uname4

On the other side I have another set of coordinates:

    start,stop
    21,23 >job_id1
    255,345 >job_id2
    345,355 >job_id3
    356,366 >job_id4

What I'm trying to figure out is which users ran which jobs at specific intervals, so I'm trying to map job_id intervals to uname intervals. The rule is that even if only part of a job_id interval crosses a uname interval, this should be reported. The thing is, there are over 20 million such intervals in each group (and the intervals overlap), and the allowed coordinate range is the same in both cases, spanning from 1 to 20 million.
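To make the overlap rule concrete, here is a minimal sketch of the naive pairwise check on the sample data above. It assumes the endpoints are inclusive (the post does not say), and the names `overlaps` and `brute_force_pairs` are made up for illustration. This is exactly the loop-within-a-loop that will not scale to 20 million intervals; it only pins down what counts as a hit.

```perl
use strict;
use warnings;

# Sample intervals from the post, stored as [start, stop, tag].
my @unames = ([22, 25, 'uname1'], [344, 360, 'uname2'],
              [433, 540, 'uname3'], [432, 532, 'uname4']);
my @jobs   = ([21, 23, 'job_id1'], [255, 345, 'job_id2'],
              [345, 355, 'job_id3'], [356, 366, 'job_id4']);

# Two closed intervals overlap when each one starts no later than
# the other one stops (ASSUMPTION: endpoints are inclusive).
sub overlaps {
    my ($first_start, $first_stop, $second_start, $second_stop) = @_;
    return $first_start <= $second_stop && $second_start <= $first_stop;
}

# Naive O(N*M) comparison of every job against every uname.
# Returns strings of the form "job_tag -> uname_tag".
sub brute_force_pairs {
    my ($uname_list, $job_list) = @_;
    my @pairs;
    for my $job (@$job_list) {
        for my $uname (@$uname_list) {
            push @pairs, "$job->[2] -> $uname->[2]"
                if overlaps($job->[0], $job->[1], $uname->[0], $uname->[1]);
        }
    }
    return @pairs;
}

print "$_\n" for brute_force_pairs(\@unames, \@jobs);
```

On the sample data this reports job_id1 against uname1, and job_id2, job_id3 and job_id4 against uname2; uname3 and uname4 start after every job has stopped.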

What I was thinking of is to use the Bit::Vector library: create a bit vector for each set, map both sets of coordinates onto the vector space, see where they overlap, and just remove the non-overlapping fields. But then it hit me: how would I track down which unames and job_ids those overlaps belong to? Then I thought about hashing. But how would I then find a coordinate among my hash keys that is less than X?

find: my $uname = $hash{ >= $start and <= $stop }  # ??? pseudocode, a hash cannot be queried like this
I mean, I would need to sort the hash keys and then loop through them to find the keys that are >= some start key (id) and <= some stop key (id).
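The sorted-keys idea can be sketched roughly as follows: sort the keys once, then use a binary search to jump to the first key >= the start and walk forward until the stop is passed. The hash `%uname_by_start` and the subs `lower_bound` and `keys_in_range` are hypothetical names for illustration, and keying on the start coordinate assumes starts are unique, which the real data may not guarantee.

```perl
use strict;
use warnings;

# Hypothetical hash keyed by start coordinate (tags taken from the post).
# ASSUMPTION: start coordinates are unique; otherwise store array refs.
my %uname_by_start = (22 => 'uname1', 344 => 'uname2',
                      433 => 'uname3', 432 => 'uname4');

# Sort the keys numerically once; every later lookup reuses this array.
my @sorted_starts = sort { $a <=> $b } keys %uname_by_start;

# Binary search: index of the first element >= $target,
# or scalar(@$sorted) if every element is smaller.
sub lower_bound {
    my ($sorted, $target) = @_;
    my ($lo, $hi) = (0, scalar @$sorted);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($sorted->[$mid] < $target) { $lo = $mid + 1 }
        else                           { $hi = $mid }
    }
    return $lo;
}

# All keys k with $start <= k <= $stop, in ascending order.
sub keys_in_range {
    my ($sorted, $start, $stop) = @_;
    my @found;
    for (my $i = lower_bound($sorted, $start);
         $i < @$sorted && $sorted->[$i] <= $stop; $i++) {
        push @found, $sorted->[$i];
    }
    return @found;
}

for my $key (keys_in_range(\@sorted_starts, 340, 433)) {
    print "$key => $uname_by_start{$key}\n";
}
```

Note that this only finds intervals whose start lies inside the queried range. An interval that starts before the range but reaches into it would be missed, so this is a lookup primitive rather than a complete overlap search.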

And now I'm stuck and crying to you for help.

So let me summarize my problem: I need to map, where possible, job_id intervals to uname intervals and preserve the >uname1 >job_id2 tags. Keep in mind that these datasets have piled up over the years and are quite large, so a simple loop within a loop would not be a good solution.

thank you

baxy

PS

The maximum for the coordinates in both cases is 20000000.

PPS

to moritz

This is a fraction of the real data set, but don't worry about that. The dataset is, as I said, large, and I cannot hand-pick truly representative data to illustrate the problem.

These are the job lines:

    14230157,14230182,3445:7:3:707:620
    3437306,3439308,3445:7:3:990:634
    14593103,14593128,3445:7:3:537:287
    16948765,16948768,127305:7:3:49:800
    12044820,12044845,127303:7:3:686:44
    11310494,11310519,127340:7:3:67:320
    19408728,19408753,127438:7:3:508:614
    17007683,17007685,127439:7:3:481:403

And these are the unames:

    16820359,16821584,5:7:3:1:5
    17979480,17999505,4:7:3:948:200
    12491787,14491812,4:7:3:784:575
    17389967,18389969,34:7:3:617:920
    11671837,19671839,34:7:3:516:921

As I said, this is probably not a good example for illustrating the problem, so please do refer to the example above :)

In reply to mapping coordinates- suggestion needed by baxy77bax
