in reply to Re^2: mapping coordinates- suggestion needed
in thread mapping coordinates- suggestion needed

Let's see. If the "coordinates" of the time intervals are in seconds, 20e6 covers about 8 months, which doesn't really tally with your "those datasets have piled up over the years".

But if they were in minutes, then it represents 39 years. Why would anyone care what user ran what job 38 years ago? They've an above average chance of being dead already.

And then there is the "rules are that even if only part of the job_id_interval crosses the uname interval, this should be reported." bit. How can a userid be responsible for a particular job if they didn't log on until the last (second|minute|lesser known time unit) of the jobs life?

And then there's that coincidence between there being 20e6 logons, 20e6 jobs; & 20e6 intervals during which those logos and job runs occurred?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy
  • Comment on Re^3: mapping coordinates- suggestion needed

Replies are listed 'Best First'.
Re^4: mapping coordinates- suggestion needed
by baxy77bax (Deacon) on Oct 15, 2010 at 06:17 UTC
    ok,

    i tried to make the problem as simple as possible. the lines(coordinates for the job id's and uname's) are fair share time coefficients.

    at one point i was so pissed with sge fair-share system, priority list and job dispatcher that i wrote simmilar engine for myself. and this thing has now been running for a 1,5 years.

    reason i post this question is because you guys always manage to come up with a more elegant solution (and i'm not looking for an exact code, just a suggestion)

    And then there's that coincidence between there being 20e6 logons, 20e6 jobs; & 20e6 intervals during which those logos and job runs occurred?

    yes, logical observation and the answer would be obvious if i explained how exactly the grid engine works. but there would be too many if's and then's if i explained everything.As i said i tried to simplify the problem as best as i could

    cheers

      I'm always suspicious of large round numbers. Even more so of coincident large round numbers.

      All too frequently, these are an indication that some "future-proofing" guestimation has been brought to bear upon the real problem.

      We're currently dealing with nK items/time period--but we're a young, go-get'em company in our first year, and we'll have exponential growth over the next 20 quarters, so we have to plan for: nK * 1.over--optimistic--guess20. Rounded up to the next 'nice' number of course.

      And in the process, what could be a nice, fast, self-contained, memory-resident algorithm becomes a messy, laborious, 1000s-of-times-slower, disk-based solution.

      It may seem sensible at first glance to "plan for the future" in this way--but then reality kicks in. Usually, the guestimates are way optimistic, but even if the final numbers are in the right ballpark, coding for them now does you a disservice.

      This because planning for those final numbers now--on your current hardware--fails to account for the fact that by the time you reach those numbers, you'll have gone through at least one, if not 2 upgrades of hardware. That's if you haven't been taken over, sold out or gone bust.

      And with each new hardware upgrade, the limits--and cost--of ram roughly double/halve respectively. Whereas the speed of disk remains relatively stagnant(*), meaning that the penalties of disk-based solutions effectively more than double.

      (*) SSD notwithstanding. Whilst they are (say) an order of magnitude faster than HDD, they are still (say) 2 orders of magnitude slower than memory.

      If, realistically, your current hardware can see you through the next 18/24 months with a memory-based solution, then chances are that by the time you start to approach its memory limits, it will be both feasible and cost effective to simple upgrade the hardware and avoid the need to move to a disk-based solution at all.

      All of which may or may not apply here, but it maybe explains my suspicions. Now back to the problem.


      The solution that sprang to mind--ignoring the scale of the problem--involves reading each file once.

      1. Read the jobs(**) file, build a HoA's, keyed by timeslot (1..N), and push the id of each job that is active during that time slot.

        Thus, for each timeslot you have a list of all the active jobs.

      2. Read the users(**) file.
        1. For each user record:
        2. for each timeslot they are active.

          Lookup that timeslot in the jobs hash and output a record saying this user could have been "responsible for" these jobs during this timeslot.

        (**)Your description of the problem is vague (in many ways!), in that you don;t actually state what information you are trying to extract. I've gone for Jobs/User in each timeslot. Reverse the order you process the file and you get Users/Job in each timeslot. Many consolidations are also possible,

      Thus you get an O( J + U ) solution.

      For your sample data, this produces the following in a matter of seconds:

      timeslot: 14230157 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230158 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230159 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230160 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230161 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230162 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230163 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230164 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230165 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230166 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230167 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230168 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230169 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230170 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230171 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230172 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230173 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230174 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230175 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230176 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230177 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230178 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230179 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230180 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230181 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230182 user: 4:7:3:784:575 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 12044820 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044821 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044822 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044823 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044824 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044825 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044826 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044827 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044828 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044829 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044830 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044831 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044832 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044833 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044834 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044835 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044836 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044837 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044838 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044839 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044840 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044841 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044842 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044843 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044844 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 12044845 user: 34:7:3:516:921 overlaps with job ids: 1 +27303:7:3:686:44 timeslot: 14230157 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230158 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230159 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230160 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230161 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230162 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230163 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230164 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230165 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230166 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230167 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230168 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230169 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230170 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230171 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230172 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230173 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230174 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230175 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230176 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230177 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230178 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230179 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230180 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230181 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14230182 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:707:620 timeslot: 14593103 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593104 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593105 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593106 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593107 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593108 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593109 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593110 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593111 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593112 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593113 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593114 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593115 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593116 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593117 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593118 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593119 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593120 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593121 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593122 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593123 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593124 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593125 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593126 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593127 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 14593128 user: 34:7:3:516:921 overlaps with job ids: 3 +445:7:3:537:287 timeslot: 16948765 user: 34:7:3:516:921 overlaps with job ids: 1 +27305:7:3:49:800 timeslot: 16948766 user: 34:7:3:516:921 overlaps with job ids: 1 +27305:7:3:49:800 timeslot: 16948767 user: 34:7:3:516:921 overlaps with job ids: 1 +27305:7:3:49:800 timeslot: 16948768 user: 34:7:3:516:921 overlaps with job ids: 1 +27305:7:3:49:800 timeslot: 17007683 user: 34:7:3:516:921 overlaps with job ids: 1 +27439:7:3:481:403 timeslot: 17007684 user: 34:7:3:516:921 overlaps with job ids: 1 +27439:7:3:481:403 timeslot: 17007685 user: 34:7:3:516:921 overlaps with job ids: 1 +27439:7:3:481:403 timeslot: 19408728 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408729 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408730 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408731 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408732 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408733 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408734 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408735 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408736 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408737 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408738 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408739 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408740 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408741 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408742 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408743 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408744 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408745 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408746 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408747 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408748 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408749 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408750 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408751 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408752 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614 timeslot: 19408753 user: 34:7:3:516:921 overlaps with job ids: 1 +27438:7:3:508:614

      You'll see that there is a lot of duplication in there. Many identical records (apart from timeslot) are produced sequentially. These are easily consolidated as they are produced to give:

      timeslots: 12491787 .. 14230157 user: 4:7:3:784:575 overlaps wi +th job ids: 3445:7:3:707:620 timeslots: 11671837 .. 12044820 user: 34:7:3:516:921 overlaps wi +th job ids: 127303:7:3:686:44 timeslots: 12044820 .. 14230157 user: 34:7:3:516:921 overlaps wi +th job ids: 3445:7:3:707:620 timeslots: 14230157 .. 14593103 user: 34:7:3:516:921 overlaps wi +th job ids: 3445:7:3:537:287 timeslots: 14593103 .. 16948765 user: 34:7:3:516:921 overlaps wi +th job ids: 127305:7:3:49:800 timeslots: 16948765 .. 17007683 user: 34:7:3:516:921 overlaps wi +th job ids: 127439:7:3:481:403 timeslots: 17007683 .. 19408728 user: 34:7:3:516:921 overlaps wi +th job ids: 127438:7:3:508:614

      Which is more manageable.

      But, now we have to consider the scale as stated. Building up a single hash of all the jobs active within every one of those 20e6 timeslots would take a huge amount of memory.

      However, because timeslots are the constant in the processing of the two files, it is relatively easy to perform the processing in multiple passes. Each pass deals only with 20e6/no-of-passes of the timeslots, giving an effective mechanism for controlling the memory usage of each pass.

      This scales linearly(***) giving an O( ( J + U ) * P ) algorithm. Which will probably be much quicker than a RDBMS 3-way join, which even with optimisations is likely to be at least O( J*U ), and very likely O( J*U*p ) where the p is a hidden multiplier based on some arbitrary, application agnostic quantity based not on available memory, but rather some 'blocking factor' used in the DBMs file handling.

      (***)As you get more memory, you can reduce the number of passes and get a linear speed-up. If you have multiple boxes (neither processes nor threads will do, because physical memory is the constraint), you can assign different passes to different boxes and again get a linear speed-up.

      Of course, there are several other ways of grouping the results that may be more applicable to your (unstated) needs.

      And that brings me back to "As i said i tried to simplify the problem as best as i could". The problem with simplifying problem descriptions this way, is that in this kind of brute-scale processing, it is often application specific details that hold the best promise of application specific optimisations.

      Seemingly insignificant details, say: that job-ids are never reused; or user sessions never last more than 500 timeslots at a time; or similar, can provide for way of reducing storage requirements, or short-circuiting loops that can have a significant impact upon the processing time required.

      Don't take the O() notations too seriously, they omit details.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.