in reply to Re^2: mapping coordinates- suggestion needed
in thread mapping coordinates- suggestion needed

Let's see. If the "coordinates" of the time intervals are in seconds, 20e6 covers about 8 months, which doesn't really tally with your "those datasets have piled up over the years".

But if they were in minutes, then it represents about 38 years. Why would anyone care what user ran what job 38 years ago? They've an above average chance of being dead already.

And then there is the "rules are that even if only part of the job_id_interval crosses the uname interval, this should be reported." bit. How can a userid be responsible for a particular job if they didn't log on until the last (second|minute|lesser known time unit) of the job's life?

And then there's that coincidence between there being 20e6 logons, 20e6 jobs, and 20e6 intervals during which those logons and job runs occurred?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Re^4: mapping coordinates- suggestion needed
by baxy77bax (Deacon) on Oct 15, 2010 at 06:17 UTC
    ok,

    I tried to make the problem as simple as possible. The lines (coordinates for the job ids and unames) are fair-share time coefficients.

    At one point I was so pissed off with the SGE fair-share system, priority list and job dispatcher that I wrote a similar engine for myself, and this thing has now been running for 1.5 years.

    The reason I post this question is that you guys always manage to come up with a more elegant solution (and I'm not looking for exact code, just a suggestion).

    And then there's that coincidence between there being 20e6 logons, 20e6 jobs, and 20e6 intervals during which those logons and job runs occurred?

    Yes, a logical observation, and the answer would be obvious if I explained exactly how the grid engine works, but there would be too many ifs and thens if I explained everything. As I said, I tried to simplify the problem as best I could.

    cheers

      I'm always suspicious of large round numbers. Even more so of coincident large round numbers.

      All too frequently, these are an indication that some "future-proofing" guesstimation has been brought to bear upon the real problem.

      We're currently dealing with nK items per time period--but we're a young, go-get'em company in our first year, and we'll have exponential growth over the next 20 quarters, so we have to plan for nK * (1.<over-optimistic guess>)^20. Rounded up to the next 'nice' number, of course.

      And in the process, what could be a nice, fast, self-contained, memory-resident algorithm becomes a messy, laborious, 1000s-of-times-slower, disk-based solution.

      It may seem sensible at first glance to "plan for the future" in this way--but then reality kicks in. Usually, the guesstimates are way too optimistic, but even if the final numbers are in the right ballpark, coding for them now does you a disservice.

      This is because planning for those final numbers now--on your current hardware--fails to account for the fact that by the time you reach those numbers, you'll have gone through at least one, if not two, hardware upgrades. That's if you haven't been taken over, sold out or gone bust.

      And with each new hardware upgrade, the limits and cost of RAM roughly double and halve respectively, whereas the speed of disk remains relatively stagnant(*), meaning that the penalties of disk-based solutions effectively more than double.

      (*) SSD notwithstanding. Whilst they are (say) an order of magnitude faster than HDD, they are still (say) 2 orders of magnitude slower than memory.

      If, realistically, your current hardware can see you through the next 18 to 24 months with a memory-based solution, then chances are that by the time you start to approach its memory limits, it will be both feasible and cost-effective to simply upgrade the hardware and avoid the need to move to a disk-based solution at all.

      All of which may or may not apply here, but it maybe explains my suspicions. Now back to the problem.


      The solution that sprang to mind--ignoring the scale of the problem--involves reading each file once.

      1. Read the jobs(**) file, build a HoA keyed by timeslot (1..N), and push the id of each job that is active during that timeslot.

        Thus, for each timeslot you have a list of all the active jobs.

      2. Read the users(**) file.
        1. For each user record:
        2. For each timeslot during which they are active:

          Look up that timeslot in the jobs hash and output a record saying this user could have been "responsible for" these jobs during this timeslot.

        (**) Your description of the problem is vague (in many ways!), in that you don't actually state what information you are trying to extract. I've gone for Jobs/User in each timeslot. Reverse the order you process the files and you get Users/Job in each timeslot. Many consolidations are also possible.

      Thus you get an O( J + U ) solution.
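
      A minimal sketch of that approach in Perl. The file names and the record layout (whitespace-separated <id> <start_slot> <end_slot>) are assumptions made purely for illustration; adjust the parsing to your real formats:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my %activeJobs;    # HoA: timeslot => [ ids of jobs active in that slot ]

      # Step 1: index every job by the timeslots it spans.
      open my $jobs, '<', 'jobs.txt' or die $!;
      while( <$jobs> ) {
          my( $jobId, $start, $end ) = split;
          push @{ $activeJobs{ $_ } }, $jobId for $start .. $end;
      }
      close $jobs;

      # Step 2: for each user interval, report the jobs active in each of its slots.
      open my $users, '<', 'users.txt' or die $!;
      while( <$users> ) {
          my( $uname, $start, $end ) = split;
          for my $slot ( $start .. $end ) {
              next unless exists $activeJobs{ $slot };
              print "timeslot: $slot user: $uname overlaps with job ids: ",
                    join( ' ', @{ $activeJobs{ $slot } } ), "\n";
          }
      }
      close $users;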

      For your sample data, this produces its raw output in a matter of seconds.

      You'll see that there is a lot of duplication in that raw output: many records that are identical apart from their timeslot are produced sequentially. These are easily consolidated as they are produced, to give:

      timeslots: 12491787 .. 14230157 user: 4:7:3:784:575 overlaps with job ids: 3445:7:3:707:620
      timeslots: 11671837 .. 12044820 user: 34:7:3:516:921 overlaps with job ids: 127303:7:3:686:44
      timeslots: 12044820 .. 14230157 user: 34:7:3:516:921 overlaps with job ids: 3445:7:3:707:620
      timeslots: 14230157 .. 14593103 user: 34:7:3:516:921 overlaps with job ids: 3445:7:3:537:287
      timeslots: 14593103 .. 16948765 user: 34:7:3:516:921 overlaps with job ids: 127305:7:3:49:800
      timeslots: 16948765 .. 17007683 user: 34:7:3:516:921 overlaps with job ids: 127439:7:3:481:403
      timeslots: 17007683 .. 19408728 user: 34:7:3:516:921 overlaps with job ids: 127438:7:3:508:614

      Which is more manageable.
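
      One way to do that consolidation on the fly (a sketch only; the key string and output format are assumed to match the records above) is to hold back the previous record and only emit it once the run of identical records ends:

      my( $prevKey, $rangeStart, $rangeEnd );

      # Call this for every "user ... overlaps with job ids ..." record,
      # in timeslot order; $key is the whole record minus the timeslot.
      sub report {
          my( $slot, $key ) = @_;
          if( defined $prevKey and $key eq $prevKey ) {
              $rangeEnd = $slot;    # same record again: extend the current run
          }
          else {
              print "timeslots: $rangeStart .. $rangeEnd $prevKey\n" if defined $prevKey;
              ( $prevKey, $rangeStart, $rangeEnd ) = ( $key, $slot, $slot );
          }
      }

      # After the main loop, flush the final run:
      # print "timeslots: $rangeStart .. $rangeEnd $prevKey\n" if defined $prevKey;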

      But, now we have to consider the scale as stated. Building up a single hash of all the jobs active within every one of those 20e6 timeslots would take a huge amount of memory.

      However, because timeslots are the constant in the processing of the two files, it is relatively easy to perform the processing in multiple passes. Each pass deals only with 20e6/no-of-passes of the timeslots, giving an effective mechanism for controlling the memory usage of each pass.
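
      A sketch of that multi-pass variant, wrapping the earlier code in a loop over timeslot windows (the total slot count and the number of passes below are assumed values; tune the pass count to the memory you actually have):

      use strict;
      use warnings;

      my $MAX_SLOT = 20e6;    # assumed total number of timeslots
      my $PASSES   = 20;      # more passes => smaller hash => less memory per pass

      my $slotsPerPass = int( $MAX_SLOT / $PASSES ) + 1;

      for my $pass ( 0 .. $PASSES - 1 ) {
          my $lo = $pass * $slotsPerPass + 1;
          my $hi = $lo + $slotsPerPass - 1;

          my %activeJobs;    # only ever holds the slots $lo .. $hi

          # Index the jobs, clipped to this pass's window.
          open my $jobs, '<', 'jobs.txt' or die $!;
          while( <$jobs> ) {
              my( $jobId, $start, $end ) = split;
              next if $end < $lo or $start > $hi;
              $start = $lo if $start < $lo;
              $end   = $hi if $end   > $hi;
              push @{ $activeJobs{ $_ } }, $jobId for $start .. $end;
          }
          close $jobs;

          # Report the users, clipped to the same window.
          open my $users, '<', 'users.txt' or die $!;
          while( <$users> ) {
              my( $uname, $start, $end ) = split;
              next if $end < $lo or $start > $hi;
              $start = $lo if $start < $lo;
              $end   = $hi if $end   > $hi;
              for my $slot ( $start .. $end ) {
                  next unless exists $activeJobs{ $slot };
                  print "timeslot: $slot user: $uname overlaps with job ids: ",
                        join( ' ', @{ $activeJobs{ $slot } } ), "\n";
              }
          }
          close $users;
      }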

      This scales linearly(***), giving an O( ( J + U ) * P ) algorithm, which will probably be much quicker than an RDBMS 3-way join: even with optimisations, that is likely to be at least O( J*U ), and very likely O( J*U*p ), where p is a hidden multiplier based not on available memory, but on some arbitrary, application-agnostic 'blocking factor' used in the RDBMS's file handling.

      (***) As you get more memory, you can reduce the number of passes and get a linear speed-up. If you have multiple boxes (neither processes nor threads will do, because physical memory is the constraint), you can assign different passes to different boxes and again get a linear speed-up.

      Of course, there are several other ways of grouping the results that may be more applicable to your (unstated) needs.

      And that brings me back to "As I said, I tried to simplify the problem as best I could". The problem with simplifying problem descriptions this way is that, in this kind of brute-scale processing, it is often application-specific details that hold the best promise of application-specific optimisations.

      Seemingly insignificant details (say, that job ids are never reused, or that user sessions never last more than 500 timeslots at a time, or similar) can provide ways of reducing storage requirements, or of short-circuiting loops, that can have a significant impact upon the processing time required.

      Don't take the O() notations too seriously; they omit details.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.