I'm always suspicious of large round numbers. Even more so of coincident large round numbers.
All too frequently, these are an indication that some "future-proofing" guestimation has been brought to bear upon the real problem.
We're currently dealing with nK items per time period--but we're a young, go-get-'em company in our first year, and we'll have exponential growth over the next 20 quarters, so we have to plan for nK * 1.(over-optimistic-guess)^20. Rounded up to the next 'nice' number, of course.
And in the process, what could be a nice, fast, self-contained, memory-resident algorithm becomes a messy, laborious, 1000s-of-times-slower, disk-based solution.
It may seem sensible at first glance to "plan for the future" in this way--but then reality kicks in. Usually, the guestimates are way optimistic, but even if the final numbers are in the right ballpark, coding for them now does you a disservice.
This is because planning for those final numbers now--on your current hardware--fails to account for the fact that by the time you reach those numbers, you'll have gone through at least one, if not two, hardware upgrades. That's if you haven't been taken over, sold out or gone bust.
And with each new hardware upgrade, the limits and the cost of RAM roughly double and halve respectively, whereas the speed of disk remains relatively stagnant(*), meaning that the penalties of disk-based solutions effectively more than double.
(*) SSDs notwithstanding. Whilst they are (say) an order of magnitude faster than HDDs, they are still (say) two orders of magnitude slower than memory.
If, realistically, your current hardware can see you through the next 18 to 24 months with a memory-based solution, then chances are that by the time you start to approach its memory limits, it will be both feasible and cost effective to simply upgrade the hardware and avoid the need to move to a disk-based solution at all.
All of which may or may not apply here, but maybe it explains my suspicions. Now, back to the problem.
The solution that sprang to mind--ignoring the scale of the problem--involves reading each file once.
- Read the jobs(**) file, building a HoA keyed by timeslot (1..N), and push the id of each job onto the list for each timeslot during which that job is active.
Thus, for each timeslot you have a list of all the active jobs.
- Read the users(**) file.
- For each user record:
- for each timeslot during which they are active, look up that timeslot in the jobs hash and output a record saying this user could have been "responsible for" these jobs during this timeslot (a sketch of this in Perl follows below).
(**) Your description of the problem is vague (in many ways!), in that you don't actually state what information you are trying to extract. I've gone for Jobs/User in each timeslot. Reverse the order in which you process the files and you get Users/Job in each timeslot. Many consolidations are also possible.
Thus you get an O( J + U ) solution.
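To make that concrete, here is a minimal sketch of the single-pass version. The file names (jobs.txt and users.txt) and the whitespace-separated "id start_slot end_slot" record layout are my own assumptions, since you haven't shown your actual formats; the shape of the HoA build and the per-timeslot lookup is the point:

#!/usr/bin/perl
# Sketch only: assumes whitespace-separated "id start_slot end_slot" records
# in jobs.txt and users.txt; adjust the parsing to your real formats.
use strict;
use warnings;

my %activeJobs;    # HoA: timeslot => [ ids of the jobs active in that slot ]

open my $JOBS, '<', 'jobs.txt' or die "jobs.txt: $!";
while( <$JOBS> ) {
    my( $jobId, $start, $end ) = split;
    push @{ $activeJobs{ $_ } }, $jobId for $start .. $end;
}
close $JOBS;

open my $USERS, '<', 'users.txt' or die "users.txt: $!";
while( <$USERS> ) {
    my( $userId, $start, $end ) = split;
    for my $slot ( $start .. $end ) {
        next unless exists $activeJobs{ $slot };
        print "timeslot: $slot user: $userId overlaps with job ids: ",
              join( ' ', @{ $activeJobs{ $slot } } ), "\n";
    }
}
close $USERS;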
For your sample data, this produces output in a matter of seconds, but with a lot of duplication: many identical records (apart from the timeslot) are produced sequentially. These are easily consolidated as they are produced, to give:
timeslots: 12491787 .. 14230157 user: 4:7:3:784:575 overlaps with job ids: 3445:7:3:707:620
timeslots: 11671837 .. 12044820 user: 34:7:3:516:921 overlaps with job ids: 127303:7:3:686:44
timeslots: 12044820 .. 14230157 user: 34:7:3:516:921 overlaps with job ids: 3445:7:3:707:620
timeslots: 14230157 .. 14593103 user: 34:7:3:516:921 overlaps with job ids: 3445:7:3:537:287
timeslots: 14593103 .. 16948765 user: 34:7:3:516:921 overlaps with job ids: 127305:7:3:49:800
timeslots: 16948765 .. 17007683 user: 34:7:3:516:921 overlaps with job ids: 127439:7:3:481:403
timeslots: 17007683 .. 19408728 user: 34:7:3:516:921 overlaps with job ids: 127438:7:3:508:614
Which is more manageable.
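The consolidation itself needs nothing clever: hold back the previous record and only emit it once the non-timeslot part changes. Written here as a standalone filter over the verbose per-timeslot output of the sketch above (in the real thing I would fold it into the inner loop):

#!/usr/bin/perl
# Merges runs of lines that are identical apart from their timeslot
# into a single "timeslots: first .. last ..." range record.
use strict;
use warnings;

my( $first, $last, $prevRest );

while( <> ) {
    my( $slot, $rest ) = /^timeslot: (\d+) (.+)$/ or next;
    if( defined $prevRest and $rest eq $prevRest ) {
        $last = $slot;    # same record; extend the current range
    }
    else {
        print "timeslots: $first .. $last $prevRest\n" if defined $prevRest;
        ( $first, $last, $prevRest ) = ( $slot, $slot, $rest );
    }
}
print "timeslots: $first .. $last $prevRest\n" if defined $prevRest;    # flush the final range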
But, now we have to consider the scale as stated. Building up a single hash of all the jobs active within every one of those 20e6 timeslots would take a huge amount of memory.
However, because timeslots are the constant in the processing of the two files, it is relatively easy to perform the processing in multiple passes. Each pass deals with only 20e6/number-of-passes of the timeslots, giving an effective mechanism for controlling the memory usage of each pass.
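A rough sketch of that multi-pass variant, reusing the assumed file names and record layout from the earlier sketch; $N and $PASSES are placeholders you would set from your real numbers and the memory you actually have:

#!/usr/bin/perl
# Each pass builds (and scans) the hash for only one window of timeslots,
# so memory use is bounded by the window size, not by all 20e6 slots.
use strict;
use warnings;

my $N      = 20_000_000;    # total number of timeslots (assumed)
my $PASSES = 20;            # tune to the memory available
my $window = int( $N / $PASSES ) + 1;

for my $pass ( 0 .. $PASSES - 1 ) {
    my( $lo, $hi ) = ( $pass * $window + 1, ( $pass + 1 ) * $window );
    my %activeJobs;

    open my $JOBS, '<', 'jobs.txt' or die "jobs.txt: $!";
    while( <$JOBS> ) {
        my( $jobId, $start, $end ) = split;
        next if $end < $lo or $start > $hi;    # job never active in this window
        $start = $lo if $start < $lo;          # clamp the range to the window
        $end   = $hi if $end   > $hi;
        push @{ $activeJobs{ $_ } }, $jobId for $start .. $end;
    }
    close $JOBS;

    open my $USERS, '<', 'users.txt' or die "users.txt: $!";
    while( <$USERS> ) {
        my( $userId, $start, $end ) = split;
        next if $end < $lo or $start > $hi;    # user never active in this window
        $start = $lo if $start < $lo;
        $end   = $hi if $end   > $hi;
        for my $slot ( $start .. $end ) {
            next unless exists $activeJobs{ $slot };
            print "timeslot: $slot user: $userId overlaps with job ids: ",
                  join( ' ', @{ $activeJobs{ $slot } } ), "\n";
        }
    }
    close $USERS;
}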
This scales linearly(***), giving an O( ( J + U ) * P ) algorithm, which will probably be much quicker than an RDBMS 3-way join: even with optimisations, that is likely to be at least O( J * U ), and very likely O( J * U * p ), where p is a hidden multiplier derived not from available memory but from some arbitrary, application-agnostic 'blocking factor' used in the DBMS's file handling.
(***)As you get more memory, you can reduce the number of passes and get a linear speed-up. If you have multiple boxes (neither processes nor threads will do, because physical memory is the constraint), you can assign different passes to different boxes and again get a linear speed-up.
Of course, there are several other ways of grouping the results that may be more applicable to your (unstated) needs.
And that brings me back to "As i said i tried to simplify the problem as best as i could". The problem with simplifying problem descriptions this way is that, in this kind of brute-scale processing, it is often the application-specific details that hold the best promise of application-specific optimisations.
Seemingly insignificant details (say, that job ids are never reused, or that user sessions never last more than 500 timeslots at a time, or similar) can provide ways of reducing storage requirements, or of short-circuiting loops, that can have a significant impact upon the processing time required.
Don't take the O() notations too seriously; they omit details.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.