in reply to Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce?

Responding to your "uber-update"...

Now, I don't have 80,000 machines.

But I do have a single machine that can run multiple processes.

But I hate writing code that does this, because threads are painful, forks are painful, you get race conditions, you have to use locks...

Aside from the pains you cite, there is also the inescapable truth that if you try to do more with just the one machine, you will hit a limit -- a plateau -- beyond which further "parallelizing" will hurt rather than help.

Whether your task is mainly I/O-bound, memory-bound, or CPU-bound, adding more instances of the task will, at some point, exacerbate the load on that resource to the point where improvements are not merely impossible but negated.

Maybe running three instances in parallel will be faster, overall, than running two at once and then one, but running four at once may well be slower than running two parallelized pairs in sequence. YMMV.
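
In other words: measure, don't guess. The cheapest way to find that plateau is to time the same batch of work at a few different degrees of parallelism. A minimal sketch using Parallel::ForkManager, where do_one_job is a hypothetical stand-in for whatever your real task is:

    use strict;
    use warnings;
    use Time::HiRes qw(time);
    use Parallel::ForkManager;

    my @jobs = 1 .. 24;    # stand-in work items

    # Hypothetical stand-in; swap in your real I/O-, memory-, or CPU-bound job.
    sub do_one_job { select(undef, undef, undef, 0.25) }    # ~250ms of "work"

    for my $workers (1, 2, 4, 8) {
        my $pm    = Parallel::ForkManager->new($workers);
        my $start = time;
        for my $job (@jobs) {
            $pm->start and next;    # parent: move on to the next job
            do_one_job($job);       # child: do the work
            $pm->finish;            # child exits
        }
        $pm->wait_all_children;
        printf "%2d workers: %.2fs\n", $workers, time - $start;
    }

Somewhere in that table of timings the curve flattens out, and may even bend back up: that's your plateau.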


Re^2: Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce?
by tphyahoo (Vicar) on Oct 20, 2006 at 17:51 UTC
    Yes, no doubt this is all true.

    But my point is that by hiding my parallelization code in a threadedreduce, and putting the "meat" of my program in a "function builder" such as grepbuilder / mapbuilder / urlgrabbuilder that relies on threadedreduce or nonthreadedreduce, I make both experimentation and maintenance easier.

    I also write code that is easily portable to running on more than one thread, or on more than one machine. Just change one line (see the sketch below) and see if you get an improvement. If there's no improvement, or things actually get worse, switch back to nonthreadedreduce / nondistributedreduce.
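
    To make that concrete, here is a rough sketch of the shape I have in mind (fetch_url is just a placeholder for the real "meat", and per-item threading like this is only sensible for a handful of I/O-bound jobs):

        use strict;
        use warnings;
        use threads;
        use LWP::Simple qw(get);

        # Placeholder "meat": fetch one URL, empty string on failure.
        sub fetch_url { my $page = get($_[0]); defined $page ? $page : '' }

        # Sequential reduce: apply $func to each item in turn.
        sub nonthreadedreduce {
            my ($func, @items) = @_;
            return map { $func->($_) } @items;
        }

        # Threaded reduce: one thread per item, then collect the results.
        sub threadedreduce {
            my ($func, @items) = @_;
            my @threads = map { threads->create($func, $_) } @items;
            return map { $_->join } @threads;
        }

        # Function builder: bakes the "meat" together with a reduce strategy.
        sub urlgrabbuilder {
            my ($reduce) = @_;
            return sub { $reduce->(\&fetch_url, @_) };
        }

        # The one line you change while experimenting:
        my $grab = urlgrabbuilder(\&threadedreduce);
        # my $grab = urlgrabbuilder(\&nonthreadedreduce);

        my @pages = $grab->('http://example.com/a', 'http://example.com/b');

    Swapping the reduce strategy never touches the "meat", which is the whole point.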

    These days, it feels like I'm always second-guessing my thread code rather than concentrating on the meat of my program.