Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce?

tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:

Okay, I'm revising my post a bit. Originally I just wanted to hear people's experiences with the recently-posted-to-cpan perl MapReduce.

But now, I'm asking a more general type question.

Basically, I'm thinking along the same lines as Joel in Can Your Programming Language Do This?

The MapReduce algo used by google is an abstraction which they use to simplify code that needs to run in a massively parallel way. A simple example of this would be Distributed Grep, which works the same way as grep, but takes advantage of the fact that you have 80,000 machines at your disposal to speed things up a bit. Same thing with Distributed Sort. Etc, etc.

Now, I don't have 80,000 machines.

But I do have a single machine that can run multiple processes.

But I hate writing code that does this, because threads are painful, forks are painful, you get race conditions, you have to use locks... arrrg.

So, what I would like to do is hide my parallel (threaded/forked/whatever) code in a functional way using MapReduce, and hide all the complexity of the parallelization in my personal MapReduce module.

If I happen to have 80,000 computers (I'm a spam king with a zombie horde... well, bad examples), I would use DistributedMapReduce and run code on multiple computers. If I only have a single computer, but have some extra cpu cycles and a place where it seems that threading cpeed things up, I would use ThreadedMapReduce.

Maybe both these modules could inherit from MapReduce in a sane way.

Does this make some kind of sense, or am I totally barking up the wrong tree?

If it makes sense, I say: let's build itand take some of the pain out of threading.

By the way, that last link was to a thread where BrowserUK -- who I consider sort of a threading guru -- writes some sample code to do something simple in a parallel way, but gets corrected by ikegami because he forgot to use locks. What I want is a kind of universe where this couldn't happen, because a "parallelize me" module in the background takes care of the messy details. Whether with threads, or distributed computers, or whatever.

UPDATE: Before someone else points this out, I am aware that adding threads to a slow running program doesn't always speed things up. Sometimes it does, sometimes no. It depends where your bottleneck is.

For example, my gut feeling is that ThreadedMRGrep (running through ThreadedMapReduce) might be a little faster than grep for a long list, but ThreadedSort wouldn't be much of a problem, because you'd hit the same computational bottleneck you hit with non-threaded grep. But I admit lack of experience and could be wrong here. But the example of threadedly downloading a bunch of web site and grepping against desired values in them would *definitely* be faster when run through ThreadedMapReduce. So, what I'm after really here is a framework for reducing program complexity, a powerful abstraction that can yield dividends when used with care.

And of course, I'd like it in perl :)

**********

ORIGINAL POST: I'm experimenting with functional programming, and stumbled across MapReduce, piping fresh on CPAN.

Has anyone noticed and/or tried this module?

This was wishlisted at perlmonks a couple of years ago:

Google's MapReduce

Now I'm wondering if I could use this technology to do something simple in a parallel way, like... using parallel processing to concatenate a string, where order of concatenation doesn't matter

UPDATE: I changed the title from "Has anyone noticed and/or tried the MapReduce recently uploaded to CPAN?"

UPDATE 2: After kicking this around some, I concluded that for my personal needs, just plain ThreadedReduce would be fine, and I think this is a bit of magic that would help lots of people and not just me. So, I am leaving ThreadedMap for another day, but I got started writing stub code for ThreadedReduce, including a failing test, further down in this thread: Re^2: Could there be a ThreadedMapReduce (instead of DistributedMapReduce)?. If someone can make my "not ok" an "ok" they will be my hero for today.

UPDATE 3: As a side note, I'm slowly starting to think that map (in the perl sense) can actually be built on top of reduce with functional programming, just like in my stub below, grep is built functionally on top of reduce. So, if you get parallelReduce, you also get parallelMap built on top of that with no mess. I still need to verify that by actually building map out of reduce, or having someone chime in that yes, that's right. But I'm almost 99% percent sure...

UPDATE 4: With regards to function builders, that last bit was not quite right. There really needs to be two base functions to build other functions out of. In terms of the "Map/Reduce" nomenclature, "Map" processes a set, not necessarily in order. This can be parallelized Reduce processes an array, or a vector, and order matters. This can't be parallelized -- of if it can, somebody please tell me how..

So for example, I could build ParallelGrep on top of ParallelMap, but not ParallelConcatenate. Concatenate would have to be built on top of reduce. And ParallelReduce as a function builder would be impossible.

ParallelSort could be built on top of Parallelmap, because the cmp function for any two elements doesn't care about the other elements. I'm not sure if that would be a good idea, but I guess I could try it to see with little work if I had ParallelMap working.

So, as I see it now, the kernel of this saga boils down to the task of writing ParallelMap. But I'm writing another post about that, which shall hopefully be posted shortly.

Comment on Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce?

Replies are listed 'Best First'.
Re: Has anyone noticed and/or tried the MapReduce recently uploaded to CPAN? by gaal (Parson) on Oct 20, 2006 at 09:47 UTC
I've not tried this module out. But if functional programming is what you're looking for, MapReduce isn't exactly it. It's more of a large-scale job control framework for computations that are what a functional program would be more likely to be designed upon (that is, pure functions). If you want to experience the kinds of tools that are associated with more day-to-day stuff, like using (probably lazy) lists, abstract datatypes, pattern matching, type inference, curried functions, composition, folds etc., this isn't where you'll find them. Two CPAN modules for Perl 5 that bring some of the goodies (incidentally both at version 0.03 right now) are fp and Language::Functional.	[reply]
Re: Could there be a ThreadedMapReduce (instead of DistributedMapReduce)? by Molt (Chaplain) on Oct 20, 2006 at 11:19 UTC
I haven't looked into MapReduce myself so this may not be what you're after, but if you're trying to take the pain out of threading I'd recommend having a look at Parallel::ForkManager. I've used it in the past and it seems sufficiently robust.	[reply]
Re^2: Could there be a ThreadedMapReduce (instead of DistributedMapReduce)? by tphyahoo (Vicar) on Oct 20, 2006 at 11:23 UTC
I originally started with P::FM as well, because it did look simple. But this uses forks, not threads, and so couldn't be applied to my problem. (Specifically, concatenating a string in a distributed way, where order wasn't important: using parallel processing to concatenate a string, where order of concatenation doesn't matter.) It didn't work, in a way that violated my expectations, because parallelization, like I said, is painful. In this case I would have needed to use threads. Perhaps there could be a ForkedMapReduce as well as a ThreadedMapReduce. The important thing to me is a reliable abstraction that Does What I Mean.	[reply]
Re^3: Could there be a ThreadedMapReduce (instead of DistributedMapReduce)? by Anonymous Monk on Oct 20, 2006 at 15:50 UTC
The important thing to me is a reliable abstraction that Does What I Mean. What's wrong with this?	[reply]
Re^4: Could there be a ThreadedMapReduce (instead of DistributedMapReduce)? by tphyahoo (Vicar) on Oct 20, 2006 at 16:16 UTC
Re^5: Could there be a ThreadedMapReduce (instead of DistributedMapReduce)? by BrowserUk (Patriarch) on Oct 20, 2006 at 20:14 UTC
Re^2: Could there be a ThreadedMapReduce (instead of DistributedMapReduce)? by tphyahoo (Vicar) on Oct 20, 2006 at 16:07 UTC
Okay, I have here stub code for a parallel "threadedreduce", using what I think is pretty much functional style programming. My goal is along the same lines as my post yesterday concerning concating letters (mentioned in op), but more honed. Basically, I have a slow function for demo purposes, and I want to speed it up by using code executed in parallel; but in such a way that I can abstract the ugly bigs (locks, threads, etcetera) into a module that I maintain separately. So, here, my function in in the coderef: `$slow_matches_b` It's slow because it sleeps for one second. And what it does it, it returns true if the input matches the letter b. I then have `sub slowgrep`, which should do the same thing that grep does, but in a more explicit way. `slowgrep` is built on `sub reduce`. This function is similar to the `reduce` you get in List::Util, but like Frank Sinatra, here, I did it my way. Could be it's not quite the same. I actually modeled it after the javascript reduce example code in the joelonsoftware essay Can Your Programming Language Do This?. Reduce is one of the basic building blocks of functional programming, and half of the google's celebrated MapReduce. This particular implementation of Reduce executes serially. Then I have the `sub fastgrep`, which should also do the same thing that grep does, but be faster because it uses some form of parallelization. Every line of this function is the same as slowgrep, except it is based on `threadedreduce` instead of `reduce`. `threadedreduce` should do the same thing as `reduce` except... it doesn't. As the tests show. In fact, it is just a stub, and actually it has forks not threads, but you get the idea. I think this function could be implemented either with forks, or with threads, or with POE, or whatever your favorite technique for cranking up parallelization in your environment. The point is that it lets programmers process lists in a parallel way while hiding away the complexity of the parallelized code. This is where I am hoping the monastery can step in and help me out. Can someone out there make `threadedreduce` work? `threadedreduce` could then be used to build threadedgrep, threadedUserAgent (same as UserAgent::Parallel, but you can WWW::Mechanize as the UA instead of the UA inherited from LWP::UserAgent)... or... well... lots of things. And in a way that's beautiful, functional, and maintainable. $ ./test_reduce.pl not ok 1 - parallel-y executing code works # Failed test 'parallel-y executing code works' # in ./test_reduce.pl at line 15. fast matches: $VAR1 = []; ok 2 - serially executing code works slow matches: $VAR1 = [ 'blee', 'blah', 'bloo' ]; 1..2 # Looks like you failed 1 test of 2. $ cat test_reduce.pl #!/usr/bin/perl use strict; use warnings; use Test::More qw( no_plan ); use Data::Dumper; use Parallel::ForkManager; my $slow_matches_b = sub { sleep 1; return 1 if $_[0] =~ /b/; }; my $test_strings = [ ('blee','blah','bloo', 'qoo', 'fwee' ) ]; my $fast_matches = fastgrep( $slow_matches_b, $test_strings ); ok( @$fast_matches, "parallel-y executing code works" ); print "fast matches: " . Dumper($fast_matches); # should dump out blee, blah bloo, but not fwee or qoo my $slow_matches = slowgrep( $slow_matches_b, $test_strings ); ok( @$slow_matches, "serially executing code works" ); print "slow matches: " . Dumper($slow_matches); sub fastgrep { my $test_function = shift; my $array = shift; my $grep_builder = sub { my $matches = shift; my $test_el = shift; push @$matches, $test_el if $test_function->($test_el); return $matches; }; return threadedreduce ( $grep_builder, $array, []) } sub slowgrep { my $test_function = shift; my $array = shift; my $grep_builder = sub { my $matches = shift; my $test_el = shift; push @$matches, $test_el if $test_function->($test_el); return $matches; }; return reduce( $grep_builder, $array, []) } # just a stub, hoping someone can help me with the threading sub threadedreduce { my $function = shift; my $array = shift; my $init = shift; my $pm=new Parallel::ForkManager(10); my $result = $init; for my $el ( @$array) { $pm->start and next; $result = $function->( $result, $el); $pm->finish; } $pm->wait_all_children; return $result; } sub reduce { my $function = shift; my $array = shift; my $init = shift; my $result = $init; for my $el ( @$array) { $result = $function->( $result, $el) } return $result; } [download] UPDATE: I'm actually thinking the promising avenue of approach may to use Parallel::Queue, along the lines diotalevi suggested yesterday in Re: using parallel processing to concatenate a string, where order of concatenation doesn't matter.	[reply] [d/l] [select]
Re: Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce? by harryf (Beadle) on Oct 20, 2006 at 13:10 UTC
Side note: there's another attempt at something like MapReduce in Perl called DMap, written by Brad Fitzpatrick and built on Gearman. As far as I know, not available in packaged form - some links; http://code.sixapart.com/svn/gearman/trunk/doc/overview.txt http://brad.livejournal.com/2106943.html http://brad.livejournal.com/2210265.html That's distributed again though although I guess it would be easy to run the whole lot on a single system.	[reply]
Re: Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce? by chromatic (Archbishop) on Oct 20, 2006 at 18:46 UTC
Unless something has really changed within Perl recently, ithreads won't do what you want here, as they are not OS threads. You'll still have one process on one processor.	[reply]
Re^2: Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce? by tphyahoo (Vicar) on Oct 20, 2006 at 18:49 UTC
So maybe I would need to use forks, and store part of the computation on the hard drive, in some temp directory? That's ok by me, as long as the abstraction is hidden away and reduceForked( $functionBuilder, $arrayref, $initvalue ) always does what I mean. I want this. Bad :)	[reply]
Re: Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce? by graff (Chancellor) on Oct 20, 2006 at 17:46 UTC
Responding to your "uber-update"... Now, I don't have 80,000 machines. But I do have a single machine that can run multiple processes. But I hate writing code that does this, because threads are painful, forks are painful, you get race conditions, you have to use locks... Aside from the pains you cite, there is also the inescapable truth that if you try to do more with just the one machine, you will hit a limit -- a plateau -- beyond which further "parallelizing" will hurt rather than help. Whether your task is mainly i/o bound, or memory bound, or cpu bound, adding more instances of the task will, at some point, exacerbate the load on the given resource to the point where improvements are not only impossible, but negated. Maybe running three instances in parallel will be faster, overall, than running two-at-once then one, but maybe running four at once will be slower than running two parallelized pairs in sequence. YMMV.	[reply]
Re^2: Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce? by tphyahoo (Vicar) on Oct 20, 2006 at 17:51 UTC
Yes, no doubt this is all true. But my point is that by hiding my parallelization code in a threadedreduce, and putting the "meat" of my program in a "function builder" such as grepbuilder / mapbuilder / urlgrabbuilder -- that relies on threadedreduce or nonthreadedreduce -- I make experimentation easier, and maintenance easier. I also write code that is easily portable to running on more than one thread, or more than machine. Just change one line, and see if you get an improvement. If there's no improvement, or things actually get worse, switch back to nonthreadreduce / nondistributedreduce. These days, feels to me like I'm always second guessing myself with my thread code, rather than concentrating on the meat of my program.	[reply]
Re: Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce? by tphyahoo (Vicar) on Oct 22, 2006 at 19:26 UTC
I wrote Ira Woodhead (the author of MapReduce): Ira, I'm real interested by that module you recently posted. I am having to write a lot of code that works in a parallel way, and I hate it. What is most appealing to me about MapReduce is not so much that it allows you to split things up on a cluster (though that's great), but that the code itself is so clean. No locks, no guards, no name-your-poison. Just Reader/Map/Reduce. Simple. Clean. All the dirt is hidden away, where your rocket scientists can optimize things in your custom MapReduce code to their hearts content. I posted about this at Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce? What I'm wondering is if you could have this goodness working even when you just have one computer, by using threads/forks. I'm thinking ThreadedMapReduce / ForkedMapReduce. Maybe even all inheriting from your original module in a sane way. This would do the same thing that ParallelForkManager, and Thread::Queue, and countless other modules, but with a much cleaner functional interface. I'm thinking along the same lines as Joel in "Can Your Programming Language Do This?": http://www.joelonsoftware.com/items/2006/08/01.html Well, can perl do this? What do you think? :) Best, thomas. **************************************************** **************************************************** Ira Woodhead responded: Actually, yes! The version I'm currently getting ready to post has two config options "maps" and "reduces" which allow you to set the number of map task and reduce task processes per machine. To use it on a single machine you simply have a single-machine cluster. If you want threads, well, sorry but I'm not comfortable using perl threads, having experienced some pitfalls with them. Not to mention they require a recompile, and I really want to keep this simple to deploy. The relatively more expensive full forks are the way I'm going, at least for now. I'm glad there is some interest in this. I was quite surprised to find no effort underway besides Hadoop to implement this model. And yes, I did see JSpolsky's post about mapreduce, it's a nice way of explaining it. Currently I'm trying to figure out if there's a way to make the deployment even easier, and, more significantly, to somehow allow mapreduce operations to take place inside of a program, rather than just being a separate utility, so that multiple such operations can be used as part of a larger system. Right now it's one invocation of a command line "mr" program per operation, but wouldn't it be nice as more of a language feature? Anyway, I'll be uploading the next version soon. Let me know if you have any comments or ideas. Cheers! **************************************************** **************************************************** I responded: Ira, I think it's great what you're doing. The feature to run a controlled "cluster" on a single computer could potentially pay dividends for a lot of people, both in functionality and maintainability. The way I see it, it's basically threading/forking with the dirty bits hidden away, in a way that will scale. My thoughts / idea, and links to the evolution thereof, are well summarized in a perlmonks post I just made Using functional programming to reduce the pain of parallel-execution programming (with threads, forks, or name your poison) If I could plug MapReduce into the "hashmap_parallel" function builder in the Map.pm code there (buried in a readmore), I would have my pony. **************************************************** **************************************************** To sum it up, there should be a way of doing what I want pretty soon, using the option for a single machine cluster described by Ira above. Kudos to Ira for putting this together, and keeping the improvements coming.	[reply]