Google's MapReduce

Replies are listed 'Best First'.
Re: Google's MapReduce by Anonymous Monk on Oct 27, 2004 at 09:10 UTC
Not quite sure what you mean by the "basic functions", but `map` is a Perl primitive, and `reduce` is found in `List::Util`. But the paper isn't about map or reduce. It's about distributing work over a large set of unreliable (read 'cheap') computers. Since the work is distributed over stand-alone computers, which sit in a ethernet network it means you don't have any significant I/O. (Oh, sure, you've got lots of TCP/IP bandwidth, but that's no where near enough to share memory (or even disks)). This means you can only successfully distribute work that can be divided into parts, where each part can be worked on without needed (intermediate) results of works on other parts. That's where the map/reduce comes in. It separates the work that can be done independently (map) from work that can't be done independently (reduce). For instance, calculating the total number of links in a set of documents. Parsing each document and calculating the number of links in a single document can be done independently from parsing any other document (map phase). Note also that parsing a document is a task that can easily be restarted, without having to undo any other work - an important aspect for google, due to the nature of their infrastructure. The summing of all individual counts (reduce phase) can't be done independently (cause you need all results), but that's a relative small task compared to the parsing. (But you might recurse and sum the totals of groups of documents, then collect the results). I've no doubt that you can write the implementation of MapReduce in Perl. (No doubt countless many people have already implemented this on a much smaller scale using threads on a single computer, but without the tolerance of failing components). I'm not sure if you want to though.	[reply] [d/l] [select]
Re: Google's MapReduce by radiantmatrix (Parson) on Oct 27, 2004 at 16:09 UTC
I don't know in which programming language MapReduce was written On page 13 of the PDF you linked to, I find: `#include "mapreduce/mapreduce.h"` [download] So I'm guessing one of the C variants. could you write the basic functions in Perl? Well, yes. Nothing about Perl limits your ability to write MapReduce: you have robust I/O, complex computational ability, networking capabilities, and comprehensive tools to manipulate data structures. I think the real question, given the low-level and high-performance nature of some of the MapReduce requirements, is "would a Perl version of this be fast enough?" Perl's runtime compiler and VM are pretty damn good; but, they still can't (and shouldn't try to) compete with C for low-level system operations. It is those kinds of operations that make MapReduce efficient for the volumes of data Google deals with. Of course, if you needn't process terabytes of data in one go, Perl might perform acceptably, and you'd have all of Perl's advantages in exchange for waiting a bit. I think it's a worthy project to implement something like this in Perl. At the very least, it would be an interesting benchmark opportunity. Perl is a great language for prototyping, so even if it's too slow for practical use, having an open Perl implementation of MapReduce provides a great reference for porting to "faster" languages. radiantmatrix `require General::Disclaimer;` "Users are evil. All users are evil. Do not trust them. Perl specifically offers the -T switch because it knows users are evil." - japhy	[reply] [d/l]
Re^2: Google's MapReduce by Fletch (Bishop) on Oct 27, 2004 at 16:38 UTC
Erm, read harder guys. :) Section 2.1 third paragraph. . . . The user's code is linked together with the MapReduce library (implemented in C++). . . .	[reply]
Re^3: Google's MapReduce by radiantmatrix (Parson) on Nov 01, 2004 at 15:15 UTC
Read?! But that would mean I wasn't exercising the Noble Virtues of laziness (why read all of something?), impatience (I don't have time to read 13 pages of detail, just give me the code!), and hubris (besides, I already know everything.). ;-P. j/k, of course... radiantmatrix `require General::Disclaimer;` "Users are evil. All users are evil. Do not trust them. Perl specifically offers the -T switch because it knows users are evil." - japhy	[reply]
Re: Google's MapReduce by SpanishInquisition (Pilgrim) on Oct 27, 2004 at 13:18 UTC
I don't know in which programming language MapReduce was written, but could you write the basic functions in Perl? It's Turing Complete.	[reply]
Re^2: Google's MapReduce by hardburn (Abbot) on Oct 27, 2004 at 13:34 UTC
TC deals with computation in terms of functions. Anything with I/O is pretty much outside Turing's view. And you'll be doing a good deal of I/O in a distributed application like this. So just being TC isn't good enough. Just how much I/O you'll be doing depends on your application. There are some problems that could get sufficient bandwidth by having an intern load data off a floppy. Others are going to need high-speed fiber optic connections in order to keep up. Some problems are going to be just plain slower than doing it on a single machine. In any case, you could certainly do this with Perl. Would it be useful? If the application's bottleneck is I/O, then Perl would probably be a viable choice. However, good candidates for distributed systems are usually not I/O-bound. They're CPU-bound, like "take this DES-encrypted message and try decrypting it with keys `x` through `x + 2*y`, and let me know if any of them break the message". For something like that, you want a good number-cruncher language like C or FORTRAN. "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni*.	[reply] [d/l] [select]
Re^3: Google's MapReduce by Anonymous Monk on Oct 27, 2004 at 23:58 UTC
If the application's bottleneck is I/O, then Perl would probably be a viable choice. That depends on what kind of I/O we are talking about. There are various kinds of I/O, of which I will mention three: Network I/O. The least interesting category. If your application is bound by network I/O, there isn't much you can do except upgrade your network. Disk I/O. Interesting category, and which brings us in the realm of SANs, fibre-channel and multiple controllers. You're right that Perl might be a good for those applications. CPU-Memory I/O. Perl would absolutely suck for those kind of applications, as Perl is very memory hungry, and gives the programmer very little control over what is stored where. It uses gazillion pointers, storing stuff all over the place, resulting in a low cache hit ratio.	[reply]


Perl Monk, Perl Meditation
	PerlMonks