Eyck has asked for the wisdom of the Perl Monks concerning the following question:

What I would like to accomplish: small programs running on various machines, accepting computing jobs.

This roughly fits what MPI and PVM are all about, but both (and other similar solutions) require relatively large pieces of software to be installed and rather lengthy configuration. (Also, I would expect problems when trying to run compute nodes on windoze machines.)

What I've got in mind is a simple executable, possibly without dependencies, that just listens on some TCP port, accepts jobs, and returns results.

This need not be fast or efficient; the main advantage should be ease of installation and configuration.

How do I think perl can solve this?

Well, the most important piece of the puzzle would be PAR, which would solve the 'wrapping the whole computing mess into a single file' problem.

And after that... at first I thought about Sandbox, but I think I'm too green to use it efficiently. Right now what I've got in mind is XML::RPC paired with eval. After it's all running smoothly I'll think about sandboxing and security...
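
Concretely, the simplest worker I can picture is something like the sketch below -- plain sockets instead of XML::RPC, an arbitrary port, and no sandboxing whatsoever, just to show the shape of the thing:

    # Minimal sketch of an eval-based worker: one job per connection,
    # a job is a chunk of Perl code, the reply is the Data::Dumper of
    # its result. No security, no sandboxing -- illustration only.
    use strict;
    use warnings;
    use IO::Socket::INET;
    use Data::Dumper;

    my $listen = IO::Socket::INET->new(
        LocalPort => 4321,      # arbitrary port for this sketch
        Listen    => 5,
        Reuse     => 1,
    ) or die "listen: $!";

    while ( my $client = $listen->accept ) {
        local $/;                        # slurp the whole job;
        my $code   = <$client>;          # the client must shutdown() its write side
        my @result = eval $code;         # run the job -- trusted code only!
        print {$client} $@ ? "ERROR: $@" : Dumper( \@result );
        close $client;
    }

Something like that, wrapped up with PAR's pp (along the lines of pp -o worker worker.pl, filenames hypothetical), would be the whole installation story on a compute node.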

Maybe instead of fighting with secure sandboxing one should just stop thinking about it, set up some kind of access control and forget about it?

I looked around a bit, but I haven't seen anyone doing something like this. Is it because this problem is too big for a single busy programmer to handle on his own?

Re: Perl Grids
by Corion (Patriarch) on Sep 28, 2004 at 08:36 UTC

    The problem is less the complexity and more that I, at least, lack an application for it. I have some small toy code that essentially does the following:

    1. Connect to another machine (via ssh, rsh or telnet)
    2. Spawn perl -x there
    3. Pipe some small code there that does MIME-ish decoding of stuff read from STDIN and outputs MIME-ish delimited Data::Dumper elements to STDOUT

    With that code (about 50 lines or so), I get a remote Perl interpreter to which I can easily pass code to execute. The main ugliness is passing the parameters back, as you most likely want to produce output (array elements) as soon as they become available and not only when the remote function has returned completely.
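
    In outline (and glossing over the MIME-ish framing), the approach is roughly the sketch below -- passwordless ssh to a hypothetical host 'remote', plain Data::Dumper for the results, and one single-line job at a time:

        use strict;
        use warnings;
        use IPC::Open2;
        use IO::Handle;

        # The "agent" piped across: perl -x skips anything before the #!perl
        # line, compiles up to __END__, then reads jobs from the same stream
        # via the DATA handle, one line of Perl per job. (The heredoc body is
        # left unindented because the remote side needs #!perl and __END__ at
        # the start of a line.)
        my $agent = <<'AGENT';
#!perl
use Data::Dumper;
$| = 1;
while ( my $job = <DATA> ) {
    my @res = eval $job;
    print $@ ? "ERROR: $@" : Dumper( \@res );
    print "__DONE__\n";
}
__END__
AGENT

        # 1. connect, 2. spawn perl -x there, 3. pipe the agent across.
        my $pid = open2( my $from_remote, my $to_remote,
                         'ssh', 'remote', 'perl', '-x' );
        $to_remote->autoflush(1);
        print {$to_remote} $agent;

        # Send one job (a single line of Perl) and read its answer back.
        print {$to_remote} q{ map { $_ * $_ } 1 .. 10 }, "\n";
        while ( my $line = <$from_remote> ) {
            last if $line eq "__DONE__\n";
            print $line;
        }
        close $to_remote;
        waitpid $pid, 0;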

    The second ugly problem is of course collecting and coordinating multiple such processes, and for that I would either resort to a database for serialization (thus ditching my low-infrastructure approach) or turn to POE for anything but the simplest tasks; neither option is really nice.

    I originally developed that code to implement a Perlish rsync variant, so I only need more or less synchronous execution, but the packaging of input and output parameters, and an elegant way of actually writing the remote code, is what stalled development. Using B::Deparse lets me transfer trivial code over to the remote machine, so Perl catches syntax errors on the local side already; what remains, and what stops me, is some way of transferring data more efficiently via pack/unpack instead of Data::Dumper.
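
    The B::Deparse part, for reference, is only a few lines (a sketch; closures over lexicals and anything non-trivial need more care than this):

        use strict;
        use warnings;
        use B::Deparse;

        # Turn an already syntax-checked local sub back into source text
        # that can be shipped to the remote interpreter.
        my $work = sub { my ($n) = @_; return $n * $n };

        my $deparse = B::Deparse->new('-p');          # fully parenthesised output
        my $source  = 'sub ' . $deparse->coderef2text($work);

        print $source, "\n";   # this string is what goes over the wire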

    Update: Added last paragraph depicting the sad state of this idea

Re: Perl Grids
by BrowserUk (Patriarch) on Sep 28, 2004 at 10:32 UTC

    Personally, rather than a central application parcelling data up, dispatching it to one of a list of known workers, then polling the workers for their results before retrieving them and dispatching the next batch, I'd turn things around. Use the web server model--and a webserver too.

    Workers hit one page to get data and another page to return results. The central controller can be any webserver you like, from Apache2 to HTTP::Daemon.
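
    A bare-bones controller of that shape, using HTTP::Daemon, might look like the sketch below -- hand-rolled dispatch, hypothetical /work and /result paths, no authentication, just to show how little is needed:

        use strict;
        use warnings;
        use HTTP::Daemon;
        use HTTP::Response;

        my @work = ( '1..10', '11..20', '21..30' );   # whatever the jobs are
        my @results;

        my $d = HTTP::Daemon->new( LocalPort => 8080 ) or die "Cannot listen: $!";
        print 'Controller at ', $d->url, "\n";

        while ( my $c = $d->accept ) {
            while ( my $req = $c->get_request ) {
                if ( $req->method eq 'GET' and $req->uri->path eq '/work' ) {
                    # Hand out the next piece of work (empty body == nothing left).
                    my $job = @work ? shift @work : '';
                    $c->send_response( HTTP::Response->new( 200, 'OK', undef, $job ) );
                }
                elsif ( $req->method eq 'POST' and $req->uri->path eq '/result' ) {
                    # Collect whatever the worker sent back.
                    push @results, $req->content;
                    $c->send_response( HTTP::Response->new( 200, 'OK', undef, "thanks\n" ) );
                }
                else {
                    $c->send_error(404);
                }
            }
            $c->close;
        }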

    The workers could use LWP to fetch the work_to_do. The page they fetch would contain the data to process, and even an embedded code section to use to process it, plus a form on which to return the results. That easily allows for multiple uses and clean segregation of the returned data, as the code comes from the page fetched and the processed data is dispatched back to the address referenced by the form.
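
    Such a worker is little more than an LWP loop. In this sketch the fetched page body is simply a line of Perl that the worker evals, and the "form" side is a plain POST back to a hypothetical /result path:

        use strict;
        use warnings;
        use LWP::UserAgent;

        my $base = 'http://controller.example:8080';   # hypothetical controller
        my $ua   = LWP::UserAgent->new;

        while (1) {
            my $res = $ua->get("$base/work");
            unless ( $res->is_success and length $res->decoded_content ) {
                sleep 60;                              # nothing to do, try later
                next;
            }

            # The page body is Perl code to run -- which is why the server
            # had better be trusted (see the security notes below).
            my $job    = $res->decoded_content;
            my @answer = eval $job;

            # Return the results through the POST/form side of things.
            $ua->post( "$base/result", { result => "@answer", error => $@ } );
        }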

    Testing can be done by accessing the server using a standard web browser and pasting "processed data". If the application needs to deal with large volumes of return data, use POST instead of GET.

    The security of the application can use the standard webserver authentication mechanisms. Assuming that the server is properly controlled and vetted (I'm assuming a LAN/intranet environment), the code segments should be as secure as the server is. You can even use https to provide encryption if required.


Re: Perl Grids
by rcaputo (Chaplain) on Sep 28, 2004 at 22:31 UTC
Re: Perl Grids
by Anonymous Monk on Sep 28, 2004 at 10:47 UTC
    There are many implementations of this idea, all different because they are tailored to different needs: distcc (distributed compiling), parallel database servers, enterprise backup solutions, and SETI, to name a few. Some use a polling mechanism (clients ask the server for work - SETI, for instance), some use a pushing mechanism (distributed compiling - most of the time there won't be a compile job waiting), and some use something far more complex (parallel database servers).

    One piece of advice: slapping security on as an afterthought looks like a bad idea to me. I'd solve the security implications first and integrate them into the solution.

Re: Perl Grids
by bwelch (Curate) on Sep 28, 2004 at 16:18 UTC
    Some thoughts... Do you anticipate needing to run anything in parallel? If so, MPI / PVM might be worth investigating. If not, things can be simpler: if all you need to do is divide large jobs into small pieces, send them off to other systems (i.e. cluster or grid nodes) to run, and then collect the results, you end up with a system with parts to (a rough sketch of the splitting and tracking follows below):
    • Take a large job and divide it into N small jobs
    • Keep track of job status, centrally or distributed. This could be database tables and/or a central process.
    • Queue jobs for submission to nodes and take appropriate action when they are completed. If results are stored centrally in a database, this might mean only monitoring node status and tracking errors as well as completion of jobs.
    It's not impossible for one person, but the complexity can be high and debugging distributed systems can be tricky.
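
    For the divide-and-track parts the core logic is small, even if the operational side is not; a rough sketch, with a plain hash standing in for the status tables mentioned above:

        use strict;
        use warnings;

        # Divide one large job (a list of work items) into N small jobs,
        # round-robin style.
        sub split_job {
            my ( $items, $n_pieces ) = @_;
            my @pieces;
            push @{ $pieces[ $_ % $n_pieces ] }, $items->[$_] for 0 .. $#$items;
            return @pieces;
        }

        my @items  = ( 1 .. 100 );                 # stand-in for the real work
        my @pieces = split_job( \@items, 4 );
        my %status = map { $_ => 'queued' } 0 .. $#pieces;

        # A node picks up piece 2 and later reports completion.
        $status{2} = 'running';
        $status{2} = 'done';

        printf "piece %d: %d items, %s\n",
            $_, scalar @{ $pieces[$_] }, $status{$_} for 0 .. $#pieces;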

    A commercial product called LSF (www.platform.com) does much of this and works well with cluster applications. With it, one may create queues for specific systems or sets of systems, submit jobs, and monitor them.

    For grid versus cluster applications, all this gets fuzzier. Are the grid systems shared? How reliable are they? How much redundancy is needed?

    Like I said, these are some thoughts and questions. Hope it's useful. :)