lightoverhead has asked for the wisdom of the Perl Monks concerning the following question:

Hi. Does anyone have experience using parallel computing to run Perl scripts? When we deal with enormous data sets, if we can load the data into memory and use more CPUs to run the task, this will greatly speed up our work. However, I don't know how to assign a Perl task to multiple nodes so that it uses multiple cores and their memory. Generally, each core can only run a single task, i.e. a single Perl script. How could we split a single running Perl job across several cores and several nodes? Or could we? Thanks.

Replies are listed 'Best First'.
Re: Parallel computing with perl?
by Corion (Patriarch) on Oct 27, 2008 at 06:14 UTC

    If you want to spread out the work across multiple machines, look at GRID, at MPI, or at the "native" API that is used to start programs on multiple nodes.
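
    In the simplest "native" sense, when the nodes are ordinary Linux boxes reachable over ssh with key-based logins, you can just start a worker script on each node and collect its output through a pipe. A rough sketch (the node names, chunk files, and munge_chunk.pl worker script are all hypothetical):

        use strict;
        use warnings;

        my @nodes  = qw(node01 node02 node03);          # hypothetical node list
        my @chunks = qw(chunk01.dat chunk02.dat chunk03.dat);

        my @handles;
        for my $i (0 .. $#nodes) {
            # start the same worker script on each node; the ssh jobs run concurrently
            open my $fh, '-|', 'ssh', $nodes[$i], 'perl', 'munge_chunk.pl', $chunks[$i]
                or die "cannot start worker on $nodes[$i]: $!";
            push @handles, $fh;
        }

        # merge the results, one node at a time (fine for modest output;
        # large result sets are better written to a shared file system)
        for my $fh (@handles) {
            print while <$fh>;
            close $fh;
        }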

Re: Parallel computing with perl?
by Illuminatus (Curate) on Oct 27, 2008 at 03:30 UTC
    1. Can you provide a few more details on what your application is actually doing?
    2. What OS(es) do you plan to run on?
    3. How big is 'enormous'?
    Linux and most *nixes provide an API for tying a task (process or thread) to a specific processor, but most multi-CPU systems do a pretty good job of scheduling without much tweaking. The statement "load the data into memory ... greatly speed up our work" gives me pause, however. Unless the data is read-only, it can be very challenging to improve performance through parallel processing.

      1. It's actually data munging: comparing two huge datasets and operating on one, both, or several datasets according to some rules. The operation mostly involves searching for certain patterns in one file based on data in the others, and then doing some operation on the matching data.

      2. A Linux system.

      3. Each file is about 40 million lines (data points).

      I know very little about MPI or OpenMP. They can do such a job for C, but they need to be compiled. As for Perl, I really have no idea how to implement it.

      If someone could give me an example, it would be greatly appreciated.

        1. comparing two huge datasets and operating on one, both, or several datasets ... doing some operation on the matching data.

          Which is it? One, two or if more, how many more?

          Do those operations modify the original data?

          If so, do other concurrent processes need to see those changes as they happen?

        2. Each file is about 40 million lines (data points)

          And how long are those lines? I.e., what is the total size of the file(s)?

        Do the algorithms involved require random access to the entire dataset(s)? Or sequential access? Or random or sequential access to just some small subset for each top level iteration?

        All of these questions have a direct influence upon what techniques are applicable to your application (regardless of the language used). And each answer will probably lead to further questions.

        Your best hope of getting a good answer about the best way forward would be to describe the application in some detail, noting the volumes of data and the times taken by the existing serial processing. Posting an existing serial-processing script (or at least a fairly detailed pseudo-code description, if the application is overly large or proprietary) would be better still.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Parallel computing with perl?
by CountZero (Bishop) on Oct 27, 2008 at 06:26 UTC
    "we deal with enormous data sets" and "we can load the data into memory" seems not compatible with one another unless you have multiples of "enormous memory" available on your box! That is, unless the data is strictly read-only, all programs running on different cores can have access to the same data in memory (and by doing so does not make it "dirty" so it gets copied to their own storage) and each program does not use lots of working storage to keep (intermediary) results.

    If you need the different programs to communicate with one another, something like POE is perhaps a good start.
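
    For a sense of what that looks like, here is a minimal POE sketch of two sessions passing messages within a single process (the 'worker' alias and the process_chunk event are made-up names; POE itself is cooperative and single-process, so to occupy extra cores you would still pair it with fork or POE::Wheel::Run):

        use strict;
        use warnings;
        use POE;

        # A "worker" session that reacts to chunks of work sent to it.
        POE::Session->create(
            inline_states => {
                _start => sub {
                    $_[KERNEL]->alias_set('worker');   # reachable by name
                },
                process_chunk => sub {
                    my ($kernel, $chunk) = @_[KERNEL, ARG0];
                    # ... munge one chunk of data here ...
                    print "worker received chunk $chunk\n";
                },
            },
        );

        # A "boss" session that hands out the work.
        POE::Session->create(
            inline_states => {
                _start => sub {
                    $_[KERNEL]->post('worker', 'process_chunk', $_) for 1 .. 3;
                },
            },
        );

        POE::Kernel->run;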

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Parallel computing with perl?
by aquarium (Curate) on Oct 27, 2008 at 02:20 UTC
    First of all, if it has not already been done, the script should be optimized ... and the need for parallel computing may go away.
    There is already open source and commercial software that can run and co-ordinate parallel computing. Obviously you will need to cut up the task into separable chunks that can run independently, within the confines, and using the facilities, of the parallel scheduling system at hand.
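
    One way to cut a single huge file into separable chunks, whatever ends up scheduling them, is to divide it by byte offsets and round each boundary forward to the next newline so that no line is split. A rough sketch (the file name and chunk count are placeholders):

        use strict;
        use warnings;

        my $file   = 'huge.dat';
        my $chunks = 8;

        my $size = -s $file or die "cannot stat $file";
        open my $fh, '<', $file or die "$file: $!";

        my @boundaries = (0);
        for my $i (1 .. $chunks - 1) {
            seek $fh, int($size * $i / $chunks), 0;   # jump to an approximate offset
            <$fh>;                                    # discard the partial line
            push @boundaries, tell $fh;               # next chunk starts at a full line
        }
        push @boundaries, $size;

        # Each (start, end) pair can now go to a separate worker, which seeks
        # to its start offset and reads lines until tell() passes its end offset.
        for my $i (0 .. $chunks - 1) {
            printf "chunk %d: bytes %d .. %d\n", $i, $boundaries[$i], $boundaries[$i + 1];
        }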
    the hardest line to type correctly is: stty erase ^H
Re: Parallel computing with perl?
by perrin (Chancellor) on Oct 27, 2008 at 04:03 UTC
    Have you tried forking?
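
    A plain-fork sketch of that idea, using Perl's forking open('-|') so that each child gets a pipe back to the parent for its results (dividing the work by line number modulo the slice count is just one way to split it):

        use strict;
        use warnings;

        my $slices = 4;
        my @handles;

        for my $slice (0 .. $slices - 1) {
            my $pid = open my $fh, '-|';            # fork, with a pipe back to the parent
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {                        # child
                # ... process the lines where $. % $slices == $slice,
                #     printing any matches to STDOUT ...
                print "slice $slice done\n";        # stands in for real results
                exit 0;
            }
            push @handles, $fh;                     # parent keeps the read end
        }

        for my $fh (@handles) {                     # merge the children's output
            print while <$fh>;
            close $fh;
        }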
Re: Parallel computing with perl?
by casiano (Pilgrim) on Oct 27, 2008 at 16:58 UTC
    Have a look at the tutorial at GRID::Machine::perlparintro

    It can help if what you have available is a group of Linux nodes. A shared file system will be convenient, since the files are so huge.
    Use fork or Parallel::ForkManager if it is a multiprocessor (shared-memory) machine.
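
    As a minimal Parallel::ForkManager sketch for that shared-memory case, assuming the big file has already been split into chunk files on disk (the glob pattern, worker count, and munge_chunk routine are placeholders):

        use strict;
        use warnings;
        use Parallel::ForkManager;

        my @chunk_files = glob 'chunk_*.dat';        # pre-split pieces of the big file
        my $pm = Parallel::ForkManager->new(4);      # at most 4 workers at once

        for my $chunk (@chunk_files) {
            $pm->start and next;                     # parent: move on to the next chunk
            munge_chunk($chunk);                     # child: process one chunk
            $pm->finish;                             # child exits here
        }
        $pm->wait_all_children;

        sub munge_chunk {
            my ($file) = @_;
            # ... open $file, apply the matching rules, write results out ...
        }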

    Casiano