As for the subtask of running 8 parallel processes per machine, Parallel::ForkManager might be easier to get started with, as it would largely hide the lower-level fork/exec/wait details from you.
(And in case you want to learn how things work under the hood, take a look at the module's source — it's just ~150 lines of code, and not all that difficult to understand.)
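To make the suggestion concrete, here is a minimal sketch of the usual Parallel::ForkManager loop, capped at 8 workers as in the original question. `process_file()` is a hypothetical placeholder for whatever per-file work you need to do.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# Fan out up to 8 workers, one per input file.
my $pm = Parallel::ForkManager->new(8);

for my $file (@ARGV) {
    $pm->start and next;    # parent: returns child's PID, moves to next file
    process_file($file);    # child: returns 0, does the work
    $pm->finish;            # child exits here
}
$pm->wait_all_children;     # parent blocks until every child is reaped

# Hypothetical placeholder for your per-file processing.
sub process_file {
    my ($file) = @_;
    # ... open $file, apply your regexes, write results ...
}
```

The `$pm->start and next` idiom is the heart of it: in the parent, `start` returns the child's PID (true), so the parent skips the work and continues the loop; in the child it returns 0, so the child falls through to the work and then exits via `finish`.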
| [reply] |
I definitely agree...
For its apparent complexity, Parallel::ForkManager was, I thought, surprisingly easy to get working. The example code on the module's page was helpful, and the explanation of the methods was more than sufficient. I'd recommend that module as well.
The number of forked processes to run at one time is easy to set: the maximum process count is the argument to the constructor. Just calculate it ahead of time and drop it in. Much simpler than managing the worker count manually from the parent process.
my $max_procs = 8;
my $pm = Parallel::ForkManager->new($max_procs);
You can also register a run_on_finish callback for post-processing code that runs in the parent as each child exits, and wait_all_children forces the parent to wait for all processes to complete before continuing.
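A short sketch of those two methods together. The callback signature (PID first, exit code second) matches the module's documented interface; `do_work()` is a hypothetical stand-in for the real job.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(8);

# run_on_finish fires in the parent each time a child is reaped;
# the callback receives the child's PID and its exit code.
$pm->run_on_finish(sub {
    my ($pid, $exit_code) = @_;
    warn "child $pid failed with exit code $exit_code\n" if $exit_code;
});

for my $n (1 .. 20) {
    $pm->start and next;        # parent continues the loop
    my $ok = do_work($n);       # child does the job
    $pm->finish($ok ? 0 : 1);   # exit code is handed to run_on_finish
}
$pm->wait_all_children;         # parent waits for every child

# Hypothetical job; replace with your real per-task work.
sub do_work { my ($n) = @_; return 1 }
```

Passing a status through `finish` is a cheap way to get per-child success/failure back to the parent without any shared state.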
HTH. | [reply] [d/l] |
You could use Parallel::ForkManager, which was designed for such cases.
But are you sure you will get any speed-up from this? If processing a text file is not much work (e.g. just a regex applied to every line), your program will spend most of its time waiting for the hard disk. Whether it waits in one process or in several won't change anything about its speed (if all machines and processes access the same data pool/hard disk).
One way to find out is to run your script as a single process and measure the time. Then run it again against a single copy of the text held in memory (no reloading from disk) the same few thousand times. Only the time the second run takes can be reduced by parallelizing.
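The in-memory measurement can be sketched with the core Time::HiRes module. `process_line()` is a hypothetical stand-in for your real per-line work, and the data here is synthetic.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical per-line work, e.g. a regex match.
sub process_line { my ($line) = @_; return $line =~ /foo/ }

# Stand-in for the file's contents, held entirely in memory
# so the timing below measures CPU work only, not disk I/O.
my @lines = ("foo bar\n") x 10_000;

my $t0 = [gettimeofday];
process_line($_) for @lines;
my $cpu_time = tv_interval($t0);

printf "CPU-only pass took %.4f s\n", $cpu_time;
# Compare this against the full run that reads from disk: if the
# CPU-only time is a small fraction of it, the job is I/O-bound
# and forking more workers won't buy you much.
```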
| [reply] |
Thanks all for the answers. The pointers were very helpful. I'm still working on it, but I'm beginning to get somewhere.
| [reply] |