Dear monks,
I am essentially writing a Perl script that splits a large input file for a text processing tool, so that I can process the files faster. I am working on a CentOS 6 based cluster, where each CPU has 16 cores. My idea is to split the input file into 16 parts, run 16 instances of the text processing tool, and once all of them are done, parse the output and merge it into a single file. The script then continues with the next input file in the same way. I have achieved that using fork(), wait() and exec() as follows (omitting code that is not relevant):
use strict;
use warnings;
use POSIX ":sys_wait_h";

# Split input files into parts and store the filenames in array @parts
...

my %children;
foreach my $part (@parts) {
    my $pid = fork();
    die "Cannot fork for $part\n" unless defined $pid;
    if ($pid == 0) {
        exec("sh text_tool $part > $part.out")
            or die "Cannot exec $part\n";
    }
    print STDERR "Started processing $part with $pid at " . localtime . "\n";
    $children{$pid} = $part;
}

while (%children) {
    my $pid = wait();
    die "$!\n" if $pid < 1;
    my $part = delete $children{$pid};
    print STDERR "Finished processing $part at " . localtime . "\n";
}
While I got what I wanted, there is a small problem. Due to the nature of the text processing tool, some parts complete much earlier than others, in no specific order. The difference is in hours, which means that many cores of the CPU sit idle for a long time, just waiting for a few parts to finish.
This is where I need help. I want to keep checking which part (or rather, which child process) has exited successfully, so that I can start processing the same part of the next input file on the freed core. I need your wisdom on how I can achieve this. I have searched a lot on various forums, but did not properly understand how this can be done.
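To make the goal concrete, here is a rough sketch of the behaviour I am after, built around the code above: instead of waiting for all 16 parts of one file, keep a queue of pending jobs and hand the freed slot to the next job as soon as any child exits. Note that split_file() here is just a placeholder for my splitting code, and the job queue is only illustrative:

use strict;
use warnings;

my @files = @ARGV;     # input files, processed in queue order
my @queue;             # pending jobs: [ input file, part filename ]
for my $file (@files) {
    push @queue, map { [ $file, $_ ] } split_file($file);   # split_file() is a placeholder
}

my %children;          # pid => job
my $max_workers = 16;

sub spawn_next {
    my $job = shift @queue or return;
    my ($file, $part) = @$job;
    my $pid = fork();
    die "Cannot fork for $part: $!\n" unless defined $pid;
    if ($pid == 0) {
        exec("sh text_tool $part > $part.out")
            or die "Cannot exec $part\n";
    }
    $children{$pid} = $job;
}

spawn_next() for 1 .. $max_workers;   # fill all 16 slots up front

while (%children) {
    my $pid = wait();                 # blocks until *any* child exits
    next if $pid < 1;
    my ($file, $part) = @{ delete $children{$pid} };
    print STDERR "Finished $part of $file at " . localtime . "\n";
    spawn_next();                     # reuse the freed core immediately
}

This way no core waits for the slowest part of the current file before the next file's parts can start.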
Thanks.
---------- UPDATE ----------
Using a hash, I can now find out which process is exiting when. But I fail to understand how to use this information in an if block, so that I can start the next process. Can someone help me with that? I have updated the code above accordingly.
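For instance, this is the kind of check I am trying to write, using the WNOHANG flag I already import from POSIX so the loop does not block. Here start_part_of_next_file() is just a placeholder for whatever starts the corresponding part of the next input file:

use POSIX ":sys_wait_h";

# Poll without blocking: reap any children that have already exited
for my $pid (keys %children) {
    if (waitpid($pid, WNOHANG) == $pid) {
        my $part = delete $children{$pid};
        print STDERR "Finished $part at " . localtime . "\n";
        start_part_of_next_file($part);   # placeholder for the next step
    }
}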
---------- UPDATE 2 ----------
I guess it's working now. Using Parallel::ForkManager and a hash of arrays that stores the PIDs for each input file, I am able to track the subprocesses of each file separately. By keeping a count of the number of subprocesses that have exited, I can call the sub for output parsing as soon as the count reaches 16 for an input file. I will come back if I run into any other problem.
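In outline, the working version looks something like this, where parts_of() and merge_output() stand in for my actual splitting and merging subs:

use strict;
use warnings;
use Parallel::ForkManager;

my @input_files = @ARGV;
my $pm   = Parallel::ForkManager->new(16);
my %done;                                  # input file => parts finished

# Runs in the parent each time a child is reaped; $ident is the
# identifier passed to start(), here the input file name.
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident) = @_;
    $done{$ident}++;
    merge_output($ident) if $done{$ident} == 16;   # stand-in for my merging sub
});

for my $file (@input_files) {
    for my $part (parts_of($file)) {       # stand-in for my 16-way split
        $pm->start($file) and next;        # parent continues the loop
        system("sh text_tool $part > $part.out");
        $pm->finish(0);                    # child exits here
    }
}
$pm->wait_all_children;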
Thanks a lot for all the help :)
P.S. Is there any flag that I have to set to mark this thread as answered/solved?