Dear monks,
I am essentially writing a Perl script that splits a large input file for a text processing tool, so that I can process the files faster. I am working on a CentOS 6 based cluster, where each CPU has 16 cores. My idea is to split the input file into 16 parts, run 16 instances of the text processing tool, and once all of them are done, parse the output and merge it into a single file. The script then continues with the next input file in the same way. I have achieved that using fork(), wait() and exec() as follows (omitting code that is not relevant):
use strict;
use warnings;
use POSIX ":sys_wait_h";

# Split input files into parts and store the filenames in array @parts
...

my %children;
foreach my $part (@parts) {
    my $pid = fork();
    die "Cannot fork for $part\n" unless defined $pid;
    if ($pid == 0) {
        exec("sh text_tool $part > $part.out")
            or die "Cannot exec $part\n";
    }
    print STDERR "Started processing $part with $pid at " . localtime . "\n";
    $children{$pid} = $part;
}

while (%children) {
    my $pid = wait();
    die "$!\n" if $pid < 1;
    my $part = delete $children{$pid};
    print STDERR "Finished processing $part at " . localtime . "\n";
}
While I got what I wanted, there is a small problem. Due to the nature of the text processing tool, some parts complete much earlier than others, in no specific order. The difference is in hours, which means that many cores of the CPU sit idle for a long time, just waiting for a few parts to finish.
This is where I need help. I want to keep checking which part (or rather, which child process) has exited successfully, so that I can start processing the same part of the next input file on the freed core. I need your wisdom on how I can achieve this. I have searched a lot on various forums, but did not properly understand how this can be done.
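To make the goal concrete, here is a rough sketch of the behaviour I am after, built around the code above: instead of waiting for all 16 parts of one file, keep a queue of pending jobs and hand the freed slot to the next job as soon as any child exits. Note that split_file() here is just a placeholder for my splitting code, and the job queue is only illustrative:

use strict;
use warnings;

my @files = @ARGV;     # input files, processed in queue order
my @queue;             # pending jobs: [ input file, part filename ]
for my $file (@files) {
    push @queue, map { [ $file, $_ ] } split_file($file);   # split_file() is a placeholder
}

my %children;          # pid => job
my $max_workers = 16;

sub spawn_next {
    my $job = shift @queue or return;
    my ($file, $part) = @$job;
    my $pid = fork();
    die "Cannot fork for $part: $!\n" unless defined $pid;
    if ($pid == 0) {
        exec("sh text_tool $part > $part.out")
            or die "Cannot exec $part\n";
    }
    $children{$pid} = $job;
}

spawn_next() for 1 .. $max_workers;   # fill all 16 slots up front

while (%children) {
    my $pid = wait();                 # blocks until *any* child exits
    next if $pid < 1;
    my ($file, $part) = @{ delete $children{$pid} };
    print STDERR "Finished $part of $file at " . localtime . "\n";
    spawn_next();                     # reuse the freed core immediately
}

This way no core waits for the slowest part of the current file before the next file's parts can start.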
Thanks.
---------- UPDATE ----------
Using a hash, I can now find out which process is exiting when. But I fail to understand how to use this information in an if block, so that I can start the next process. Can someone help me with that? I have updated the code above accordingly.
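For instance, this is the kind of check I am trying to write, using the WNOHANG flag I already import from POSIX so the loop does not block. Here start_part_of_next_file() is just a placeholder for whatever starts the corresponding part of the next input file:

use POSIX ":sys_wait_h";

# Poll without blocking: reap any children that have already exited
for my $pid (keys %children) {
    if (waitpid($pid, WNOHANG) == $pid) {
        my $part = delete $children{$pid};
        print STDERR "Finished $part at " . localtime . "\n";
        start_part_of_next_file($part);   # placeholder for the next step
    }
}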
---------- UPDATE 2 ----------
I guess it's working now. Using Parallel::ForkManager and a hash of arrays that stores the PIDs for each input file, I am able to track the subprocesses of each file separately. By keeping a count of the number of subprocesses that have exited, I can call the sub for output parsing as soon as the count reaches 16 for an input file. I will come back if I run into any other problem.
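In outline, the working version looks something like this, where parts_of() and merge_output() stand in for my actual splitting and merging subs:

use strict;
use warnings;
use Parallel::ForkManager;

my @input_files = @ARGV;
my $pm   = Parallel::ForkManager->new(16);
my %done;                                  # input file => parts finished

# Runs in the parent each time a child is reaped; $ident is the
# identifier passed to start(), here the input file name.
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident) = @_;
    $done{$ident}++;
    merge_output($ident) if $done{$ident} == 16;   # stand-in for my merging sub
});

for my $file (@input_files) {
    for my $part (parts_of($file)) {       # stand-in for my 16-way split
        $pm->start($file) and next;        # parent continues the loop
        system("sh text_tool $part > $part.out");
        $pm->finish(0);                    # child exits here
    }
}
$pm->wait_all_children;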
Thanks a lot for all the help :)
P.S. Is there any flag that I have to set to mark this thread as answered/solved?