sousuffer has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,
I'm new to the forums, so I apologize if this question is a bit simple. I have an array of files stores in the variable @files. For each file, I would like to run a unix-based command line program called FastQC, which generates an output file. I have 5 nodes with 12 cores per node, so I would like to be able to utilize this power to run 50 instances of the FastQC program simultaneously as part of my perl pipeline.

Ideally, I would also like to grab the exit code or output for each instance, since redirecting the output to a file would add an extra scanning step to verify that all processes completed successfully - but if I have to do that, I will.

The below code runs the processes sequentially, producing the result and an ".out" file with the status of the run:

for my $file (@files) {
    my $fastqc_script = "$fastqc_folder/fastqc $input_folder/$file --outdir=$fastqc_output_folder > $fastqc_output_folder/$file.out";
    `$fastqc_script`;
}
The primary goal is to start the second run immediately after starting the first, and so on, rather than waiting for each run to finish. The secondary goal would be to collect the exit codes in an array as the runs complete. Any help would be greatly appreciated. Thank you very much in advance.

Re: Execution of parallel unix system commands
by atcroft (Abbot) on Jan 16, 2014 at 18:05 UTC

    I would look at Parallel::ForkManager. It can keep a specified number of processes running (you mentioned you would like 50 instances running simultaneously), and since version 0.7.6 you can return data to the parent when each child process finishes.
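
    Here is a minimal sketch of that approach, adapted to the command in the original post (untested; the $fastqc_folder, $input_folder, $fastqc_output_folder, and @files values below are placeholders standing in for whatever the question's pipeline sets up):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parallel::ForkManager;

    # placeholders; set these as in the original post
    my $fastqc_folder        = '/path/to/FastQC';
    my $input_folder         = '/path/to/input';
    my $fastqc_output_folder = '/path/to/output';
    my @files                = ('sample1.fastq', 'sample2.fastq');

    my $pm = Parallel::ForkManager->new(50);   # run at most 50 children at once

    my %exit_codes;
    $pm->run_on_finish(
        sub {
            my ($pid, $exit_code, $ident) = @_;
            $exit_codes{$ident} = $exit_code;  # $ident is the filename passed to start()
        }
    );

    for my $file (@files) {
        $pm->start($file) and next;            # parent: record the child, move on
        # child: run fastqc directly; list-form system() avoids the shell
        system("$fastqc_folder/fastqc", "$input_folder/$file",
               "--outdir=$fastqc_output_folder");
        $pm->finish($? >> 8);                  # exit the child with fastqc's status
    }
    $pm->wait_all_children;

    Note that fork()ed children all run on the machine the script is started on, so this spreads work across the 12 cores of one node; fanning out across all 5 nodes would need a batch scheduler or remote dispatch on top of it.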

    Hope that helps.

Re: Execution of parallel unix system commands
by ww (Archbishop) on Jan 16, 2014 at 17:16 UTC
    Sorry, I took a (mental) whack at this... and (naively) said -- "Wow, easy; it's nested-loop time: split the files up into five 10-member arrays, and then feed those to the 5 nodes, using code on those nodes to accept the feeds and assign them to cores."

    Then I tried translating that WAG into (pseudo)code.

    The translation step failed when I realized I'm missing some data -- which suggests that perhaps many of us are missing data needed to help. (See How do I post a question effectively? and On asking for help ).

    So, perhaps you'll speed us toward answers directed to the problem case, rather than the post, by providing a small sample of the filenames in @files, a description of your means of dispatching data from one node to another (presuming you have the necessary permissions), and some code to show us how you plan to "collect the exit codes in an array as they complete" (and state whether "complete" means 'as a node finishes its assigned work' or 'as each thread exits').

    Come, let us reason together: Spirit of the Monastery
Re: Execution of parallel unix system commands
by zentara (Cardinal) on Jan 17, 2014 at 12:52 UTC
    Here is some recently posted code along the lines atcroft suggested: Get return values from Parallel::ForkManager's forked children.
    #!/usr/bin/perl
    use 5.010;
    use strict;
    use warnings;
    use Parallel::ForkManager;
    #use Data::Printer;

    my $pm = Parallel::ForkManager->new(2);
    $pm->run_on_finish(
        sub {
            # result from the child will be passed as 6th arg to callback
            my $res = $_[5];
            # p $res;
            print "$res\n";
        }
    );

    for (1..3) {
        $pm->start and next;
        # from here and till $pm->finish child process running
        # do something useful and store result in $res
        my $res = { aaa => $_ };
        # this will terminate child and pass $res to parent process
        $pm->finish(0, $res);
    }
    $pm->wait_all_children;
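
    (For reference, the run_on_finish callback receives six arguments in order: pid, exit code, the identification string passed to start(), exit signal, a core-dump flag, and a reference to the returned data structure; the code above picks out the sixth.)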

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: Execution of parallel unix system commands
by Anonymous Monk on Jan 16, 2014 at 17:06 UTC
    I'd go for xargs -P here. Except that it can't do the per-file output redirection your command uses. Perhaps your fastqc tool has an option for that?
    # untested
    # launches up to 42 processes at a time
    # (the program path must be a plain string, not qw(), so $fastqc_folder interpolates)
    open my $xargs, '|-',
        qw(xargs -0 -P 42 -n 2), "$fastqc_folder/fastqc"
        or die $!;
    for my $file (@files) {
        # NUL-separate the arguments to match xargs -0
        print $xargs join("\000",
            "--outdir=$fastqc_output_folder",
            "$input_folder/$file"), "\000";
    }
    close $xargs or die $!;
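
    If you still want each run's output captured in its own .out file, one workaround (also untested; it assumes a POSIX sh is available) is to have xargs invoke a small sh -c wrapper that performs the redirection:

    # untested sketch: an sh -c wrapper does the per-file redirection;
    # xargs appends two args per invocation, which the wrapper sees as $1 and $2
    open my $xargs, '|-',
        qw(xargs -0 -P 42 -n 2),
        'sh', '-c',
        qq{"$fastqc_folder/fastqc" "\$1" --outdir="$fastqc_output_folder" > "$fastqc_output_folder/\$2.out"},
        'sh'
        or die $!;
    for my $file (@files) {
        # pass the full input path as $1 and the bare filename as $2
        print $xargs join("\000", "$input_folder/$file", $file), "\000";
    }
    close $xargs or die $!;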