Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear wise Monks,
I hereby ask your wisdom regarding the following problem:
I have a Perl wrapper script that receives an input file with some protein sequences (I work in bioinformatics) and then calls a Java program that needs to run on each of the sequences separately (i.e., that piece of code cannot be parallelized; it was not written by me and I don't know Java). What I was thinking of doing, since I have a 20-core machine, is to split the big file into smaller ones, run the Java code in parallel, and then collect all the results in one file (the output file that the Perl wrapper will use to show to the user).
My question therefore is: is there a straightforward way to do this task? I am not very experienced with Perl, so if you would please be patient with me I would really appreciate it.

Replies are listed 'Best First'.
Re: Run a script in parallel mode
by Corion (Patriarch) on May 26, 2015 at 12:26 UTC

    The easiest way would be to split up the file and then use Parallel::ForkManager (on Unixish OSes) or threads to start multiple tasks via system. Also see runN by Dominus, which wraps this "run multiple programs through the shell" in a short script. Also see GNU parallel, which does the same.
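    For the Parallel::ForkManager route, a minimal sketch might look like the following. It assumes the big file has already been split into chunk files, and the java command line shown in the usage comment is a placeholder, not the OP's actual invocation:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;

# Run one command per chunk file, at most $max jobs at a time.
sub run_chunks {
    my ($chunks, $max, $cmd_for) = @_;
    my $pm = Parallel::ForkManager->new($max);
    for my $chunk (@$chunks) {
        $pm->start and next;                  # parent: spawn the next job
        system($cmd_for->($chunk)) == 0
            or warn "job for $chunk failed: $?";
        $pm->finish;                          # child exits
    }
    $pm->wait_all_children;
}

# Concatenate the per-chunk outputs in chunk order.
sub merge_outputs {
    my ($chunks, $merged) = @_;
    open my $out, '>', $merged or die "open $merged: $!";
    for my $chunk (@$chunks) {
        open my $in, '<', "$chunk.out" or die "open $chunk.out: $!";
        print {$out} $_ while <$in>;
        close $in;
    }
    close $out;
}

# Usage (hypothetical command line):
#   my @chunks = glob "seq_chunk.*";
#   run_chunks(\@chunks, 20, sub { "java MyProg -f $_[0] > $_[0].out" });
#   merge_outputs(\@chunks, 'all_results.out');
```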

      If your javaprg reads a sequence from the command line:
      cat sequences | parallel javaprg --sequence {} > output
      If your javaprg reads a sequence from standard input (STDIN):
      cat sequences | parallel --pipe -N1 javaprg > output
      If your sequences are delimited by \n> and javaprg wants 10 records per run:
      cat sequences.fasta | parallel --recstart '>' --recend '\n' --pipe -N10 javaprg > output
      See also biostars.org/p/63816/
Re: Run a script in parallel mode
by salva (Canon) on May 26, 2015 at 12:45 UTC
    There should be a way to run the 20 instances of your Java program as different threads inside just one Java Virtual Machine (JVM) process.

    It may be more effective than 20 parallel JVM processes fighting for the same system resources (RAM+CPU), as Java is not very friendly in that regard.

    Update: There is Nailgun!

      You mean that there is a problem if I call the java code like 10 times simultaneously?
        Not a problem as such, but it might not be the most efficient way.
Re: Run a script in parallel mode
by marioroy (Prior) on May 26, 2015 at 14:52 UTC

    Update: Workers remove the input file after running.

    Update: Changed from FS to RS option.

    Update: The OP mentioned having a big file. Also, a sequence file. I added the FS option to chunk the input file by records, not by lines. This works quite well. A chunk size value of 100 means 100 records, not 100 lines.

    There are many possibilities with various modules on CPAN. Below, I describe a way using MCE. I can follow up with another post with a version that unlinks tmp files in order while running, if processing in the thousands.

    MCE::Signal provides a $tmp_dir location. MCE itself is a chunking engine, so each chunk comes with a chunk_id value. The sprintf is used mainly so that running cat *.out produces ordered output.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use MCE::Signal qw($tmp_dir);
    use MCE::Flow;

    my $proteinFile = shift;

    mce_flow_f {
        RS => "\n>", chunk_size => 100, max_workers => 20, use_slurpio => 1
    }, sub {
        my ($mce, $slurp_ref, $chunk_id) = @_;

        # pad with zeros -- 4 digits; e.g. 0001, 0002, ...
        $chunk_id = sprintf "%04d", $chunk_id;

        # create input file for java
        open my $out_fh, ">", "$tmp_dir/$chunk_id.in" or die "open: $!";
        print $out_fh $$slurp_ref;
        close $out_fh;

        # launch java
        system("java -Xmx300m java_code/ALPHAtest -a tables/A.TRAINED -e tables/E.TRAINED -c tables/conf.tat -f $tmp_dir/$chunk_id.in > $tmp_dir/$chunk_id.out");

        # unlink input file after running
        unlink "$tmp_dir/$chunk_id.in";

    }, $proteinFile;

    # the tmp_dir is removed automatically when the script terminates
    system("cd $tmp_dir; cat *.out");
      I get the error:
      MCE::Flow: (FS) is not a valid constructor argument
      when I try to execute:
      perl parallel.pl MYFILE

        I apologize for the error. The option was meant to be RS, not FS. Thank you for reporting the issue. Set the chunk size accordingly if giving this a try. I am not sure of the format of the sequence file, thus went with RS => "\n>".

Re: Run a script in parallel mode
by BrowserUk (Patriarch) on May 26, 2015 at 12:52 UTC

    What are the command line arguments for running the java program?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      It is a simple command like the following:
      system ("java -Xmx300m java_code/ALPHAtest -a tables/A.TRAINED -e tables/E.TRAINED -c tables/conf.tat -f $infile.TO_SCAN > $infile.RESULT");

        Then something simple like this might do:

        #! perl -slw
        use strict;

        my $proteinFile = shift;
        my $size  = -s $proteinFile;
        my $chunk = int( $size / 20 );

        open IN, '<', $proteinFile or die $!;

        my %procs;
        for my $n ( 1 .. 20 ) {
            open O, '>', "temp$n.in" or die $!;
            print O scalar <IN> while !eof( IN ) and tell( IN ) < ( $n * $chunk );
            close O;

            my $pid;
            if( $pid = fork() ) {
                ++$procs{ $pid };
            }
            elsif( defined $pid ) {
                exec "java -Xmx300m java_code/ALPHAtest -a tables/A.TRAINED -e tables/E.TRAINED -c tables/conf.tat -f temp$n.in > temp$n.out";
            }
            else {
                die "Fork failed";
            }
        }

        while( keys %procs ) {
            my $pid = wait;
            delete $procs{ $pid };
        }

        open O, '>', $proteinFile . '.out' or die $!;
        for my $n ( 1 .. 20 ) {
            open I, '<', "temp$n.out" or die $!;
            print O <I>;
            close I;
            unlink "temp$n.in", "temp$n.out";
        }
        close O;

        Note: That's untested.

        You might get a little more clever and use the piped-open to run the commands and read the output back directly into the parent for merging, but handling multiple concurrent input streams without mixing up the results gets messy.
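        For the curious, a sketch of that piped-open idea (the java command line in the usage comment is a placeholder). It sidesteps the stream-mixing problem by draining the children strictly in order, at the cost of the caveat noted in the comments:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Start every command via a piped open (each open forks a child),
# then drain the handles one at a time so the merged output stays
# in command order.  Caveat: a child that fills the OS pipe buffer
# blocks until the parent gets around to reading its handle, so
# later children can end up waiting on earlier ones.
sub run_and_merge {
    my @cmds = @_;
    my @fhs;
    for my $cmd (@cmds) {
        open my $fh, '-|', $cmd or die "cannot start '$cmd': $!";
        push @fhs, $fh;
    }
    my $merged = '';
    for my $fh (@fhs) {
        local $/;                      # slurp this child's whole output
        $merged .= <$fh> // '';
        close $fh;
    }
    return $merged;
}

# Usage (hypothetical):
#   print run_and_merge( map { "java ... -f temp$_.in" } 1 .. 20 );
```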


Re: Run a script in parallel mode
by sundialsvc4 (Abbot) on May 26, 2015 at 18:01 UTC

    I would also caution you first to test your ruling assumption: that 20 processes running in parallel actually will complete the total job faster. I am not so sure.

    In fact, I doubt it.

    You see, you talk about "a big file." That means I/O, and therefore a procedure which is likely to be fundamentally I/O-bound. The completion time of the procedure probably won't be bound by the speed of the CPU, nor the availability of cores. Instead, it will be bound by how fast the I/O subsystem can move data into and out of the computer's memory. (As a simple test, run the time command on the existing Java program and compare the wall-time to the CPU-time. I'll wager that the CPU-time is much smaller, which means the process spends most of its time waiting for an I/O operation to complete.)
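    That wall-time versus CPU-time check can also be scripted from Perl, along these lines (a sketch; the command to profile is a placeholder):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Run a command and report wall-clock time versus the CPU time it
# consumed (times() includes reaped children, and system() reaps).
# Wall time much larger than CPU time suggests an I/O-bound (or
# otherwise waiting) job; similar values suggest CPU-bound.
sub profile_cmd {
    my ($cmd) = @_;
    my $t0 = [gettimeofday];
    system($cmd);
    my $wall = tv_interval($t0);
    my (undef, undef, $cuser, $csys) = times;   # child user + system CPU
    my $cpu = $cuser + $csys;
    printf "wall: %.2fs  cpu: %.2fs\n", $wall, $cpu;
    return ($wall, $cpu);
}

# e.g. profile_cmd('java -Xmx300m java_code/ALPHAtest ... -f FILE');
```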

    The simplest test would be to do this: open up four or five shell-command windows, make four or five identical copies of your test file, start the same program in all the windows, and start your stopwatch. If you discover that all five instances, running in parallel on the same data, complete in about the same amount of time as a single instance run by itself, then it might be profitable to pursue (and implement) your theory. (As a further test, split the file into five pieces, by whatever means, and run the test again. All five of them, running in parallel, should complete in one-fifth the time or less.)

    If you don't clearly see such results (and I predict that you will not), then save your effort. The odds are against that effort being profitably spent, IMHO, and if that is the case, better to find out sooner rather than later.

      MCE applies "graceful" IO while reading input: only a single worker reads at any given time. This allows for sequential IO, which is typically faster than random IO, especially on mechanical drives.