Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear wise Monks,
I hereby ask your wisdom regarding the following problem:
I have a Perl wrapper script that receives an input file with some protein sequences (I work in bioinformatics) and then calls a Java program that needs to run on each of the sequences separately (i.e., that piece of code cannot be parallelized; it was not written by me and I don't know Java). What I was thinking of doing, since I have a 20-core machine, is to split the big file into smaller ones, run the Java code in parallel, and then collect all the results in one file (the output file that the Perl wrapper will use to show to the user).
My question therefore is: is there a straightforward way to do this task? I am not very experienced with Perl, so if you would please be patient with me I would really appreciate it.

Replies are listed 'Best First'.
Re: Run a script in parallel mode
by Corion (Patriarch) on May 26, 2015 at 12:26 UTC

    The easiest way would be to split up the file and then use Parallel::ForkManager (on Unixish OSes) or threads to start multiple tasks via system. Also see runN by Dominus, which wraps this "run multiple programs through the shell" in a short script. Also see GNU parallel, which does the same.
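    For the Parallel::ForkManager route, a minimal sketch might look like the following. It assumes the big file has already been split into chunk files, and the java command line shown in the usage comment is a placeholder, not the OP's actual invocation:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;

# Run one command per chunk file, at most $max jobs at a time.
sub run_chunks {
    my ($chunks, $max, $cmd_for) = @_;
    my $pm = Parallel::ForkManager->new($max);
    for my $chunk (@$chunks) {
        $pm->start and next;                  # parent: spawn the next job
        system($cmd_for->($chunk)) == 0
            or warn "job for $chunk failed: $?";
        $pm->finish;                          # child exits
    }
    $pm->wait_all_children;
}

# Concatenate the per-chunk outputs in chunk order.
sub merge_outputs {
    my ($chunks, $merged) = @_;
    open my $out, '>', $merged or die "open $merged: $!";
    for my $chunk (@$chunks) {
        open my $in, '<', "$chunk.out" or die "open $chunk.out: $!";
        print {$out} $_ while <$in>;
        close $in;
    }
    close $out;
}

# Usage (hypothetical command line):
#   my @chunks = glob "seq_chunk.*";
#   run_chunks(\@chunks, 20, sub { "java MyProg -f $_[0] > $_[0].out" });
#   merge_outputs(\@chunks, 'all_results.out');
```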

      If your javaprg reads a sequence from the command line:
      cat sequences | parallel javaprg --sequence {} > output
      If your javaprg reads a sequence from standard input (STDIN):
      cat sequences | parallel --pipe -N1 javaprg > output
      If your sequences are delimited by \n> and javaprg wants 10 records per run:
      cat sequences.fasta | parallel --recstart '>' --recend '\n' --pipe -N10 javaprg > output
      See also biostars.org/p/63816/
Re: Run a script in parallel mode
by salva (Canon) on May 26, 2015 at 12:45 UTC
    There should be a way to run the 20 instances of your Java program as different threads inside just one Java Virtual Machine (JVM) process.

    It may be more effective than 20 parallel JVM processes fighting for the same system resources (RAM+CPU), as Java is not very friendly in that regard.

    Update: There is Nailgun!

      You mean that there is a problem if I call the java code like 10 times simultaneously?
        Not a problem as such, but it might not be the most efficient way.
Re: Run a script in parallel mode
by marioroy (Prior) on May 26, 2015 at 14:52 UTC

    Update: Workers remove the input file after running.

    Update: Changed from FS to RS option.

    Update: The OP mentioned having a big file. Also, a sequence file. I added the FS option to chunk the input file by records, not by lines. This works quite well. A chunk size value of 100 means 100 records, not 100 lines.

    There are many possibilities with various modules on CPAN. Below, I describe a way using MCE. I can follow up with another post with a version that unlinks tmp files in order while running, if processing in the thousands.

    MCE::Signal provides a $tmp_dir location. MCE itself is a chunking engine, so each chunk comes with a chunk_id value. The sprintf is used mainly so that running cat *.out produces ordered output.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use MCE::Signal qw($tmp_dir);
    use MCE::Flow;

    my $proteinFile = shift;

    mce_flow_f {
        RS => "\n>", chunk_size => 100, max_workers => 20, use_slurpio => 1
    }, sub {
        my ($mce, $slurp_ref, $chunk_id) = @_;

        # pad with zeros -- 4 digits; e.g. 0001, 0002, ...
        $chunk_id = sprintf "%04d", $chunk_id;

        # create input file for java
        open my $out_fh, ">", "$tmp_dir/$chunk_id.in" or die "open: $!";
        print $out_fh $$slurp_ref;
        close $out_fh;

        # launch java
        system("java -Xmx300m java_code/ALPHAtest -a tables/A.TRAINED -e tables/E.TRAINED -c tables/conf.tat -f $tmp_dir/$chunk_id.in > $tmp_dir/$chunk_id.out");

        # unlink input file after running
        unlink "$tmp_dir/$chunk_id.in";

    }, $proteinFile;

    # the tmp_dir is removed automatically when the script terminates
    system("cd $tmp_dir; cat *.out");
      I get the error:
      MCE::Flow: (FS) is not a valid constructor argument
      when I try to execute:
      perl parallel.pl MYFILE

        I apologize for the error. The option was meant to be RS, not FS. Thank you for reporting the issue. Set the chunk size accordingly if giving this a try. I am not sure of the format of the sequence file, thus went with RS => "\n>".

Re: Run a script in parallel mode
by BrowserUk (Patriarch) on May 26, 2015 at 12:52 UTC

    What are the command line arguments for running the java program?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      It is a simple command like the following:
      system ("java -Xmx300m java_code/ALPHAtest -a tables/A.TRAINED -e tables/E.TRAINED -c tables/conf.tat -f $infile.TO_SCAN > $infile.RESULT");

        Then something simple like this might do:

        #! perl -slw
        use strict;

        my $proteinFile = shift;
        my $size  = -s $proteinFile;
        my $chunk = int( $size / 20 );

        open IN, '<', $proteinFile or die $!;

        my %procs;
        for my $n ( 1 .. 20 ) {
            open O, '>', "temp$n.in" or die $!;
            print O scalar <IN> while !eof( IN ) and tell( IN ) < ( $n * $chunk );
            close O;

            my $pid;
            if( $pid = fork() ) {
                ++$procs{ $pid };
            }
            elsif( defined $pid ) {
                exec "java -Xmx300m java_code/ALPHAtest -a tables/A.TRAINED -e tables/E.TRAINED -c tables/conf.tat -f temp$n.in > temp$n.out";
            }
            else {
                die "Fork failed";
            }
        }

        while( keys %procs ) {
            my $pid = wait;
            delete $procs{ $pid };
        }

        open O, '>', $proteinFile . '.out' or die $!;
        for my $n ( 1 .. 20 ) {
            open I, '<', "temp$n.out" or die $!;
            print O <I>;
            close I;
            unlink "temp$n.in", "temp$n.out";
        }
        close O;

        Note: That's untested.

        You might get a little more clever and use the piped-open to run the commands and read the output back directly into the parent for merging, but handling multiple concurrent input streams without mixing up the results gets messy.
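        For the curious, a sketch of that piped-open idea (the java command line in the usage comment is a placeholder). It sidesteps the stream-mixing problem by draining the children strictly in order, at the cost of the caveat noted in the comments:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Start every command via a piped open (each open forks a child),
# then drain the handles one at a time so the merged output stays
# in command order.  Caveat: a child that fills the OS pipe buffer
# blocks until the parent gets around to reading its handle, so
# later children can end up waiting on earlier ones.
sub run_and_merge {
    my @cmds = @_;
    my @fhs;
    for my $cmd (@cmds) {
        open my $fh, '-|', $cmd or die "cannot start '$cmd': $!";
        push @fhs, $fh;
    }
    my $merged = '';
    for my $fh (@fhs) {
        local $/;                      # slurp this child's whole output
        $merged .= <$fh> // '';
        close $fh;
    }
    return $merged;
}

# Usage (hypothetical):
#   print run_and_merge( map { "java ... -f temp$_.in" } 1 .. 20 );
```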


Re: Run a script in parallel mode
by sundialsvc4 (Abbot) on May 26, 2015 at 18:01 UTC

    I would also caution you first to test your ruling assumption: that 20 processes running in parallel actually will complete the total job faster. I am not so sure.

    In fact, I doubt it.

    You see, you talk about "a big file." That means I/O, and therefore a procedure which is likely to be fundamentally I/O-bound. The completion time of the procedure probably won't be bound by the speed of the CPU, nor the availability of cores. Instead, it will be bound by how fast the I/O subsystem can move data into and out of the computer's memory. (As a simple test, run the time command on the existing Java program and compare the wall-time to the CPU-time. I'll wager that the CPU-time is much smaller, which means the process spends most of its time waiting for an I/O operation to complete.)
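    That wall-time versus CPU-time check can also be scripted from Perl, along these lines (a sketch; the command to profile is a placeholder):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Run a command and report wall-clock time versus the CPU time it
# consumed (times() includes reaped children, and system() reaps).
# Wall time much larger than CPU time suggests an I/O-bound (or
# otherwise waiting) job; similar values suggest CPU-bound.
sub profile_cmd {
    my ($cmd) = @_;
    my $t0 = [gettimeofday];
    system($cmd);
    my $wall = tv_interval($t0);
    my (undef, undef, $cuser, $csys) = times;   # child user + system CPU
    my $cpu = $cuser + $csys;
    printf "wall: %.2fs  cpu: %.2fs\n", $wall, $cpu;
    return ($wall, $cpu);
}

# e.g. profile_cmd('java -Xmx300m java_code/ALPHAtest ... -f FILE');
```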

    The simplest test would be to do this: open up four or five shell-command windows, make four or five identical copies of your test file, start the same program in all the windows, and start your stopwatch. If you discover that all five instances, running in parallel on the same data, complete in about the same amount of time as a single instance run by itself, then it might be profitable to pursue (and implement) your theory. (As a further test, split the file into five pieces, by whatever means, and run the test again. All five of them, running in parallel, should complete in one-fifth the time or less.)

    If you don't clearly see such results (and I predict that you will not), then save your effort. The odds are against that effort being profitably spent, IMHO, and if that is the case, better to find out sooner rather than later.

      MCE applies "graceful" IO while reading input: only a single worker reads at any given time. This allows for sequential IO, which is typically faster than random IO, especially on mechanical drives.