Re: Run a script in parallel mode
by Corion (Patriarch) on May 26, 2015 at 12:26 UTC
The easiest way would be to split up the file and then use Parallel::ForkManager (on Unixish OSes) or threads to start multiple tasks via system. Also see runN by Dominus, which wraps this "run multiple programs through the shell" in a short script. Also see GNU parallel, which does the same.
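As a minimal sketch of the Parallel::ForkManager approach: the chunk file names and the stand-in command below are placeholders for illustration only (the real invocation would be the OP's javaprg), and $^X is used so the child runs the same perl as the parent.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;

# Cap concurrency at 4 children; each child runs one external command.
my $pm = Parallel::ForkManager->new(4);

for my $n ( 1 .. 8 ) {
    $pm->start and next;    # parent: records the child PID, moves on
    # child: stand-in for the real javaprg invocation on chunk $n
    system(qq{$^X -e "print 'processed chunk $n'" > chunk$n.out});
    $pm->finish;            # child exits here
}
$pm->wait_all_children;     # parent blocks until every child is reaped

# merge the per-chunk outputs in order, then clean up
for my $n ( 1 .. 8 ) {
    open my $fh, '<', "chunk$n.out" or die "chunk$n.out: $!";
    print <$fh>, "\n";
    close $fh;
    unlink "chunk$n.out";
}
```

The merge loop runs in the parent after all children have finished, so the combined output stays in chunk order regardless of which child completed first.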
If your javaprg reads a sequence from the command line:
cat sequences | parallel javaprg --sequence {} > output
If your javaprg reads a sequence from standard input (STDIN):
cat sequences | parallel --pipe -N1 javaprg > output
If your sequences are delimited by \n> and javaprg wants 10 records per run:
cat sequences.fasta | parallel --recstart '>' --recend '\n' --pipe -N10 javaprg > output
See also biostars.org/p/63816/
Re: Run a script in parallel mode
by salva (Canon) on May 26, 2015 at 12:45 UTC
There should be a way to run the 20 instances of your Java program as different threads inside a single Java Virtual Machine (JVM) process.
That may be more efficient than 20 parallel JVM processes fighting over the same system resources (RAM and CPU), as the JVM is rather heavyweight in that regard.
Update: There is Nailgun!
You mean there would be a problem if I call the Java code, say, 10 times simultaneously?
Not a problem as such, but it may not be the most efficient way.
Re: Run a script in parallel mode
by marioroy (Prior) on May 26, 2015 at 14:52 UTC
Update: Workers remove the input file after running.
Update: Changed from FS to RS option.
Update: The OP mentioned having a big file containing sequences. I added the FS option to chunk the input file by records, not by lines. This works quite well: a chunk size of 100 means 100 records, not 100 lines.
There are many possibilities with the various modules on CPAN. Below, I describe one way using MCE. I can follow up with another post showing a version that unlinks the tmp files in an orderly fashion while running, for jobs with thousands of chunks.
MCE::Signal provides a $tmp_dir location. MCE itself is a chunking engine, so each chunk comes with a chunk_id value. The sprintf is used mainly so that running cat *.out produces ordered output.
#!/usr/bin/env perl
use strict;
use warnings;

use MCE::Signal qw($tmp_dir);
use MCE::Flow;

my $proteinFile = shift;

mce_flow_f {
    RS => "\n>", chunk_size => 100, max_workers => 20, use_slurpio => 1
},
sub {
    my ($mce, $slurp_ref, $chunk_id) = @_;

    # pad with zeros -- 4 digits; e.g. 0001, 0002, ...
    $chunk_id = sprintf "%04d", $chunk_id;

    # create input file for java
    open my $out_fh, ">", "$tmp_dir/$chunk_id.in" or die "open: $!";
    print $out_fh $$slurp_ref;
    close $out_fh;

    # launch java
    system("java -Xmx300m java_code/ALPHAtest -a tables/A.TRAINED -e tables/E.TRAINED -c tables/conf.tat -f $tmp_dir/$chunk_id.in > $tmp_dir/$chunk_id.out");

    # unlink input file after running
    unlink "$tmp_dir/$chunk_id.in";

}, $proteinFile;

# the tmp_dir is removed automatically when the script terminates
system("cd $tmp_dir; cat *.out");
MCE::Flow: (FS) is not a valid constructor argument
when I try to execute:
perl parallel.pl MYFILE
I apologize for the error. The option was meant to be RS, not FS. Thank you for reporting the issue, though. Set the chunk size accordingly if giving this a try. I am not sure of the format of the sequence file; thus, I went with RS => "\n>".
Re: Run a script in parallel mode
by BrowserUk (Patriarch) on May 26, 2015 at 12:52 UTC
It is a simple command like the following:
system ("java -Xmx300m java_code/ALPHAtest -a tables/A.TRAINED -e tables/E.TRAINED -c tables/conf.tat -f $infile.TO_SCAN > $infile.RESULT");
#! perl -slw
use strict;

my $proteinFile = shift;
my $size  = -s $proteinFile;    # -s gives the file size in bytes
my $chunk = int( $size / 20 );

open IN, '<', $proteinFile or die $!;

my %procs;
for my $n ( 1 .. 20 ) {
    open O, '>', "temp$n.in" or die $!;
    # copy whole lines until the input position passes this chunk's boundary;
    # the last chunk drains whatever remains
    print O scalar <IN> while !eof( IN )
        and ( $n == 20 or tell( IN ) < $n * $chunk );
    close O;

    my $pid;
    if( $pid = fork() ) {
        ++$procs{ $pid };
    }
    elsif( defined $pid ) {
        exec "java -Xmx300m java_code/ALPHAtest -a tables/A.TRAINED -e tables/E.TRAINED -c tables/conf.tat -f temp$n.in > temp$n.out";
    }
    else {
        die "Fork failed";
    }
}
while( keys %procs ) {
    my $pid = wait;
    delete $procs{ $pid };
}

open O, '>', $proteinFile . '.out' or die $!;
for my $n ( 1 .. 20 ) {
    open I, '<', "temp$n.out" or die $!;
    print O <I>;
    close I;
    unlink "temp$n.in", "temp$n.out";
}
close O;
Note: that's untested.
You might get a little more clever and use a piped open to run the commands and read the output back directly into the parent for merging, but handling multiple concurrent input streams without mixing up the results gets messy.
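For a single child, the piped-open idea can be sketched like this (Unix-ish; the command is a harmless stand-in for the real java invocation, and the list form of open bypasses the shell):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Open a read pipe from a child process. $^X is the running perl;
# the -e one-liner stands in for the real external program.
open my $fh, '-|', $^X, '-e', 'print "result line $_\n" for 1 .. 3'
    or die "piped open failed: $!";

while ( my $line = <$fh> ) {
    print "parent got: $line";    # merge or inspect the child's output here
}
close $fh or warn "child exited with status $?";
```

Closing the pipe handle waits for the child, so $? holds its exit status afterwards; with several such handles open at once, the bookkeeping is exactly the messiness mentioned above.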
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Re: Run a script in parallel mode
by locked_user sundialsvc4 (Abbot) on May 26, 2015 at 18:01 UTC
I would also caution you to first test your ruling assumption: that 20 processes running in parallel will actually complete the total job faster. I am not so sure.
In fact, I don’t believe it.
You see, you talk about “a big file.” That means: I/O. Therefore, a procedure which is likely to be, “fundamentally, I/O-bound.” The completion time of the procedure probably won’t be bound by the speed of the CPU, nor by the availability of cores. Instead, it will be bound by how fast the I/O subsystem can move data into and out of the computer’s memory. (As a simple test, run the time command on the existing Java program, and compare the wall-time to the CPU-time. I’ll wager that the CPU-time is much smaller: the process spends most of its time waiting for I/O operations to complete.)
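The same wall-time vs. CPU-time comparison can be made from Perl itself. This is a rough sketch, with a sleeping one-liner as a placeholder for the real java command; the third and fourth fields of times() report the CPU seconds consumed by reaped child processes.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw(time);

my $wall0 = time;
# placeholder for the real java command: sleeps ~0.25s, using almost no CPU
system(qq{$^X -e "select undef, undef, undef, 0.25"});
my $wall = time - $wall0;

# times() returns (user, system, child-user, child-system) CPU seconds
my ( undef, undef, $cuser, $csys ) = times;
printf "wall %.2fs, child CPU %.2fs\n", $wall, $cuser + $csys;
# an I/O- or sleep-bound child shows CPU time well below wall time
```

If the child CPU figure is close to the wall figure instead, the job is CPU-bound and extra cores genuinely help.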
The simplest test would be to do this: open four or five shell windows, make four or five identical copies of your test file, start the same program in all of them, and start your stopwatch. If you discover that all five instances, running in parallel on the same data, each complete in about the same time as a single instance run alone, then it might be profitable to pursue (and to implement) your theory. (As a further test, split the file into five pieces, by whatever means, and run the test again. All five of them, running in parallel, should complete in one-fifth the time or less.)
If you don’t clearly see such results ... and I predict that you will not ... then, “save your effort.” The odds are not in your favor that your efforts will have been profitably spent, IMHO, and if this be the case, find out sooner rather than later.