Being a relative newcomer to unix, I've found forking, threading, and parallelization versus serialization to be something of a murky black art. Only recently have I felt like I've begun to get a handle on things.

Anyway, I wrote a script to demonstrate the difference between running a series of five simple two-second commands "serially" (one after the other) versus running them in parallel. Serially, it takes ten seconds. In parallel, it takes two. So obviously, this forking stuff is something that can save time if mastered :)

To understand this script, it's important to know that in bash, you can fork a process by using the & character. There may be other ways to fork a process, but that's the one I use. Also, commands in ( ) run in a subshell, i.e. their own process. So if you have two commands, cmd1 and cmd2, and you want to run them in parallel you can do this with

( ( cmd1 ) & ); ( ( cmd2 ) & );

This general recipe can be applied to as many commands as you want, and the commands inside the parentheses can be arbitrarily complex. Since this transformation seemed like the kind of thing I might want to do more than once, I wrote a function to do it: parallelize_bash_commands in the script below. The bash command then gets run from the perl script using backticks. Simple :)
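
In a nutshell, the transformation looks like this (a stripped-down sketch of the same idea; the full script further down does it with a named sub and some error checking):

# join the commands into one bash string that backgrounds each one,
# then hand the whole thing to the shell via backticks
my @cmds = ('sleep 2', 'sleep 2', 'sleep 2');
my $parallel = join '', map { "( ( $_ ) & );" } @cmds;
print `$parallel`;    # the three sleeps run at the same time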

I am curious what the other monks think of this, and how they deal with this issue of forking to get speedup. I am sure there are modules on CPAN that accomplish this same kind of thing, and I am curious what is being used out there.

Anyway, I hope this simple little demo of forking and parallelization helps some beginners out there in perl land. And maybe in the responses I will learn of ways to accomplish this that are better than what I proposed.

Long live perl!

:)

The Output:

$./parallelize_em_demo.pl
touch a; ls -l a; sleep 2
-rw-r--r-- 1 hartmann users 0 2006-06-09 17:58 a
touch b; ls -l b; sleep 2
-rw-r--r-- 1 hartmann users 0 2006-06-09 17:58 b
touch c; ls -l c; sleep 2
-rw-r--r-- 1 hartmann users 0 2006-06-09 17:58 c
touch d; ls -l d; sleep 2
-rw-r--r-- 1 hartmann users 0 2006-06-09 17:58 d
touch e; ls -l e; sleep 2
-rw-r--r-- 1 hartmann users 0 2006-06-09 17:58 e

time elapsed serial: 10

( ( touch a; ls -l a; sleep 2 ) & );( ( touch b; ls -l b; sleep 2 ) & );( ( touch c; ls -l c; sleep 2 ) & );( ( touch d; ls -l d; sleep 2 ) & );( ( touch e; ls -l e; sleep 2 ) & );
-rw-r--r-- 1 hartmann users 0 2006-06-09 17:58 a
-rw-r--r-- 1 hartmann users 0 2006-06-09 17:58 b
-rw-r--r-- 1 hartmann users 0 2006-06-09 17:58 c
-rw-r--r-- 1 hartmann users 0 2006-06-09 17:58 e
-rw-r--r-- 1 hartmann users 0 2006-06-09 17:58 d

time elapsed parallel: 2
$
The script:

hartmann@ds0050:~/learning/forkArena> cat parallelize_em_demo.pl
#!/usr/bin/perl
use strict;
use warnings;
use Carp qw(confess);

my @commands = map { "touch $_; ls -l $_; sleep 2"; } qw(a b c d e);

my ( $time_start, $time_elapsed_serial, $time_elapsed_parallel );

$time_start = time();

# does first, waits two seconds, does the second, waits two seconds, etc.
# (should take about ten seconds)
for my $bash_command ( @commands ) {
    run_bash_command($bash_command);
}
$time_elapsed_serial = time() - $time_start;
print "\ntime elapsed serial: $time_elapsed_serial\n\n";

# does commands in parallel. (should take about two seconds)
my $parallel_running_command = parallelize_bash_commands([@commands]);
$time_start = time();
run_bash_command($parallel_running_command);
$time_elapsed_parallel = time() - $time_start;
print "\ntime elapsed parallel: $time_elapsed_parallel\n\n";

sub run_bash_command {
    my $command = shift or die "no command";
    print "$command\n";
    print `$command`;
}

sub parallelize_bash_commands {
    my $commands = shift or confess "no commands";
    ref($commands) eq 'ARRAY' or confess "not an array";
    my $parallel_running_command = "";
    for my $command ( @$commands ) {
        $parallel_running_command .= "( ( $command ) & );";
    }
    return $parallel_running_command;
}

Replies are listed 'Best First'.
Re: Using perl to speed up a series of bash commands by transforming them into a single command that will run everything in parallel.
by Zaxo (Archbishop) on Jun 09, 2006 at 19:06 UTC

    Perl has finer control over forking than any shell; it rivals the system C libraries in that regard. You may enjoy comparing your code to perl's native system calls for the job.

    my %kid;
    for (@commands) {
        defined(my $cpid = fork) or sleep 1, redo;
        $cpid and $kid{$cpid} = 1, next;   # parent
        %kid = ();                         # child
        exec '/bin/bash', '-c', $_;        # thanks, ikegami
        exit 1;
    }
    delete $kid{+wait} while %kid;
    print "@{[times]}\n";

    After Compline,
    Zaxo

      Or using some handy module...
      use Proc::Queue qw(system_back all_exit_ok), size => 8;
      # this ensures that, at most, 8 child processes run at any time
      my @pids = map { system_back $_ } @commands;
      all_exit_ok(@pids) or warn "some processes failed\n";
Re: Using perl to speed up a series of bash commands by transforming them into a single command that will run everything in parallel.
by ambrus (Abbot) on Jun 09, 2006 at 18:05 UTC

    Just let me note that in bash, & is a command separator just like ; is, so you can simply write

    cmd1 & cmd2 &
    instead of
    ( ( cmd1 ) & ); ( ( cmd2 ) & );
Re: Using perl to speed up a series of bash commands by transforming them into a single command that will run everything in parallel.
by graff (Chancellor) on Jun 10, 2006 at 15:44 UTC
    I wrote a script to demonstrate the difference between running a series of five simple two-second commands "serially" (one after the other) versus running them in parallel. Serially, it takes ten seconds. In parallel, it takes two...
    touch a; ls -l a; sleep 2

    I think your choice of commands for demonstration is a bit too simple -- to the extent that the results may be misleading.

    If you parallelize any heavy processing on a single machine, you will of course see a slowdown in the execution time for any single instance of the process, relative to how long it would take if it weren't running in parallel with other heavy processes.

    Given the nature of multi-processing, there will be a trade-off point somewhere: some number N such that running N processes in parallel will be faster than running them serially, but running N+1 in parallel will be slower than, say, running (N+1)/2 in parallel, followed serially by running the remainder in parallel.

    Mileage will vary depending on how heavy the processing is, and what resources are needed most: memory-bound, cpu-bound and io-bound jobs might show slightly different trade-offs, depending on how you combine them and what your hardware happens to be.
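
    For what it's worth, one rough way to cap the fan-out looks like the sketch below (the commands and the limit of 3 are arbitrary placeholders; picking the right limit is exactly the N question above):

    my @commands = map { "sleep 2; echo $_" } 1 .. 10;   # placeholder workload

    my $limit   = 3;   # arbitrary cap; the "N" discussed above
    my $running = 0;
    for my $cmd (@commands) {
        if ($running >= $limit) {       # block until one child exits
            wait;
            $running--;
        }
        defined(my $pid = fork) or die "fork failed: $!";
        if ($pid == 0) {                # child: hand the command to a shell
            exec '/bin/sh', '-c', $cmd;
            exit 1;
        }
        $running++;                     # parent: one more child in flight
    }
    1 while wait != -1;                 # reap the stragglers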

      This is completely true. The only reason this looks faster is that the commands are just sleeping for 2 seconds. If they each had 2000 milliseconds worth of actual processing to do, running them in parallel would still leave you with 10 seconds of processor time needed. Parallelism doesn't magically give you more processors. Parallel processing is useful when you can compute lots of partial results simultaneously that can then be combined as inputs to another algorithm.
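
      To make that last point concrete, here is a toy sketch of "partial results combined later" (the slice sizes and numbers are just an example): each child sums one slice of a list and writes its partial sum down a pipe, and the parent adds the partial sums together.

      use strict;
      use warnings;

      my @slices  = ( [ 1 .. 250 ], [ 251 .. 500 ], [ 501 .. 750 ], [ 751 .. 1000 ] );
      my @readers;

      for my $slice (@slices) {
          pipe( my $reader, my $writer ) or die "pipe failed: $!";
          defined( my $pid = fork ) or die "fork failed: $!";
          if ( $pid == 0 ) {               # child: compute one partial result
              close $reader;
              my $sum = 0;
              $sum += $_ for @$slice;
              print {$writer} "$sum\n";
              exit 0;
          }
          close $writer;                   # parent keeps only the read end
          push @readers, $reader;
      }

      my $total = 0;
      $total += readline($_) for @readers; # combine the partial results
      1 while wait != -1;                  # reap the children
      print "total: $total\n";             # prints 500500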
Re: Using perl to speed up a series of bash commands by transforming them into a single command that will run everything in parallel.
by ww (Archbishop) on Jun 09, 2006 at 16:30 UTC
    ++ (I think) for the specificity of your title!
    (:<})
Re: Using perl to speed up a series of bash commands by transforming them into a single command that will run everything in parallel.
by ikegami (Patriarch) on Jun 09, 2006 at 19:11 UTC
    The title of your post and your variable names refer to the use of bash, but you're using sh.

      On many systems I see nowadays, /bin/sh is a symlink to /bin/bash ... there are a lot of people in the world who don't know there used to be a simple shell called 'sh' that didn't have all of bash's bells and whistles.

        Doesn't bash behave differently when called as sh? I seem to recall that from a long time ago. On my system bash and sh are different.
        Are you perhaps seeing mostly Linux systems? All major Linux distros seem to link /bin/sh to /bin/bash, as you said, but I haven't seen that very often outside the Linux realm. e.g., HP-UX doesn't install bash at all by default and instead has ksh as its standard shell.
Re: Using perl to speed up a series of bash commands by transforming them into a single command that will run everything in parallel.
by vkon (Curate) on Jun 12, 2006 at 13:26 UTC
    why
    my $parallel_running_command = parallelize_bash_commands([@commands]);
    and not the simpler and more efficient
    my $parallel_running_command = parallelize_bash_commands(\@commands);
    Here is the same but simpler, and thus more maintainable, version of the parallelize_bash_commands function:
    sub parallelize_bash_commands {
        return join '', map { "( ( $_ ) & );" } @$_;
    }
    Why are you checking the correctness of the passed array ref in this sub? Let Perl do it.
    Instead, you would do better to check whether the passed commands contain "dangerous" characters like "(", "&", etc., which will break your sub in a much worse way!
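
    A minimal version of such a check might look like this sketch (illustrative only; the crude blacklist is not a real shell-quoting solution, it just rejects anything that would break the generated "( ( ... ) & );" wrapper):

    sub assert_safe_commands {
        my $commands = shift;
        for my $cmd (@$commands) {
            die "command '$cmd' contains unsafe characters\n"
                if $cmd =~ /[()&]/;   # characters that would break the wrapper
        }
    }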
      A very good tip. I thought I was defending against, for example, passing in a hashref by mistake. But it appears that perl will throw an error in this case. Yay perl!

      However, I had to modify the sub you proposed to get it to work:

       return join '', map { "( ( $_ ) & );" } @{ $_[0] };

      Here's the code called with a hashref, which throws a nice error.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Carp qw(confess);

      my @commands = map { "echo $_; sleep 2"; } qw(a b c d e);

      # does first, waits two seconds, does the second, waits two seconds, etc.
      # (should take about ten seconds)
      #for my $bash_command ( @commands ) {
      #    time_bash_command($bash_command);
      #}

      # does commands in parallel. (should take about two seconds)
      my $parallel_running_command = parallelize_bash_commands([@commands]);
      $parallel_running_command = parallelize_bash_commands(
          { a => 'sleep 2', b => 'sleep 2', c => 'sleep 2' }
      );

      sub parallelize_bash_commands {
          return join '', map { "( ( $_ ) & );" } @{ $_[0] };
          #return join '', map { "( ( $_ ) & );" } @{ my $input = shift or die "no input" };
          # alternative -- also works, and checks if you forgot to pass in a var.
          time_bash_command($parallel_running_command);
      }

      sub time_bash_command {
          my $command = shift or die "no command";
          $command = "time `$command`";
          print "$command\n";
          print `$command`;
      }
Re: Using perl to speed up a series of bash commands by transforming them into a single command that will run everything in parallel.
by rvosa (Curate) on Jun 23, 2006 at 03:23 UTC
    I think 'serialization' means translating a data structure from one medium to another (from memory to a file, say - or maybe between programming languages). For example, it's what you might use Storable, Data::Dumper or Python::Serialise::Pickle for.
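
    For instance, a minimal Storable round trip looks like this (the file name and data are made up for illustration):

    use Storable qw(store retrieve);

    # serialization in the data sense: dump a structure to disk, read it back
    my %config = ( workers => 8, shell => '/bin/bash' );
    store( \%config, 'config.stor' );     # memory -> file
    my $copy = retrieve('config.stor');   # file -> memory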