Perl::ForkManager does not speed up ATCG calculation !!

jnarayan81 has asked for the wisdom of the Perl Monks concerning the following question:

I am new to parallel programming, and today I decided to test the Parallel::ForkManager module.

I am reading a multi-FASTA input file and calculating the ATCG percentage for each sequence. I tried to fork the work into 5 child processes to speed it up. Unfortunately, it takes more time with ForkManager than without it. What am I doing wrong in the following code?

 
#!/usr/bin/perl
use strict;
use Parallel::ForkManager;
use Bio::SeqIO;

# usage: perl testParallel.pl <multi fasta infile>
# download the fasta file at http://bioinformaticsonline.com/file/view/30673/test-multifasta-data

my %sequences;
my $seqio = Bio::SeqIO->new(-file => "$ARGV[0]", -format => "fasta");
while(my$seqobj = $seqio->next_seq) {
    my $id  = $seqobj->display_id;    # there's your key
    my $seq = $seqobj->seq;           # and there's your value
    $sequences{$id} = $seq;
}

  my $max_procs = 5;
  my @names = keys %sequences;

  # run at most $max_procs children at the same time
  my $pm = Parallel::ForkManager->new($max_procs);

  # Set up a callback for when a child finishes so we can
  # get its exit code
  $pm->run_on_finish (
    sub { my ($pid, $exit_code, $ident) = @_;
      #print "** $ident just got out of the pool ".
      #  "with PID $pid and exit code: $exit_code\n";
    }
  );

  $pm->run_on_start(
    sub { my ($pid,$ident)=@_;
     #print "** $ident started, pid: $pid\n";
    }
  );

  $pm->run_on_wait(
    sub {
      #print "** Have to wait for one children ...\n"
    },
    0.5
  );

  NAMES:
  foreach my $child ( 0 .. $#names ) {
    my $pid = $pm->start($names[$child]) and next NAMES;
    checkATCG($names[$child]);
    $pm->finish(0); # exit code 0 = success (exit codes are limited to 0..255)
  }

  print "Waiting for Children...\n";
  $pm->wait_all_children;
  print "Everybody is out of the pool!\n";


sub checkATCG {
my $name=shift;
my $DNA=$sequences{$name};
my $length=length $DNA;
my $a=($DNA=~tr/A//);
my $b=($DNA=~tr/C//);
my $c=($DNA=~tr/G//);
my $d=($DNA=~tr/T//);
my $Total=$a+$b+$c+$d;
my $GC=($DNA=~s/GC/GC/g);
my $AT=($DNA=~s/AT/AT/g);
my $GCper=($GC/($Total)*100);
print"$name\t$Total\t$AT\t$GC\t$GCper:\n";

}


Replies are listed 'Best First'.
Re: Perl::ForkManager does not speed up ATCG calculation !! by BrowserUk (Patriarch) on Jan 26, 2017 at 14:50 UTC
"What am I doing wrong in the following code?"

You are doing so little work in each of your processes (running checkATCG() against a single sequence) that the cost of starting up and shutting down that process is (far) higher than the cost of doing the work.
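One way to act on this is to fork once per batch of sequences rather than once per sequence, so the startup cost is paid only a handful of times. The following is a minimal sketch under stated assumptions, not the poster's code: it uses plain `fork`/`waitpid` from core Perl instead of Parallel::ForkManager so that it runs without extra modules, the `%sequences` data is made up, and `make_batches` and `atcg_line` are hypothetical helper names. The same batching would apply to a `$pm->start` / `$pm->finish` loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in data; in the thread this hash is filled from the FASTA file.
my %sequences = (
    seq1 => 'ATGCGCATTA',
    seq2 => 'GGGCCCATAT',
    seq3 => 'ATATATGCNN',
    seq4 => 'CCGGTTAAGC',
);

# Deal the IDs round-robin into at most $n batches.
sub make_batches {
    my ($n, @ids) = @_;
    my @batches;
    push @{ $batches[ $_ % $n ] }, $ids[$_] for 0 .. $#ids;
    return @batches;
}

# Same counts as checkATCG, but returned as a line instead of printed.
sub atcg_line {
    my ($name, $dna) = @_;
    my $total = ($dna =~ tr/ACGT//);      # count A, C, G and T in one pass
    my $gc    = () = $dna =~ /GC/g;       # number of "GC" matches
    my $at    = () = $dna =~ /AT/g;       # number of "AT" matches
    my $gcper = $total ? $gc / $total * 100 : 0;
    return sprintf "%s\t%d\t%d\t%d\t%.2f", $name, $total, $at, $gc, $gcper;
}

my $workers = 2;    # assumption: roughly the number of CPU cores
my @pids;
for my $batch ( make_batches($workers, sort keys %sequences) ) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                      # child: handle the whole batch
        print atcg_line($_, $sequences{$_}), "\n" for @$batch;
        exit 0;
    }
    push @pids, $pid;                     # parent: remember the child
}
waitpid($_, 0) for @pids;                 # parent: wait for every child
```

With thousands of sequences this starts 2 processes instead of thousands. Whether it beats the non-forking version still depends on how much time the counting takes compared with reading the file.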
Re: Perl::ForkManager does not speed up ATCG calculation !! by davido (Cardinal) on Jan 26, 2017 at 21:34 UTC
Though these are microoptimizations, consider consolidating your `tr///` calls to `my $Total = $DNA =~ tr/ACGT//;`. Make your `s/GC/GC/g` and `s/AT/AT/g` calls look like this instead: `my $GC = () = $DNA =~ m/GC/g;`, and in the `while` loop, avoid making throwaway copies of the sequence that are only used once.

The bigger issue really is that you're not as processor bound as you might think you are in the portion of the code you shifted into subprocesses. It's likely that if you were to profile (Devel::NYTProf) the code before you converted it to a forking solution, you would discover most of the time is spent making `$seqio->next_seq` calls in the `while` loop, and that's one area where there's not much you can do about it. That call is reading from a stream, and the stream probably has bandwidth limited by the characteristics of the device you are reading from, and forking would be minimally effective or even negatively impactful at the reading stage.

If you are processing many files, you might be able to spread the gathering of those files across several physical devices and then fork a child for each file you wish to process. No single file would run faster, but the overall effect would probably produce savings.

Dave
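To make the two suggestions above concrete, here is a small sketch. It assumes a made-up sequence in `$DNA`, and `timed` is a hypothetical helper that gives a rough wall-clock figure for one phase using the core Time::HiRes module. It is only a cheap first check, not a replacement for a real profiler run.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

my $DNA = 'ATGCGCATTAGCNN';    # hypothetical sequence for illustration

# One tr/// pass instead of four separate ones.
my $total = ($DNA =~ tr/ACGT//);

# Count matches without rewriting the string: the match runs in list
# context, is assigned to an empty list, and that assignment is then
# read in scalar context, which yields the number of matches.
my $gc = () = $DNA =~ m/GC/g;
my $at = () = $DNA =~ m/AT/g;

print "total=$total GC=$gc AT=$at\n";    # prints: total=12 GC=3 AT=2

# Hypothetical helper: run a code block and report its wall-clock time.
sub timed {
    my ($label, $code) = @_;
    my $t0     = time;
    my @result = $code->();
    printf STDERR "%-8s %.6f s\n", $label, time - $t0;
    return @result;
}

# Wrap each phase of the real script (reading, counting) the same way
# to see which one dominates before deciding what to parallelise.
my ($count) = timed( count => sub { $DNA =~ tr/ACGT// } );
```

For a full per-line breakdown, Devel::NYTProf can be run as `perl -d:NYTProf testParallel.pl <multi fasta infile>` followed by `nytprofhtml` to generate the report.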

