http://qs1969.pair.com?node_id=1180384

jnarayan81 has asked for the wisdom of the Perl Monks concerning the following question:

I am new to parallel programming, and today I decided to test the Perl ForkManager module.

I am reading a multi-FASTA infile and calculating the ATCG percentage for each sequence. I tried forking it into 5 different processes to speed it up. Unfortunately, it takes more time with ForkManager than without it. What am I doing wrong in the following code?

#!/usr/bin/perl
use strict;
use Parallel::ForkManager;
use Bio::SeqIO;

# usage: perl testParallel.pl <multi fasta infile>
# download the fasta file at
# http://bioinformaticsonline.com/file/view/30673/test-multifasta-data

my %sequences;
my $seqio = Bio::SeqIO->new(-file => "$ARGV[0]", -format => "fasta");
while (my $seqobj = $seqio->next_seq) {
    my $id  = $seqobj->display_id; # there's your key
    my $seq = $seqobj->seq;        # and there's your value
    $sequences{$id} = $seq;
}

my $max_procs = 5;
my @names = keys %sequences;

# hash to resolve PIDs back to child-specific information
my $pm = new Parallel::ForkManager($max_procs);

# Set up a callback for when a child finishes up so we can
# get its exit code
$pm->run_on_finish(
    sub {
        my ($pid, $exit_code, $ident) = @_;
        #print "** $ident just got out of the pool with PID $pid and exit code: $exit_code\n";
    }
);

$pm->run_on_start(
    sub {
        my ($pid, $ident) = @_;
        #print "** $ident started, pid: $pid\n";
    }
);

$pm->run_on_wait(
    sub {
        #print "** Have to wait for one child ...\n"
    },
    0.5
);

NAMES: foreach my $child (0 .. $#names) {
    my $pid = $pm->start($names[$child]) and next NAMES;
    checkATCG($names[$child]);
    $pm->finish($child); # pass an exit code to finish
}

print "Waiting for Children...\n";
$pm->wait_all_children;
print "Everybody is out of the pool!\n";

sub checkATCG {
    my $name   = shift;
    my $DNA    = $sequences{$name};
    my $length = length $DNA;
    my $a = ($DNA =~ tr/A//);
    my $b = ($DNA =~ tr/C//);
    my $c = ($DNA =~ tr/G//);
    my $d = ($DNA =~ tr/T//);
    my $Total = $a + $b + $c + $d;
    my $GC = ($DNA =~ s/GC/GC/g);
    my $AT = ($DNA =~ s/AT/AT/g);
    my $GCper = ($GC / ($Total) * 100);
    print "$name\t$Total\t$AT\t$GC\t$GCper\n";
}

Replies are listed 'Best First'.
Re: Perl::ForkManager does not speed up ATCG calculation !!
by BrowserUk (Patriarch) on Jan 26, 2017 at 14:50 UTC
    What am I doing wrong in following code?

    You are doing so little work in each of your processes (running checkATCG() against a single sequence) that the cost of starting up and shutting down each process is (far) greater than the cost of doing the work.
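    One way to amortize that per-process overhead is to hand each child a whole batch of sequences instead of a single one. Here is a minimal sketch of the batching idea, using core fork() rather than Parallel::ForkManager (with the same structure, you would wrap the per-batch loop in $pm->start/$pm->finish) and toy data standing in for the %sequences hash built from the FASTA file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stand-in for the %sequences hash from the original script.
my %sequences = map { ( "seq$_" => "ACGT" x 100 ) } 1 .. 20;
my @names     = sort keys %sequences;
my $max_procs = 5;

# Deal the sequence names round-robin into one batch per worker,
# so each forked child processes ~ @names/$max_procs sequences
# instead of just one.
my @batches;
my $i = 0;
push @{ $batches[ $i++ % $max_procs ] }, $_ for @names;

for my $batch (@batches) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ( $pid == 0 ) {                       # child: crunch its whole batch
        length $sequences{$_} for @$batch;   # stand-in for checkATCG()
        exit 0;
    }
}
1 while wait() != -1;                        # parent: reap all children

printf "%d batches, %d sequences total\n",
    scalar @batches, scalar( map { @$_ } @batches );
```

    With 20 sequences and 5 workers, each child now does 4 sequences' worth of work per fork, and the fixed fork/exit cost is paid 5 times instead of 20.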


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Perl::ForkManager does not speed up ATCG calculation !!
by davido (Cardinal) on Jan 26, 2017 at 21:34 UTC

    Though these are micro-optimizations, consider consolidating your tr/// calls into my $Total = $DNA =~ tr/ACGT//;. Make your s/GC/GC/g and s/AT/AT/g calls look like this instead: my $GC = () = $DNA =~ m/GC/g;. And in the while loop, avoid making throwaway copies of the sequence that are only used once.
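    Both idioms in one place, on a short made-up sequence: tr/// in scalar context returns the number of characters matched, and assigning a match to an empty list forces list context so the count of global matches comes back without modifying the string the way s/GC/GC/g does:

```perl
use strict;
use warnings;

my $DNA = 'ATGCGCATATGC';

# One tr/// pass counts all four bases at once (tr returns the
# number of characters it matched when used in scalar context).
my $Total = $DNA =~ tr/ACGT//;

# The empty-list assignment forces list context on the m//g,
# then the scalar assignment takes the count of matches.
my $GC = () = $DNA =~ m/GC/g;
my $AT = () = $DNA =~ m/AT/g;

print "$Total $GC $AT\n";    # → 12 3 3
```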

    The bigger issue is that the portion of the code you shifted into subprocesses is not as processor-bound as you might think. It's likely that if you were to profile (Devel::NYTProf) the code before you converted it to a forking solution, you would discover most of the time is spent in the $seqio->next_seq calls in the while loop, and that's one area where there's not much you can do about it.

    That call is reading from a stream, and the stream probably has bandwidth limited by the characteristics of the device you are reading from, and forking would be minimally effective or even negatively impactful at the reading stage.

    If you are processing many files, you might be able to spread the gathering of those files across several physical devices and then fork a child for each file you wish to process. No single file would run faster, but the overall effect would probably produce savings.
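    That per-file approach could be sketched as follows. The file paths are made up, and process_file() is a hypothetical stand-in for the Bio::SeqIO loop plus checkATCG(); each child owns one file end to end, so the reads on different devices can proceed in parallel:

```perl
use strict;
use warnings;

# Hypothetical input files, ideally spread across physical devices.
my @files = ( '/data1/a.fasta', '/data2/b.fasta', '/data3/c.fasta' );

my @pids;
for my $file (@files) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ( $pid == 0 ) {
        # process_file($file);   # hypothetical: Bio::SeqIO loop + checkATCG()
        exit 0;
    }
    push @pids, $pid;
}
waitpid $_, 0 for @pids;         # wait for every child to finish

print scalar(@pids), " children finished\n";
```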


    Dave