http://qs1969.pair.com?node_id=1165955

heavenfeel has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

This is Michael, a bioinformatician working on DNA sequencing data (fastq files).

I would like to ask how I could use more than one thread for a single calculation/process that cannot be split into smaller pieces.

Since one thread takes 25 minutes to filter one fastq.gz file, I would like to use more than one thread to filter a file in order to speed up the whole process. Is it possible?

Thank you very much!!!

Re: Use more than one threads for one file processing
by hippo (Bishop) on Jun 17, 2016 at 08:37 UTC
    Since one thread takes 25 minutes to filter one fastq.gz file, I would like to use more than one thread to filter a file in order to speed up the whole process. Is it possible?

    Maybe, but that will depend strongly on what "filter" means. If the "filter" is a fairly trivial procedure then likely your process will be IO-bound and multiple threads won't help you. Conversely, if "filter" is CPU intensive and you have multiple cores on the hardware then threads could give an impressive speed-up.

    If you don't know how heavy "filter" is, profile it (e.g. with Devel::NYTProf).
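
    A quick first check, before reaching for a profiler: run the filter and watch its CPU usage (e.g. with top). A process pinned near 100% of one core is CPU-bound; a low CPU percentage suggests it is IO-bound. A typical NYTProf run would look something like this (the script name and arguments are just placeholders):

        # hypothetical invocation; adjust the script name and arguments
        perl -d:NYTProf filter_fastq.pl in.fastq.gz > filtered.fastq
        nytprofhtml    # reads nytprof.out, writes an HTML report under ./nytprof/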

Re: Use more than one threads for one file processing
by Discipulus (Canon) on Jun 17, 2016 at 08:47 UTC
    Hello heavenfeel and welcome to the Monastery!

    Generally speaking, general questions get general answers: specific questions have a better chance of getting good quality answers.

    That said, I have to admit I don't have that much experience with threaded applications, but I do have experience with the Monastery, so I can provide my little help. Remember, Super Search is your friend!

    What I have learned reading other monks' posts is that writing a good quality Perl multithreaded program is not easy. But not impossible. Many monks here around are able to do it. Among them I warmly suggest posts by BrowserUk, zentara, marioroy, NetWallah and also some practicing ones like karlgoethbier. Search their posts and you'll find treasures!

    While BrowserUk writes his software using standard Perl modules, marioroy has developed his own module, MCE, which seems very interesting and is probably the easier solution to adopt. MCE::Loop, if I recall correctly, is aimed at processing large files in parallel.

    Some useful reads are: Help me beat NodeJS and Fast provider feeding slow consumer, but you can find more in my confusing homenode.

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; maybe one day you reinvent one of THE WHEELS.
      Thanks a lot!! There is so much information I need to search through~
Re: Use more than one threads for one file processing
by QM (Parson) on Jun 17, 2016 at 16:16 UTC
    Following up on hippo's reply: if the CPU usage is minimal with your current solution, then it's IO-bound. In that case, there's not much you can do.

    If the CPU is already maxed out across all cores, then adding more threads won't help either.

    Only if you can split the file up across multiple disks/hosts/cores and run multiple processes will you get much benefit. Or perhaps have a server feeding lines out to other processes on other hosts/cores and collecting the results (see the sketch below).

    It's difficult to get around IO-bound or CPU-bound problems without splitting the work up between IO systems or CPU systems.
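
    As an illustration of the "server feeding lines out" idea, here is a minimal sketch of my own (untested against real data; the file name and the /biopattern/ filter are placeholders): a parent deals 4-line fastq records round-robin to forked workers over pipes. Note that output order is not preserved, and concurrent writes can interleave for very large records; a real tool would write per-worker output files.

        use strict;
        use warnings;

        my $workers = 4;
        my @pipes;

        for ( 1 .. $workers ) {
            pipe( my $r, my $w ) or die "pipe: $!";
            my $pid = fork() // die "fork: $!";
            if ( $pid == 0 ) {                 # child: filter its share
                close $w;
                close $_ for @pipes;           # write ends inherited from earlier forks
                while ( !eof($r) ) {
                    my $rec = join '', map { scalar <$r> } 1 .. 4;  # one record
                    print $rec if $rec =~ /biopattern/;             # the "filter"
                }
                exit 0;
            }
            close $r;
            push @pipes, $w;
        }

        open my $fh, "gunzip -c in.fastq.gz |" or die "open error: $!";
        my $i = 0;
        while ( !eof($fh) ) {
            my $rec = join '', map { scalar <$fh> } 1 .. 4;   # 4-line fastq record
            print { $pipes[ $i++ % $workers ] } $rec;         # deal round-robin
        }
        close $fh;
        close $_ for @pipes;           # send EOF to the workers
        wait() for 1 .. $workers;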

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: Use more than one threads for one file processing
by marioroy (Prior) on Jun 17, 2016 at 23:53 UTC

    Hi heavenfeel,

    The following is a demonstration using the MCE::Loop module. What is nice about MCE is the ability to process a file whose records span multiple lines. The record separator for a fastq file is "\n@", which anchors @ at the start of a line. MCE detects "\n" at the start of the record separator.

    MCE is a chunking engine allowing a worker to receive several records at a time. The effect is a reduction in the number of trips to and from the MCE-manager process.

    The logic below allows one to search for multiple patterns. Simply change the patterns to suit your needs. Perhaps this could be extended to read patterns stored in a file. Anyway, this is a small MCE demonstration. The MCE->print statement prints the entire record to STDOUT.

    use strict;
    use warnings;

    use MCE::Loop;

    my @patterns = ( "biopattern1", "biopattern2", "biopattern3" );
    my $search   = join('|', @patterns);
    my $regex    = qr/$search/;

    # read the gzipped fastq via a gunzip pipe
    open my $fh, "gunzip -c in.fastq.gz |" or die "open error: $!";

    # "\n@" anchors @ at the start of a line; MCE handles the leading "\n"
    MCE::Loop->init( max_workers => 4, chunk_size => 50, RS => "\n@" );

    MCE::Loop->run( sub {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;
        for my $i ( 0 .. $#{ $chunk_ref } ) {
            if ( $chunk_ref->[$i] =~ /$regex/ ) {
                MCE->print( $chunk_ref->[$i] );   # print whole record to STDOUT
            }
        }
    }, $fh );

    MCE::Loop->finish();
    close $fh;
    Kind regards, Mario

      The following preserves output order, if desired. Here, the chunk_id value is used to ensure ordered entry into the @found array. It is important for workers to gather even when sending an empty @results, because the Manager process must know that a given chunk_id has completed.

      use strict;
      use warnings;

      use MCE::Loop;
      use MCE::Candy;

      my @patterns = ( "biopattern1", "biopattern2", "biopattern3" );
      my $search   = join('|', @patterns);
      my $regex    = qr/$search/;

      open my $fh, "gunzip -c in.fastq.gz |" or die "open error: $!";

      my @found;

      # out_iter_array writes gathered chunks into @found in chunk_id order
      MCE::Loop->init(
          max_workers => 4, chunk_size => 50, RS => "\n@",
          gather => MCE::Candy::out_iter_array(\@found)
      );

      MCE::Loop->run( sub {
          my ( $mce, $chunk_ref, $chunk_id ) = @_;
          my @results;
          for my $i ( 0 .. $#{ $chunk_ref } ) {
              if ( $chunk_ref->[$i] =~ /$regex/ ) {
                  push @results, $chunk_ref->[$i];
              }
          }
          MCE->gather( $chunk_id, @results );   # gather even if @results is empty
      }, $fh );

      MCE::Loop->finish();
      close $fh;

      print join('', @found);

      Next, workers send data to the manager process via MCE->gather instead of sending to STDOUT.

      use strict;
      use warnings;

      use MCE::Loop;

      my @patterns = ( "biopattern1", "biopattern2", "biopattern3" );
      my $search   = join('|', @patterns);
      my $regex    = qr/$search/;

      open my $fh, "gunzip -c in.fastq.gz |" or die "open error: $!";

      MCE::Loop->init( max_workers => 4, chunk_size => 50, RS => "\n@" );

      # gathered records are returned by run() to the manager process
      my @found = MCE::Loop->run( sub {
          my ( $mce, $chunk_ref, $chunk_id ) = @_;
          for my $i ( 0 .. $#{ $chunk_ref } ) {
              if ( $chunk_ref->[$i] =~ /$regex/ ) {
                  MCE->gather( $chunk_ref->[$i] );
              }
          }
      }, $fh );

      MCE::Loop->finish();
      close $fh;

      print join('', @found);
Re: Use more than one threads for one file processing
by Anonymous Monk on Jun 17, 2016 at 19:03 UTC

    The question is: how does one utilize parallel processing resources?

    The short answer is: one uses the parallelized version of the crunching routine(s). See also: Amdahl's law.
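
    For reference, Amdahl's law bounds the speedup from parallelizing a fraction p of the runtime across N workers:

        S(N) = 1 / ( (1 - p) + p / N )

    For example, if (say) 75% of the 25-minute filter parallelizes across 4 cores, the bound is 1 / (0.25 + 0.75/4) ≈ 2.3x, i.e. roughly 11 minutes; the serial fraction (here, likely the gunzip stream and IO) quickly dominates. The 75% figure is only an illustration.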

Re: Use more than one threads for one file processing
by Cow1337killr (Monk) on Jul 03, 2016 at 10:23 UTC

    If you want my honest opinion, I think the work of bioinformaticians is important, especially if it has anything remotely to do with curing some disease that may shorten MY life.

    So, I recommend that you go to your boss and tell him you either need a faster computer OR you need yet another computer.

    I think it is a waste of your precious time to be learning how to coax Perl to use multiple threads. Your time is better spent on doing what you do best: bioinformatics.

    Oh, and keep us informed.

      A lot of effort went into the MCE module to assist bioinformaticians with using multiple cores. MCE includes logic that allows one to process input by specifying a record separator. In the OP's case, "\n@" anchors "@" at the start of a line. Workers receive record(s) beginning with "@", not "\n@".

      Basically, MCE wraps serial code with a little extra code to consume multiple cores.
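
      As a minimal illustration of that point (my own sketch, reusing the API from the examples above; /biopattern/ and $fh are placeholders), here is a serial filter loop next to its MCE-wrapped equivalent:

          # serial: one core scans every line
          while ( my $line = <$fh> ) {
              print $line if $line =~ /biopattern/;
          }

          # parallel: the same filter wrapped with MCE::Loop
          use MCE::Loop;
          MCE::Loop->init( max_workers => 4, chunk_size => 50 );
          MCE::Loop->run( sub {
              my ( $mce, $chunk_ref, $chunk_id ) = @_;
              MCE->print($_) for grep { /biopattern/ } @{ $chunk_ref };
          }, $fh );
          MCE::Loop->finish();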

      Regards, Mario.