in reply to Re: Multi-threads newbie questions
in thread Multi-threads newbie questions

Thank you BrowserUk. Perhaps I should use 'Thread::Queue' then?

In any case, I will describe my application in short, as you requested:

I'm processing many genomes. Each genome is stored in a hash, which includes some basic data about the genome (organism, size etc.) and also many file locations (genome sequence etc.). Each genome hash is what I previously referred to as an 'internal hash'. All those hashes are stored together in one big hash.

The 'helper' sub, which we can now call 'process_genome', takes care of a single genome. It does some stuff, including calling external scripts which e.g. convert file formats, and adds key-value pairs to the genome hash, e.g. new file locations.

I would like to process all genomes. Since I have 8 cores on my server, I would like to use multi-threading. I would like to give as input a hash of (genome) hashes, and get back a similar structure, but updated.

That's all, I think.

Re^3: Multi-threads newbie questions
by BrowserUk (Patriarch) on Sep 20, 2010 at 12:44 UTC

    BTW. The easiest, safest way to approach threading complex applications is always to write a single-threaded version that operates upon the data in a serial fashion.

    Once you have that working, if the data is truly independent, parallelising it is usually quite simple.
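    For instance, a minimal single-threaded version might look like this (a sketch only: the genome fields and the body of process_genome here are invented stand-ins, not the OP's actual code):

```perl
#! perl -slw
use strict;
use warnings;

## Hypothetical stand-in for the OP's 'process_genome' helper:
## it would normally run external scripts, convert formats, etc.,
## then record the new file locations in the genome hash.
sub process_genome {
    my $genome = shift;
    $genome->{CONVERTED} = "$genome->{SEQ_FILE}.converted";
}

## Toy hash-of-hashes: one inner hash per genome.
my %genomes = (
    ecoli => { ORGANISM => 'E. coli',     SEQ_FILE => 'ecoli.fa' },
    bsubt => { ORGANISM => 'B. subtilis', SEQ_FILE => 'bsubt.fa' },
);

## Serial pass: one genome at a time, no threads involved.
process_genome( $genomes{ $_ } ) for keys %genomes;

print "$_: $genomes{ $_ }{CONVERTED}" for sort keys %genomes;
```

    Once this produces correct results, each call to process_genome is a natural unit of work to hand to a thread.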

    For completeness, here is another example based on yours that uses a pool of threads. It is hardly more complicated than the first version:

    #! perl -slw
    use strict;
    use threads;
    use threads::shared;
    use Thread::Queue;
    use Data::Dump qw[ pp ];

    sub helper {
        my $Q = shift;
        while( my $ref = $Q->dequeue ) {
            lock $ref;
            $ref->{NEW_KEY} = 1;
        }
    }

    sub my_sub {
        my( $ref, $n ) = @_;
        my $Q = new Thread::Queue;
        my @threads = map async( \&helper, $Q ), 1 .. $n;
        $Q->enqueue( values %{ $ref } );
        $Q->enqueue( (undef) x $n );
        $_->join for @threads;
    }

    my $hoh = {
        A => shared_clone( { NAME => 'aa' } ),
        B => shared_clone( { NAME => 'bb' } ),
    };

    pp $hoh;
    my_sub( $hoh, 2 );
    pp $hoh;

    The output is identical to the earlier version.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: Multi-threads newbie questions
by BrowserUk (Patriarch) on Sep 20, 2010 at 12:37 UTC

    Okay. Here's a very simple example (that works :), based on yours above:

    #! perl -slw
    use strict;
    use threads;
    use threads::shared;
    use Data::Dump qw[ pp ];

    sub helper {
        my $ref = shift;
        ## Not needed if no more than one thread will access each subhash
        ## lock $ref;
        $ref->{NEW_KEY} = 1;
    }

    sub my_sub {
        my $ref = shift;
        my @threads = map async( \&helper, $_ ), values %{ $ref };
        $_->join for @threads;
    }

    my $hoh = {
        A => shared_clone( { NAME => 'aa' } ),
        B => shared_clone( { NAME => 'bb' } ),
    };

    pp $hoh;
    my_sub( $hoh );
    pp $hoh;

    Output:

        {
          A => {
                 # tied threads::shared::tie
                 NAME => "aa",
               },
          B => {
                 # tied threads::shared::tie
                 NAME => "bb",
               },
        }
        {
          A => {
                 # tied threads::shared::tie
                 NAME => "aa",
                 NEW_KEY => 1,
               },
          B => {
                 # tied threads::shared::tie
                 NAME => "bb",
                 NEW_KEY => 1,
               },
        }

      Thank you BrowserUk.

      Now please allow me some follow-ups.

      First, regarding the creation of $hoh. I tried replacing your

      my $hoh = {
          A => shared_clone( { NAME => 'aa' } ),
          B => shared_clone( { NAME => 'bb' } ),
      };
      with
      my $hoh = {
          A => { NAME => 'aa' },
          B => { NAME => 'bb' },
      };
      $hoh = shared_clone( $hoh );
      Why doesn't this work? Doesn't 'shared_clone' do deep sharing?

      This is not just a hypothetical question. I get my $hoh unshared (I might even retrieve it); that is, I do not create it at the same time as I call this piece of code. So how can I take an unshared $hoh and use it here?

      Second, if I understand correctly, you start a thread for each element of hoh. There are usually a few hundred elements there, so I guess that's not such a good idea. That's why I originally wanted to use a thread pool, until you advised me otherwise.

      Thanks again.

        Why doesn't this work?

        Really? Works for me:

        #! perl -slw
        use strict;
        use threads;
        use threads::shared;
        use Thread::Queue;
        use Data::Dump qw[ pp ];

        sub helper {
            my $Q = shift;
            while( my $ref = $Q->dequeue ) {
                lock $ref;
                $ref->{NEW_KEY} = 1;
            }
        }

        sub my_sub {
            my( $ref, $n ) = @_;
            my $Q = new Thread::Queue;
            my @threads = map async( \&helper, $Q ), 1 .. $n;
            $Q->enqueue( values %{ $ref } );
            $Q->enqueue( (undef) x $n );
            $_->join for @threads;
        }

        my $hoh = {
            A => { NAME => 'aa' },
            B => { NAME => 'bb' },
        };
        $hoh = shared_clone( $hoh );

        pp $hoh;
        my_sub( $hoh, 2 );
        pp $hoh;

        __END__
        C:\test>junk39
        {
          A => {
                 # tied threads::shared::tie
                 NAME => "aa",
               },
          B => {
                 # tied threads::shared::tie
                 NAME => "bb",
               },
        }
        {
          A => {
                 # tied threads::shared::tie
                 NAME => "aa",
                 NEW_KEY => 1,
               },
          B => {
                 # tied threads::shared::tie
                 NAME => "bb",
                 NEW_KEY => 1,
               },
        }
        Second, if I understand correctly, you start a thread for each element of hoh. There are usually a few hundred elements there, so I guess that's not such a good idea. That's why I originally wanted to use a thread pool, until you advised me otherwise.

        I didn't advise you against using a pool of threads. Only against modules that purport to make using a thread pool "simple", in the most complicated (and broken) ways imaginable.

        The code above (also posted at 860833) implements a pool of threads. I posted the non-pooled version first, in order to show how simple the transition from a non-pooled to a pooled solution is. How one is a very small extension of the other.

        And why I've never written or used a module to do it. It isn't necessary.

        Indeed, the philosophy behind most of those modules is utterly wrong. They manage the number of threads according to how much work is in the queue, for each thread!?! Which is a nonsense, because the number of cores doesn't vary. The processing power of the CPU doesn't vary.

        So, at exactly the moment when the CPU is already overloaded--as indicated by the fact that the queue is growing faster than the existing pool of threads can keep up with--what do they do? They start another thread!

        Which just means that they've stolen a bunch of cycles to start that thread. And now there is one more thread competing for resources, and having to be task switched by the system scheduler. Which simply slows the throughput of all of the threads.

        This asinine methodology is aped from "fork pool" modules, which are equally broken for the same reasons.

        The above pooling mechanism starts N threads. You, the programmer, decide how many threads to run by trial and error. For CPU-bound processing, start with N=cores. For IO-bound, try N=cores * 2 or 3 or 4. You'll quickly find a number that works for your application, then make it the default. Move to a different system with more or fewer cores, adjust it accordingly.
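        One way to keep N tunable per machine is the -s switch that is already on the shebang line of the examples above: it turns a command-line argument like -T=8 into the package variable $T. (A sketch only; the -T switch name and the toy work items are my own, not from the code above.)

```perl
#! perl -slw
# Run as:  perl pool.pl -T=8
# The -s on the shebang line parses -T=8 into $main::T.
use strict;
use threads;
use Thread::Queue;

our $T //= 4;    # pool size: start with N = cores, tune by trial and error

my $Q = Thread::Queue->new;

## Start a fixed pool of $T workers, all reading the same queue.
my @pool = map threads->create( sub {
    while( defined( my $item = $Q->dequeue ) ) {
        print "tid ", threads->tid, " processed item $item";
    }
} ), 1 .. $T;

$Q->enqueue( 1 .. 20 );
$Q->enqueue( ( undef ) x $T );    # one terminator per worker
$_->join for @pool;
```

        Moving to a bigger or smaller box then needs no code change at all, just a different -T on the command line.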

