in reply to Re^2: How to vectorize a subroutine--perhapse using PDL piddles or threads
in thread How to vectorize a subroutine--perhapse using PDL piddles or threads

I would like to loop through (or vectorize) many hundreds of values for "inputparameters"

If you need to loop creating many external processes, neither threading nor PDL will give you any benefit with that. Those external processes will already make full use of whatever cores are available so threading won't help, and PDL has nothing to offer either.

Indeed, on any single system (single- or multi-core), you'll need to limit the number of concurrent processes or risk pushing your machine into 'context switch thrashing' or 'virtual memory thrashing', or both, either of which will significantly degrade your throughput rather than enhancing it.

With the IO involved, you may be able to gain throughput by running more than one process per core, thereby utilising cycles that would otherwise be wasted waiting for IO to complete, but in any case, the breakpoint for throughput gains will likely be low multiples of the number of cores available. Maybe 8 or 16 concurrent processes on a 4 core system; or 16 or 32 on an 8 core.
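As a minimal sketch of that idea, using only core Perl: the loop below forks external processes but never lets more than a fixed number run at once. The limit of 4 and the 'sleep 1' stand-in command are placeholder assumptions; you would substitute your own core count multiple and your real external program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $MAX_CONCURRENT = 4;      # assumption: a low multiple of your core count
my @jobs = ( 1 .. 10 );      # stand-ins for your hundreds of input variations
my %kids;                    # pid => job, for the children currently running

for my $job ( @jobs ) {
    # Block until we drop below the concurrency limit.
    while ( keys %kids >= $MAX_CONCURRENT ) {
        my $pid = waitpid( -1, 0 );   # wait for any child to finish
        delete $kids{ $pid };
    }
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ( $pid == 0 ) {
        # Child: replace ourselves with the external program.
        # 'sleep 1' is a placeholder for your real command.
        exec 'sleep', '1' or die "exec failed: $!";
    }
    $kids{ $pid } = $job;    # parent: note the running child and move on
}

# Reap any children still running at the end.
while ( keys %kids ) {
    my $pid = waitpid( -1, 0 );
    delete $kids{ $pid };
}
print "all jobs done\n";
```

The point is only the throttling pattern; the per-core multiple you settle on is something to find by experiment on your own hardware.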

So, trying to start all the processes that create your "hundreds of small files" more quickly than a perl loop can start them doesn't make any sense.

If you can distribute your processes across multiple systems, then there are definite gains to be made. But again, "vectorising the loop" that starts those processes doesn't make sense.

The biggest gains will always come from adapting the algorithms internal to those processes. And to do that, I'd need to see considerably more information about what is happening within the processes. Not necessarily the actual algorithms--though that is always going to be the surest route--but a fairly accurate representation of those algorithms. How much cpu-intensive work--math, regex, etc.--is involved? How much IO is involved? The proportions of those, and how they interleave.

For example, if the process spends 10 seconds reading input, 500 seconds cpu-intensively processing those inputs, and then 10 seconds writing output, then the scope for, and methods of, parallelisation are significantly different than if the processing interleaves as read-process-write, read-process-write.

By way of example of how the details of threading can influence the benefits derived, here are two subtly different methods of parallelising the same cpu-intensive code, and the resultant benefits:

In this first sample, the shared variable $total is added to directly by each thread as it runs the sub x(), over its portion of the input ranges $v1 = 1 .. 100 X $v2 = 1 .. 100. Each time $total is added to, it has to be locked first:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];
use threads;
use Thread::Queue;

sub x {
    my( $str, $v1, $v2 ) = @_;
    my $result = 0;
    $result += ( ord substr $str, $_, 1 ) * $v1 / $v2
        for 0 .. length( $str ) - 1;
    return $result;
}

our $T //= 4;
our $N //= 100;
my $step = int( $N / $T );

my $now = time;
my $total :shared = 0;
my $str = 'the quick brown fox jumps over the lazy dog' x 100;

my( $start, $end ) = ( 0, $step );
my @threads = map{
    my $thread = async{
        for my $v1 ( $start .. $end ) {
            for my $v2 ( 1 .. $N ) {
                lock $total;                    ## Lock
                $total += x( $str, $v1, $v2 );  ## update
            }
        }
    };
    $start += $step + 1;
    $end = $start + $step < $N ? $start + $step : $N;
    $thread;
} 1 .. $T;

$_->join for @threads;

printf STDERR "With $T threads took %.6f seconds\n", time() - $now;
print $total;

__END__
c:\test>859242-2 -T=1 -N=100
With 1 threads took 14.250000 seconds
10711649268.1623

c:\test>859242-2 -T=2 -N=100
With 2 threads took 14.297000 seconds
10711649268.1623

c:\test>859242-2 -T=3 -N=100
With 3 threads took 14.476000 seconds
10711649268.1623

c:\test>859242-2 -T=4 -N=100
With 4 threads took 14.536000 seconds
10711649268.1623

The result is that rather than benefiting from being threaded across 1 to 4 cores, there is a small net loss of throughput with each new thread. This is due to 'lock contention'. The single shared variable becomes a bottleneck.

Whereas, in this example, each thread adds to a local, non-shared variable $subtotal and only updates the shared variable $total once, when it finishes. This means that locking is almost entirely avoided and each thread is free to run flat out for the great majority of the time. With the result that each new thread improves throughput by a substantial proportion, up to the limit of the cores available:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];
use threads;
use Thread::Queue;

sub x {
    my( $str, $v1, $v2 ) = @_;
    my $result = 0;
    $result += ( ord substr $str, $_, 1 ) * $v1 / $v2
        for 0 .. length( $str ) - 1;
    return $result;
}

our $T //= 4;
our $N //= 100;
my $step = int( $N / $T );

my $now = time;
my $total :shared = 0;
my $str = 'the quick brown fox jumps over the lazy dog' x 100;

my( $start, $end ) = ( 0, $step );
my @threads = map{
    my $thread = async{
        my $subtotal = 0;
        for my $v1 ( $start .. $end ) {
            for my $v2 ( 1 .. $N ) {
                $subtotal += x( $str, $v1, $v2 );  ## no lock, local update
            }
        }
        lock $total;          ## lock
        $total += $subtotal;  ## update
    };
    $start += $step + 1;
    $end = $start + $step < $N ? $start + $step : $N;
    $thread;
} 1 .. $T;

$_->join for @threads;

printf STDERR "With $T threads took %.6f seconds\n", time() - $now;
print $total;

__END__
c:\test>859242-2 -T=1 -N=100
With 1 threads took 14.115000 seconds
10711649268.1623

c:\test>859242-2 -T=2 -N=100
With 2 threads took 7.061000 seconds
10711649268.1623

c:\test>859242-2 -T=3 -N=100
With 3 threads took 4.827000 seconds
10711649268.1624

c:\test>859242-2 -T=4 -N=100
With 4 threads took 3.732000 seconds
10711649268.1623

c:\test>859242-2 -T=5 -N=100
With 5 threads took 4.356000 seconds
10711649268.1623

As you can see, the gains from adding extra threads are not linear. Starting a thread costs time, and there is some Perl-internals contention involved, which also diminishes the gains. However, these are relatively short runs, so the ratio of those fixed costs to the processing time is quite high. Doing more work in each thread reduces the fixed proportion relative to the variable proportion, and so the benefits increase.

Same as the second above, but with wider ranges:

c:\test>859242-2 -T=1 -N=250
With 1 threads took 87.819000 seconds
78267011685.3423

c:\test>859242-2 -T=2 -N=250
With 2 threads took 43.342000 seconds
78267011685.3423

c:\test>859242-2 -T=3 -N=250
With 3 threads took 29.350000 seconds
78267011685.3423

c:\test>859242-2 -T=4 -N=250
With 4 threads took 23.113000 seconds
78267011685.3423

Go to -N=1000 and the gains step up again. Still not linear, but closer.

Now, in the scenario you portray, imagine if you could do away with the overhead of starting a new process for each of those hundreds of variations. What gains might be achievable then?

My point is that the broad brush strokes you painted that scenario in make it impossible to advise you in anything other than generalities. And with multiprocessing, small changes can make a huge difference to the outcome.

For us to really advise you effectively, you will need to fill in the details much more finely. It needn't be your actual code, but it will have to be something that closely approximates what your code is doing.

And as a final attempt to persuade you to expend the effort of sanitising your code--protecting your algorithm, but leaving enough of the true flavour of the actual processing that it replicates your run-times and IO/cpu ratios reasonably accurately--consider this: given 4 cores, the best improvement you can get from threading is a run-time of 1/4, and you won't achieve even that. If you have 25 systems with 4 cores each, you might hope to get (say) 1/75th.

But many times (here at PM and elsewhere), I've achieved, or seen, algorithms and/or implementations that achieve 1/250th or 1/1000th of the runtime. That's pure perl, without any threading or other parallelisation involved. And many of those improved algorithms/implementations could then be parallelised to further improve the throughput, giving 1/250 * 1/4, or 1/1000 * 1/75, etc.

Unless you have a substantial HPC facility at your disposal, or can afford to throw your code into someone else's cloud, you won't achieve anything like those kinds of gains from parallelisation alone.

So, as is becoming something of a mantra for me, the more info you can provide, the better your chances of getting good answers. So more info please.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Re^4: How to vectorize a subroutine--perhapse using PDL piddles or threads
by Feynman (Novice) on Sep 08, 2010 at 16:51 UTC
    Ok, you convinced me. You will probably find my code primitive--in my defense, I am still new to perl. I should also confess I am new to parallel computing. I am learning as I go along. I have yet to find a program that can bridge between programs that take input files and create output files, and other programs. I chose perl after considerable research because it is well suited to quick parsing.

    That being said, hacking together this program has been a tremendous effort for me--and I must admit, I had plenty of help from other forums. I figured that anyone else wanting to perform a similar task (in this case, graphing the potential energy of a water molecule as a function of bond length and bonding angle) would go through similar struggles, so I wanted to publish my work to get some recognition (and something to put on a resume) and make my final product freely available to others who want to perform similar tasks.
    Since I do not know how to attach files, I will copy and paste my subroutines as if they were part of the same program.
    code removed
    These subroutines are actually organized into a couple of files and modules, but this code (I changed a few things in the first subroutine so this would be true) should actually run if you have psi3 installed and a template file in the correct directory called "psi3water".

    By far the most computational time is spent executing psi3--but there is not much I can do about that. It is doing some very complex calculations that are inevitably going to take considerable cpu time. When I read about threading and vectorization, I essentially thought they provided a way of executing as many independent tasks as the cpu could handle at once--thus providing a more efficient alternative to looping through those independent tasks.
    I apologize for my ignorance.

    Also,
    Versatility has been a goal for this project. I wanted to keep my program versatile enough to be able to fill in a wide variety of input files for a wide variety of programs and extract a wide variety of data from the output file.

      Okay. You might want to remove your code, because I do not think that there is much that we can do to help you.

      Beyond suggesting that you should use strict and -w, which would allow you to notice things like:

      open (INPUTFILE, ">$inputfile");
      ...
      close(INTPUTFILE);

      And that you should be using lexical file handles within your subroutines. As this is IO-bound code, there is not a lot that could be done to speed it up.
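      A minimal sketch of the lexical-filehandle style (the filename here is a placeholder): with strict in force, a misspelled lexical handle is a compile-time error, whereas the misspelled bareword INTPUTFILE above sails through silently.

```perl
use strict;
use warnings;

my $inputfile = 'psi3water.in';   # placeholder filename

# Three-argument open with a lexical handle; always check for failure.
open my $fh, '>', $inputfile
    or die "Cannot open '$inputfile': $!";
print $fh "sample input data\n";
close $fh
    or die "Cannot close '$inputfile': $!";
```

      Misspell $fh anywhere later and strict aborts compilation with "Global symbol ... requires explicit package name", instead of quietly leaving a file handle open.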

      I assume from your explanation that most of the CPU bound stuff is going on inside the psi3, and that is outside your remit to tune. Previously, from your mention of PDL, I thought that you were doing the complex math. All that leaves is a little relatively simple, IO-bound code that will not benefit from threading or PDL.

      Probably your best bet would be to look at Parallel::ForkManager and see if that fits with your aspirations. It would allow you to start your processes whilst managing how many run concurrently. If you think that module fits the bill, and you need help on how to use it, I'd suggest posting a new question with that in the title, as I have no experience of it myself.
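      The usual shape of a Parallel::ForkManager loop is something like the sketch below. This assumes the CPAN module is installed; run_psi3() and the list of parameters are hypothetical stand-ins for the real work, and the limit of 8 is a placeholder.

```perl
use strict;
use warnings;
use Parallel::ForkManager;   # CPAN module, not in core perl

# Hypothetical stand-in for one psi3 run over one parameter set:
# write the input file, system 'psi3', parse the output, etc.
sub run_psi3 {
    my( $params ) = @_;
    return;
}

my @all_inputparameters = ( 1 .. 20 );     # stand-in for your hundreds

my $pm = Parallel::ForkManager->new( 8 );  # at most 8 children at once
for my $params ( @all_inputparameters ) {
    $pm->start and next;    # parent: fork a child, move to the next job
    run_psi3( $params );    # child: do one unit of work
    $pm->finish;            # child: exit
}
$pm->wait_all_children;     # parent: wait for the stragglers
```

      The module does the same throttling as a hand-rolled fork/waitpid loop, but handles the bookkeeping for you.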


        Thank you. I removed the code. On the contrary, you saved me weeks of research. Although there is little I can do to make my code run faster, it is great to know that, so I can move on and look at other aspects! Also, that forking module seems very promising. I will absolutely look into it.
        Thanks again