I would like to loop through (or vectorize) many hundreds of values for "inputparameters"

If you need to loop creating many external processes, neither threading nor PDL will give you any benefit with that. Those external processes will already make full use of whatever cores are available so threading won't help, and PDL has nothing to offer either.

Indeed, on any single system (single- or multi-core), you'll need to limit the number of concurrent processes or risk pushing your machine into 'context switch thrashing' or 'virtual memory thrashing', or both; either of which will significantly degrade your throughput rather than enhance it.

With the IO involved, you may be able to gain throughput by running more than one process per core, thereby utilising cycles that would otherwise be wasted waiting for IO to complete, but in any case, the breakpoint for throughput gains will likely be low multiples of the number of cores available. Maybe 8 or 16 concurrent processes on a 4 core system; or 16 or 32 on an 8 core.
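By way of illustration, capping the number of concurrent external processes needs nothing more than fork and waitpid. A minimal sketch (the 'myprog' invocation at the bottom is hypothetical; substitute your own command line):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run a list of external commands, capping how many run concurrently.
# Each element of @cmds is an arrayref: [ program, @args ].
sub run_capped {
    my( $max, @cmds ) = @_;
    my %kids;
    my $reaped = 0;
    for my $cmd ( @cmds ) {
        # At the cap? Reap one finished child before starting another.
        if( keys %kids >= $max ) {
            my $pid = waitpid -1, 0;
            delete $kids{ $pid } and ++$reaped;
        }
        defined( my $pid = fork ) or die "fork: $!";
        if( !$pid ) { exec @$cmd or die "exec @$cmd: $!" }
        $kids{ $pid } = 1;
    }
    # Reap the stragglers.
    while( keys %kids ) {
        my $pid = waitpid -1, 0;
        delete $kids{ $pid } and ++$reaped;
    }
    return $reaped;
}

# E.g. hundreds of invocations of a (hypothetical) 'myprog', 8 at a time:
# run_capped( 8, map [ 'myprog', "input$_.dat" ], 1 .. 500 );
```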

So, trying to start all the processes that create your "hundreds of small files" more quickly than a perl loop can start them doesn't make any sense.

If you can distribute your processes across multiple systems, then there are definite gains to be made. But again, "vectorising the loop" that starts those processes doesn't make sense.

The biggest gains will always come from adapting the algorithms internal to those processes. And to do that, I'd need to see considerably more information about what is happening within the processes. Not the actual algorithms necessarily--though that is always going to be the surest guide--but a fairly accurate representation of those algorithms. How much cpu-intensive work--math, regex, etc.--is involved? How much IO is involved? The proportions of those, and how they interleave.

For example, if the process spends 10 seconds reading input, 500 seconds cpu-intensively processing those inputs, and then 10 seconds writing output, then the scope for, and methods of, parallelisation are significantly different than if the processing is read-process-write, read-process-write, repeated.
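For the read-process-write, read-process-write case, one common shape is a boss/worker setup: a boss thread does the reading and feeds a Thread::Queue, while worker threads overlap the cpu-intensive part with the IO. A minimal sketch (process_record() here is a hypothetical stand-in for the real work):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical stand-in for the real cpu-intensive processing.
sub process_record { my( $rec ) = @_; return length $rec }

my $Q = Thread::Queue->new;

# Workers: dequeue records until they see the undef terminator.
my @workers = map {
    async {
        my $subtotal = 0;
        while( defined( my $rec = $Q->dequeue ) ) {
            $subtotal += process_record( $rec );
        }
        $subtotal;
    }
} 1 .. 4;

# The boss does the reading; workers overlap processing with that IO.
$Q->enqueue( $_ ) for map "record $_", 1 .. 100;
$Q->enqueue( (undef) x @workers );   ## one terminator per worker

my $total = 0;
$total += $_->join for @workers;
print "$total\n";
```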

By way of example of how the details of threading can influence the benefits derived, here are two subtly different methods of parallelising the same cpu-intensive code, and the resultant benefits:

In this first sample, the shared variable $total is added to directly by each thread as it runs the sub x() over its portion of the input ranges $v1 = 0 .. $N crossed with $v2 = 1 .. $N. Each time $total is added to, it has to be locked first:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];
use threads;
use Thread::Queue;

sub x {
    my( $str, $v1, $v2 ) = @_;
    my $result = 0;
    $result += ( ord substr $str, $_, 1 ) * $v1 / $v2
        for 0 .. length( $str ) - 1;
    return $result;
}

our $T //= 4;
our $N //= 100;

my $step = int( $N / $T );
my $now = time;
my $total :shared = 0;
my $str = 'the quick brown fox jumps over the lazy dog' x 100;

my( $start, $end ) = ( 0, $step );
my @threads = map {
    my $thread = async {
        for my $v1 ( $start .. $end ) {
            for my $v2 ( 1 .. $N ) {
                lock $total;                    ## Lock
                $total += x( $str, $v1, $v2 );  ## update
            }
        }
    };
    $start += $step + 1;
    $end = $start + $step < $N ? $start + $step : $N;
    $thread;
} 1 .. $T;

$_->join for @threads;

printf STDERR "With $T threads took %.6f seconds\n", time() - $now;
print $total;

__END__
c:\test>859242-2 -T=1 -N=100
With 1 threads took 14.250000 seconds
10711649268.1623

c:\test>859242-2 -T=2 -N=100
With 2 threads took 14.297000 seconds
10711649268.1623

c:\test>859242-2 -T=3 -N=100
With 3 threads took 14.476000 seconds
10711649268.1623

c:\test>859242-2 -T=4 -N=100
With 4 threads took 14.536000 seconds
10711649268.1623

The result is that rather than benefiting from being threaded across 1 to 4 cores, there is a small net loss of throughput with each new thread. This is due to 'lock contention'. The single shared variable becomes a bottleneck.

Whereas, in this example each thread adds to a local, non-shared variable $subtotal and only updates the shared variable $total once, at the end. This means that locking is substantially avoided and each thread is free to run flat out for the great majority of the time. The result is that each new thread improves throughput by a substantial proportion, up to the limit of the cores available:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];
use threads;
use Thread::Queue;

sub x {
    my( $str, $v1, $v2 ) = @_;
    my $result = 0;
    $result += ( ord substr $str, $_, 1 ) * $v1 / $v2
        for 0 .. length( $str ) - 1;
    return $result;
}

our $T //= 4;
our $N //= 100;

my $step = int( $N / $T );
my $now = time;
my $total :shared = 0;
my $str = 'the quick brown fox jumps over the lazy dog' x 100;

my( $start, $end ) = ( 0, $step );
my @threads = map {
    my $thread = async {
        my $subtotal = 0;
        for my $v1 ( $start .. $end ) {
            for my $v2 ( 1 .. $N ) {
                $subtotal += x( $str, $v1, $v2 );  ## no lock, local update
            }
        }
        lock $total;          ## lock
        $total += $subtotal;  ## update
    };
    $start += $step + 1;
    $end = $start + $step < $N ? $start + $step : $N;
    $thread;
} 1 .. $T;

$_->join for @threads;

printf STDERR "With $T threads took %.6f seconds\n", time() - $now;
print $total;

__END__
c:\test>859242-2 -T=1 -N=100
With 1 threads took 14.115000 seconds
10711649268.1623

c:\test>859242-2 -T=2 -N=100
With 2 threads took 7.061000 seconds
10711649268.1623

c:\test>859242-2 -T=3 -N=100
With 3 threads took 4.827000 seconds
10711649268.1624

c:\test>859242-2 -T=4 -N=100
With 4 threads took 3.732000 seconds
10711649268.1623

c:\test>859242-2 -T=5 -N=100
With 5 threads took 4.356000 seconds
10711649268.1623

As you can see, the gains from adding extra threads are not linear. Starting a thread costs time, and there is some Perl-internals contention involved, which also diminishes the gains. However, these are relatively short runs, so the ratio of those fixed costs to the processing time is quite high. Doing more work in each thread reduces the fixed proportion relative to the variable proportion, and so the benefits increase.

Same as the second above, but with wider ranges:

c:\test>859242-2 -T=1 -N=250
With 1 threads took 87.819000 seconds
78267011685.3423

c:\test>859242-2 -T=2 -N=250
With 2 threads took 43.342000 seconds
78267011685.3423

c:\test>859242-2 -T=3 -N=250
With 3 threads took 29.350000 seconds
78267011685.3423

c:\test>859242-2 -T=4 -N=250
With 4 threads took 23.113000 seconds
78267011685.3423

Go to -N=1000 and the gains step up again. Still not linear, but closer.
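To put a number on that fixed-versus-variable split: Amdahl's law models the runtime on n cores as T(n) = T(1) * ( (1-p) + p/n ), where p is the parallelisable fraction. Solving for p from the measured timings above (my back-of-envelope framing; the only inputs are the numbers already quoted):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Estimate the parallelisable fraction p from two measured timings,
# by inverting Amdahl's law: T(n) = T(1) * ( (1-p) + p/n ).
sub parallel_fraction {
    my( $t1, $tn, $n ) = @_;
    return ( 1 - $tn / $t1 ) / ( 1 - 1 / $n );
}

# p creeps up as the per-thread work grows, matching the observed gains.
printf "N=100: p = %.3f\n", parallel_fraction( 14.115, 3.732, 4 );
printf "N=250: p = %.3f\n", parallel_fraction( 87.819, 23.113, 4 );
```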

Now, in the scenario you portray, imagine if you could do away with the overhead of starting a new process for each of those hundreds of variations. What gains might be achievable then?

My point is that the broad brush strokes you painted that scenario in make it impossible to advise you in anything other than generalities. And with multiprocessing, small changes can make a huge difference to the outcome.

For us to really advise you effectively, you will need to fill in the details much more finely. It needn't be your actual code, but it will have to be something that closely approximates what your code is doing.

And as a final attempt to persuade you to expend the effort of sanitising your code--protecting your algorithm, but leaving enough of the true flavour of the actual processing that it replicates your run-times and IO/cpu ratios reasonably accurately--consider this: given 4 cores, the best improvement you can get from threading is 1/4 of the runtime, and you won't achieve that. If you have 25 systems with 4 cores each, you might hope to get (say) 1/75th.

But many times (here at PM and elsewhere), I've achieved or seen algorithms and/or implementations achieve 1/250th, or 1/1000th, of the runtime. That's pure perl, and without any threading or other parallelisation involved. And many of those improved algorithms/implementations could then be parallelised to further improve the throughputs, to give 1/250 * 1/4 or 1/1000 * 1/75 etc.
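As it happens, the toy benchmark above is itself a case in point: x() only ever returns (sum of the ords of $str) * $v1 / $v2, so the constant factors out and the whole N x N double loop collapses to two single sums--the same total (modulo floating-point rounding in the summation order) in milliseconds, no threads required. A sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = 'the quick brown fox jumps over the lazy dog' x 100;

# x( $str, $v1, $v2 ) == C * $v1 / $v2, where C is constant per $str...
my $C = 0;
$C += ord substr $str, $_, 1 for 0 .. length( $str ) - 1;

# ...so sum over v1,v2 of C*v1/v2 == C * (sum of v1) * (sum of 1/v2).
sub total_fast {
    my( $N ) = @_;
    my $sum_v1  = 0;  $sum_v1  += $_      for 0 .. $N;
    my $sum_iv2 = 0;  $sum_iv2 += 1 / $_  for 1 .. $N;
    return $C * $sum_v1 * $sum_iv2;
}

printf "%.4f\n", total_fast( 100 );   # matches the threaded runs above
```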

Unless you have a substantial HPC facility at your disposal, or can afford to throw your code into someone else's cloud, you won't achieve anything like those kinds of gains from parallelisation alone.

So, as is becoming something of a mantra for me, the more info you can provide, the better your chances of getting good answers. So more info please.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

In reply to Re^3: How to vectorize a subroutine--perhapse using PDL piddles or threads by BrowserUk
in thread How to vectorize a subroutine--perhapse using PDL piddles or threads by Feynman
