Feynman has asked for the wisdom of the Perl Monks concerning the following question:

I have a subroutine I am very satisfied with. Its arguments are a string and a short list, say, (var1,var2). After parsing some templates, making some input files, running some programs and parsing some output files, it spits out a floating point number.

Now I want to make a three dimensional plot of the mathematical function this subroutine represents as a function of var1 and var2. This requires that I loop through hundreds of values of var1 and var2. I read about PDL--which implements "vectorization" using "threads" and "piddles"--all completely new terms to me. I do not understand how they work (and yes, I have read all the perldocs on them several times).

I was wondering if there was some way to implement this vectorization technique--or something similar--as a more efficient alternative to for loops.

I am not currently using more than one cpu, but I have access to more on a network (just not at my home computer), so any parallel advice would be appreciated as well.

Re: How to vectorize a subroutine--perhaps using PDL piddles or threads
by BrowserUk (Patriarch) on Sep 08, 2010 at 02:55 UTC

    Be aware! "PDL threading" has nothing whatever to do with threads. They do not, for example, run concurrently. They do not allow you to make use of multiple cores. A really unfortunate choice of term IMO.

    Whether you can use PDL to vectorise your algorithm will depend very much upon what the algorithm does.

    PDL allows you to apply a calculation to a whole array 'simultaneously'. So, for example, instead of coding:

    my $result = 0;
    for my $i ( 0 .. length( $str ) - 1 ) {
        $result += ord( substr $str, $i, 1 ) * 3 / 4;
    }
    return $result;

    it might allow you to code (something like--I've forgotten whatever little PDL syntax I once knew):

    my $pdl = pdl( map ord, split //, $str );   # byte values as a piddle
    my $result = sum( $pdl * 3 / 4 );

    In essence, this is simply moving the explicit for loop in the first example into C code. It is still looped over, but runs more quickly because the loop is in C code, and the data is stored as a C-style array (in contiguous memory space), which carries far less overhead than perl scalars.
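    The same loop-in-C effect can be seen with nothing but core Perl (a sketch, not PDL itself): unpack 'C*' extracts every byte value in one C-level pass, and List::Util's sum() iterates in C too, so the per-character Perl opcode dispatch disappears:

```perl
use strict;
use warnings;
use List::Util qw( sum );

my $str = 'the quick brown fox jumps over the lazy dog';

# Explicit Perl loop: one round of opcode dispatch per character.
my $slow = 0;
$slow += ord( substr $str, $_, 1 ) * 3 / 4 for 0 .. length( $str ) - 1;

# The same calculation with the loop pushed into C: unpack 'C*'
# yields all byte values at once, and sum() adds them in C.
my $fast = sum( unpack 'C*', $str ) * 3 / 4;

printf "%.2f %.2f\n", $slow, $fast;   # identical results
```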

    So firstly, both your data and algorithm have to lend themselves to being piddled. Secondly, you'll need to re-write them to use piddles.

    But it is not "vectorisation" in the floating point coprocessor or graphics card sense. With FPU (SSE3 opcodes), multiple data items (up to say 256) are operated on concurrently. With GPUs (as found on graphics cards), thousands of data values can be processed concurrently. PDL does not do this. It operates on one data value at a time. The speed up comes from moving the loops into C.

    If you have a system available to you that has multiple cores, it would be trivial to have N copies of your existing subroutine running concurrently using threads. This should give you close to a time/N speed up.
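    A minimal sketch of that approach with ithreads and Thread::Queue: queue up the (var1, var2) grid, then let N identical workers pull jobs until it drains. The compute() body is a hypothetical stand-in for your real subroutine:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Stand-in for the OP's expensive subroutine (names are assumptions).
sub compute {
    my ( $v1, $v2 ) = @_;
    return $v1 * $v2;
}

# Queue up the whole (var1, var2) grid.
my $q = Thread::Queue->new;
for my $v1 ( 1 .. 10 ) {
    for my $v2 ( 1 .. 10 ) {
        $q->enqueue( [ $v1, $v2 ] );
    }
}
$q->end;   # workers' dequeue returns undef once the queue drains

# N identical workers, each pulling jobs until the queue is empty.
my @workers = map {
    threads->create( sub {
        my @results;
        while ( defined( my $job = $q->dequeue ) ) {
            push @results, [ @$job, compute( @$job ) ];
        }
        return @results;
    } );
} 1 .. 4;

my @all = map { $_->join } @workers;
print scalar @all, " results computed\n";
```

    Each worker accumulates its own results and hands them back through join, so no locking is needed while the work runs.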

    If you're seeking to spread your load across multiple systems, you'll need something more than either PDL or threads--something along the lines of Parallel::MPI. But you need to install/develop appropriate software on each system you hope to utilise and convert your existing subroutine to MPI. This is a distinctly non-trivial process to set up.

    That is a very brief run through of the options. A full discussion would require a book :). If you were to post the subroutine, or something similar if you are protective of your algorithm, then it might be possible to make a clearer assessment of what might help you.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Note from THE FUTURE: In 2022, PDL "threading" was renamed to "broadcasting" in line with industry practice and to remove that confusion. However, its name was chosen with the idea that broadcasting would in fact be processed in parallel (as noted in 11144099). In 2011, much more automatic support for such parallel processing was added; in 2021 this was enabled by default for as many cores as found on the system.
      Wow--thank you for such a thorough response! I learned a lot from it! MPI is installed on the parallel computer, I just have not yet learned how to use it. I am going to save parallelization for last.

      Should I fail to find (as I have so far) a module/program/script that is as good or better than what I develop, I am hoping to publish my program under the GNU GPL or some similar open-source licence, so that others can use and modify my code while I still receive some recognition--something to put on a resume. Therefore, I am a bit reluctant to put the full code up--thank you for understanding. On the other hand, I could really use the help...so how about pseudocode?
      sub mysub {
          create a hash with keywords, inputparameters = @_;
          replace all keywords in some file with their respective inputparameters -> inputfile_name.ext;
          `some_program inputfile_name.ext outputfile`;
          return the number next to some string in "outputfile";
      }
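      A hedged sketch of that pseudocode in real Perl. To keep it self-contained, the file handling and the external program are abstracted into a $runner callback; the 'energy = ...' output pattern, the keyword syntax, and every name here are assumptions, not the actual code:

```perl
use strict;
use warnings;

# Hypothetical reworking of the pseudocode above. The $runner callback
# receives the filled-in input text and returns the external program's
# output; in real use it would write inputfile_name.ext, run
# `some_program inputfile_name.ext outputfile`, and slurp outputfile.
sub mysub {
    my ( $template, $params, $runner ) = @_;

    # Replace every keyword in the template with its input parameter.
    my $input = $template;
    $input =~ s/\Q$_\E/$params->{$_}/g for keys %$params;

    my $output = $runner->( $input );

    # Return the number next to some known string in the output
    # ('energy' is an assumed marker, not the OP's actual string).
    return $output =~ /energy\s*=\s*(-?\d+(?:\.\d+)?)/ ? $1 : undef;
}

# Fake runner standing in for the external program:
my $val = mysub(
    'compute VAR1 VAR2',
    { VAR1 => 2, VAR2 => 3 },
    sub { my $in = shift; "energy = 6.5\n" },
);
print "got $val\n";   # got 6.5
```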

      That is the gist of it. It is of course broken up into a few subroutines and they are written in two separate files, but if I condensed everything into one subroutine, that is basically what it would do. I think I have the details of making, storing, and parsing those files down pretty well (and what I do not have does not seem particularly hard to code up.)

      I would like to loop through (or vectorize) many hundreds of values for "inputparameters"--thus creating hundreds of small temporary files (and I would delete them when done of course). If you need more details I think it would be well worth providing it, but if you can work with that, I would prefer it.
        I would like to loop through (or vectorize) many hundreds of values for "inputparameters"

        If you need to loop creating many external processes, neither threading nor PDL will give you any benefit with that. Those external processes will already make full use of whatever cores are available so threading won't help, and PDL has nothing to offer either.

        Indeed, on any single system (single or multi-core), you'll need to limit the number of concurrent processes or risk pushing your machine into 'context switch thrashing' or 'virtual memory thrashing', or both. Either of which will significantly degrade your throughput, rather than enhancing it.

        With the IO involved, you may be able to gain throughput by running more than one process per core, thereby utilising cycles that would otherwise be wasted waiting for IO to complete, but in any case, the breakpoint for throughput gains will likely be low multiples of the number of cores available. Maybe 8 or 16 concurrent processes on a 4 core system; or 16 or 32 on an 8 core.
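        A minimal fork-based throttle along those lines, capping concurrency at a fixed limit. The short sleep stands in for running the external program; in real use the child would exec it:

```perl
use strict;
use warnings;

my $MAX = 4;       # a low multiple of the core count, per the above
my %kids;

for my $job ( 1 .. 20 ) {
    # Block until we are below the concurrency limit.
    if ( keys %kids >= $MAX ) {
        my $done = wait;
        delete $kids{$done};
    }
    defined( my $pid = fork ) or die "fork: $!";
    if ( $pid == 0 ) {
        # Child: this is where the external program would run,
        # e.g. exec 'some_program', $infile, $outfile;
        select undef, undef, undef, 0.05;   # stand-in for real work
        exit 0;
    }
    $kids{$pid} = 1;
}

# Reap the stragglers.
while ( keys %kids ) {
    my $done = wait;
    delete $kids{$done};
}
print "all 20 jobs done\n";
```

        The %kids hash tracks live children, so the loop never has more than $MAX external processes in flight at once.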

        So, trying to start all the processes that create your "hundreds of small files" more quickly than a perl loop can start them doesn't make any sense.

        If you can distribute your processes across multiple systems, then there are definite gains to be made. But again, "vectorising the loop" that starts those processes doesn't make sense.

        The biggest gains will always come from adapting the algorithms internal to those processes. And to do that, I'd need to see considerably more information about what is happening within the processes. Not the actual algorithms necessarily--though that is always going to be the surest route--but a fairly accurate representation of those algorithms. How much cpu-intensive work--math, regex, etc.--is involved? How much IO is involved? The proportions of those, and how they interleave.

        For example, if the process spends: 10 seconds reading input, 500 seconds cpu intensively processing those inputs, and then 10 seconds writing output; then the scope for, and methods for parallelisation are significantly different than if the processing is read-process-write, read-process-write.
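        A quick way to measure that split, assuming Time::HiRes: bracket each phase of the job with timestamps and report the proportions. The three phase bodies here are placeholders for your real read/process/write steps:

```perl
use strict;
use warnings;
use Time::HiRes qw( time );

# Hypothetical phase timing: wrap each phase of the job and report
# the IO/CPU proportions that drive the parallelisation strategy.
my %elapsed;
sub phase {
    my ( $name, $code ) = @_;
    my $t0 = time;
    $code->();
    $elapsed{$name} += time - $t0;
}

phase( read    => sub { my @input = ( 1 .. 1000 ) } );
phase( process => sub { my $x = 0; $x += sqrt $_ for 1 .. 100_000 } );
phase( write   => sub { my $out = join "\n", 1 .. 1000 } );

printf "%-8s %.4fs\n", $_, $elapsed{$_} for sort keys %elapsed;
```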

        By way of example of how the details of threading can influence the benefits derived, here are two subtly different methods of parallelising the same cpu-intensive code, and the resultant benefits:

        In this first sample, the shared variable $total is added to directly by each thread as it runs the sub x(), over its portion of the input ranges $v1 = 1 .. 100 X $v2 = 1 .. 100. Each time $total is added to, it has to be locked first:

        #! perl -slw
        use strict;
        use Time::HiRes qw[ time ];
        use threads;
        use Thread::Queue;

        sub x {
            my( $str, $v1, $v2 ) = @_;
            my $result = 0;
            $result += ( ord substr $str, $_, 1 ) * $v1 / $v2
                for 0 .. length( $str ) - 1;
            return $result;
        }

        our $T //= 4;
        our $N //= 100;
        my $step = int( $N / $T );

        my $now = time;
        my $total :shared = 0;
        my $str = 'the quick brown fox jumps over the lazy dog' x 100;

        my( $start, $end ) = ( 0, $step );
        my @threads = map {
            my $thread = async {
                for my $v1 ( $start .. $end ) {
                    for my $v2 ( 1 .. $N ) {
                        lock $total;                    ## Lock
                        $total += x( $str, $v1, $v2 );  ## update
                    }
                }
            };
            $start += $step + 1;
            $end = $start + $step < $N ? $start + $step : $N;
            $thread;
        } 1 .. $T;

        $_->join for @threads;
        printf STDERR "With $T threads took %.6f seconds\n", time() - $now;
        print $total;
        __END__
        c:\test>859242-2 -T=1 -N=100
        With 1 threads took 14.250000 seconds
        10711649268.1623

        c:\test>859242-2 -T=2 -N=100
        With 2 threads took 14.297000 seconds
        10711649268.1623

        c:\test>859242-2 -T=3 -N=100
        With 3 threads took 14.476000 seconds
        10711649268.1623

        c:\test>859242-2 -T=4 -N=100
        With 4 threads took 14.536000 seconds
        10711649268.1623

        The result is that rather than benefiting from being threaded across 1 to 4 cores, there is a small net loss of throughput with each new thread. This is due to 'lock contention'. The single shared variable becomes a bottleneck.

        Whereas, in this example each thread adds to a local, non-shared variable $subtotal and only updates the shared variable $total occasionally. This means that locking is substantially avoided and each thread is free to run flat out for the great majority of the time. The result is that each new thread improves throughput by a substantial proportion, up to the limit of the cores available:

        #! perl -slw
        use strict;
        use Time::HiRes qw[ time ];
        use threads;
        use Thread::Queue;

        sub x {
            my( $str, $v1, $v2 ) = @_;
            my $result = 0;
            $result += ( ord substr $str, $_, 1 ) * $v1 / $v2
                for 0 .. length( $str ) - 1;
            return $result;
        }

        our $T //= 4;
        our $N //= 100;
        my $step = int( $N / $T );

        my $now = time;
        my $total :shared = 0;
        my $str = 'the quick brown fox jumps over the lazy dog' x 100;

        my( $start, $end ) = ( 0, $step );
        my @threads = map {
            my $thread = async {
                my $subtotal = 0;
                for my $v1 ( $start .. $end ) {
                    for my $v2 ( 1 .. $N ) {
                        $subtotal += x( $str, $v1, $v2 );  ## no lock, local update
                    }
                }
                lock $total;          ## lock
                $total += $subtotal;  ## update
            };
            $start += $step + 1;
            $end = $start + $step < $N ? $start + $step : $N;
            $thread;
        } 1 .. $T;

        $_->join for @threads;
        printf STDERR "With $T threads took %.6f seconds\n", time() - $now;
        print $total;
        __END__
        c:\test>859242-2 -T=1 -N=100
        With 1 threads took 14.115000 seconds
        10711649268.1623

        c:\test>859242-2 -T=2 -N=100
        With 2 threads took 7.061000 seconds
        10711649268.1623

        c:\test>859242-2 -T=3 -N=100
        With 3 threads took 4.827000 seconds
        10711649268.1624

        c:\test>859242-2 -T=4 -N=100
        With 4 threads took 3.732000 seconds
        10711649268.1623

        c:\test>859242-2 -T=5 -N=100
        With 5 threads took 4.356000 seconds
        10711649268.1623

        As you can see, the gains through adding extra threads are not linear. Starting a thread costs time, and there is some Perl-internals contention involved, which also diminishes the gains. However, these are relatively short runs, so those fixed costs are high relative to the processing time. Doing more work in each thread reduces the fixed proportion relative to the variable proportion, and so the benefits increase.

        Same as the second above, but with wider ranges:

        c:\test>859242-2 -T=1 -N=250
        With 1 threads took 87.819000 seconds
        78267011685.3423

        c:\test>859242-2 -T=2 -N=250
        With 2 threads took 43.342000 seconds
        78267011685.3423

        c:\test>859242-2 -T=3 -N=250
        With 3 threads took 29.350000 seconds
        78267011685.3423

        c:\test>859242-2 -T=4 -N=250
        With 4 threads took 23.113000 seconds
        78267011685.3423

        Go to -N=1000 and the gains step up again. Still not linear, but closer.

        Now, in the scenario you portray, imagine if you could do away with the overhead of starting a new process for each of those hundreds of variations. What gains might be achievable then?

        My point is that the broad brush strokes you painted that scenario in make it impossible to advise you in anything other than generalities. And with multiprocessing, small changes can make a huge difference to the outcome.

        For us to really advise you effectively, you will need to fill in the details much more finely. It needn't be your actual code, but it will have to be something that closely approximates what your code is doing.

        And as a final attempt to persuade you to expend the effort of sanitising your code--protecting your algorithm, but leaving enough of the true flavour of the actual processing that it replicates your run-times and IO/cpu ratios reasonably accurately--consider this: given 4 cores, the best improvement you can get from threading is 1/4 of the runtime, and you won't achieve even that. If you have 25 systems with 4 cores each, you might hope to get (say) 1/75th.

        But many times (here at PM and elsewhere), I've achieved or seen algorithms and/or implementations achieve 1/250th, or 1/1000th of the runtime. That's pure perl, and without any threading or other parallelisation involved. And many of those improved algorithms/implementations could then be parallelised to further improve the throughputs, to give 1/250 * 1/4 or 1/1000 * 1/75 etc.

        Unless you have a substantial HPC facility at your disposal, or can afford to throw your code into someone else's cloud, you won't achieve anything like those kinds of gains from parallelisation alone.

        So, as is becoming something of a mantra for me, the more info you can provide, the better your chances of getting good answers. So more info please.



        On the face of it that looks like it's an I/O bound process so paralleling things up on one system probably won't get you any further ahead. On the other hand, if you have access to a network of computers and you can divvy the work up between machines then maybe you can get some benefit that way.

        Of course if you can avoid file I/O you may be able to get things going much faster and then threading or forking can take advantage of a multi-core system. It depends a whole bunch on what you are actually doing.
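        For example, if the external program can be persuaded to read its input on STDIN and write its results to STDOUT, the temporary files disappear entirely. A sketch using the core IPC::Open2 module, with sort -n standing in for the real program (which may or may not support this mode):

```perl
use strict;
use warnings;
use IPC::Open2;

# 'sort -n' stands in for the external program; the point is that
# input and output travel through pipes, not temporary files.
my $pid = open2( my $out, my $in, 'sort', '-n' );

print $in "$_\n" for 3, 1, 2;   # would be the generated input text
close $in;                      # signal end-of-input to the program

my @lines = <$out>;
waitpid $pid, 0;
print @lines;                   # 1, 2, 3 -- each on its own line
```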

        True laziness is hard work