Re^3: Parallel processing on Windows
by Marshall (Canon) on Sep 20, 2022 at 23:10 UTC
|
I recently wrote some simple multi-process code at Re^7: Multiprocess - child process cannot be finished successfully. Yes, the "fake" Windows PID from a Perl fork is negative, and my wait statement accounts for that. The code shown does run on Windows and should also run on Unix. This demo code worked better than I thought it would, meaning that the sleep in each sub-process worked and the sub-processes did not interfere with each other. I am not sure how sleep() is implemented on Windows, but it worked better than expected!
However, it sounds like just using threads is the best way for your compute bound project. The less data you share between threads (ideally nothing that is r/w), the better. Threads can get complicated if there is a lot of sharing going on.
This is a code fragment from some code years ago... The program has about 70,000 input strings. For each input string, it is desired to know which of the other input strings are "close enough" according to some complex rules. For each input, a regex is generated that is run against all other inputs. This is an NxN algorithm. For 70K inputs it took ~1.5 hours. I have a 4 core machine. Running 4 threads, the speedup was something like 3.8x (can't get to exactly 4.0, but that is a very good result). So anyway execution time went to ~20 minutes and that was "good enough" and I stopped improving things.
Anyway see below for an example of parallelizing a number cruncher job.
### This is a non-runnable code fragment
### only to show a general idea of threads pulling work
### from a common input queue and putting results on a
### common output queue.
use strict;
use warnings;
use threads;
use Thread::Queue;

# Global data for all threads, treated as read-only. It must be filled
# before the threads are created, because each thread gets its own copy.
my @all_inputs;

my $workQueue = Thread::Queue->new();
my $doneQueue = Thread::Queue->new();

### Worker threads #####
my $thread_limit = 4;
my @threads;
push @threads, threads->create(sub { DoWork($workQueue, $doneQueue) })
    for 1 .. $thread_limit;

foreach my $input (@all_inputs)   # init the work queue with all inputs
{
    $workQueue->enqueue($input);
}
# "work finished" markers: each thread will give up when it sees an undef
$workQueue->enqueue(undef) for 1 .. $thread_limit;
$workQueue->end();

$_->join() for @threads;   # waits for all threads to finish!
print "END of threading...\n";

#### Get results off of Queue
my @results;
while (defined(my $result = $doneQueue->dequeue_nb()))
{
    push(@results, $result);   # results are references to arrays (AoA)
}

sub DoWork
{
    my ($workQueue, $doneQueue) = @_;
    # An undef from dequeue() ends this thread's job!! Testing with
    # defined() keeps inputs such as "0" or "" from ending it early.
    while (defined(my $input = $workQueue->dequeue()))
    {
        if ($input =~ m|/|)
        {
            $doneQueue->enqueue([$input, 'SKIPPING THIS ENTRY!']);
            next;
        }
        my $regex = get_regex_patterns($input);
        $regex =~ s/\(|\)//g;   # captured values not needed, only yes/no
        my @matches = grep { m/$regex/ and $_ ne $input } @all_inputs;
        push(@matches, '') if @matches == 0;
        $doneQueue->enqueue([$input, @matches]);
    }
}
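Since the fragment above is not runnable as posted, here is a minimal runnable sketch of the same work-queue/result-queue pattern. It assumes a perl built with ithreads and a Thread::Queue recent enough to provide end(). The is_close() helper and the sample inputs are hypothetical stand-ins for the real "close enough" rules, which are not shown in this thread.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical stand-in for the real matching rules: two strings are
# "close" when they start with the same letter.
sub is_close {
    my ($x, $y) = @_;
    return substr($x, 0, 1) eq substr($y, 0, 1);
}

# Filled before the threads start, so every thread gets its own copy.
my @all_inputs = qw(apple avocado banana blueberry cherry);
my $work_queue = Thread::Queue->new();
my $done_queue = Thread::Queue->new();

sub worker {
    # dequeue() blocks until an item arrives and returns undef once
    # the queue has been ended and is empty.
    while (defined(my $input = $work_queue->dequeue())) {
        my @matches = grep { $_ ne $input and is_close($_, $input) } @all_inputs;
        $done_queue->enqueue([$input, @matches]);
    }
}

my @threads = map { threads->create(\&worker) } 1 .. 4;
$work_queue->enqueue(@all_inputs);
$work_queue->end();            # no more work will be added
$_->join() for @threads;       # wait for all workers to finish

# Collect the results without blocking; all workers are done by now.
my %close_to;
while (defined(my $result = $done_queue->dequeue_nb())) {
    my ($input, @matches) = @$result;
    $close_to{$input} = [sort @matches];
}
print "$_: @{ $close_to{$_} }\n" for sort keys %close_to;
```

The threads are created once and reused for every input, so the cost of cloning the interpreter is paid only four times regardless of how many inputs there are.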
Re^3: Parallel processing on Windows
by eyepopslikeamosquito (Archbishop) on Sep 22, 2022 at 08:16 UTC
|
Back in my Unix days I wrote a complete TCP server in Perl! worked like a champ. Sucks that Windows doesn't have a fork/kill/wait ...
Note that you can write network servers in Perl, that work fine on both Unix and Windows,
without forking and without threads, simply by taking an event-driven approach via IO::Select.
Here's a complete working example of one I used for testing Syslog a while back: Test Syslog Server
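To illustrate the event-driven idea, here is a minimal sketch of a single-process server loop built on IO::Select. It is not the linked Test Syslog Server. The upper-casing "protocol", the OS-assigned port, and the in-process client are assumptions made only so that the sketch is self-contained and stops by itself.

```perl
use strict;
use warnings;
use IO::Select;
use IO::Socket::INET;

# Listen on localhost; port 0 asks the OS for any free port.
my $listener = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,
    Listen    => 5,
    Proto     => 'tcp',
) or die "cannot listen: $!";

# A client in the same process, only so the demo terminates on its own.
my $client = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => $listener->sockport(),
    Proto    => 'tcp',
) or die "cannot connect: $!";
syswrite($client, "hello\n");
$client->shutdown(1);    # done writing; the server will see end-of-file

my $select = IO::Select->new($listener);
my @received;
while ($select->count()) {
    my @ready = $select->can_read(5) or last;    # give up after 5 idle seconds
    for my $fh (@ready) {
        if ($fh == $listener) {
            # New connection: watch it alongside the listener.
            $select->add($listener->accept());
        }
        else {
            my $bytes = sysread($fh, my $buffer, 4096);
            if ($bytes) {
                push @received, $buffer;
                syswrite($fh, uc $buffer);       # reply in upper case
            }
            else {
                # End-of-file or error: drop the client. A real server
                # would keep listening; this demo stops instead.
                $select->remove($fh, $listener);
                close $fh;
            }
        }
    }
}

my $reply = <$client>;
print "server saw: ", join('', @received);
print "client got: $reply";
```

One loop serves every connection in turn, so no fork and no threads are needed, and the same code runs on Unix and Windows.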
Re^3: Parallel processing on Windows
by BernieC (Pilgrim) on Sep 20, 2022 at 19:27 UTC
|
I am looking at the threads modules (threads and threads::shared) and it's not encouraging. The threads module comes with the warning
The "interpreter-based threads" provided by Perl are not the fast, lightweight system for multitasking that one might expect or hope for. Threads are implemented in a way that make them easy to misuse. Few people know how to use them correctly or will be able to provide help.
The use of interpreter-based threads in perl is officially discouraged.
But the doc doesn't say what you should do about the discouragement. Should I give it a try, or is there something else/newer for this? Or is this just kinda impossible in Perl on Windows...
|
Why are threads "discouraged"? which also links to Trying to Understand the Discouragement of Threads.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
|
GNU parallel has never failed me (under Unix, of course). It is a Perl script using threads and Thread::Queue.
Reading (diagonally) the long discussion cited by choroba as to why the word "discouraged" was used, I did not find real arguments except perhaps that threads::shared at some time could not handle the cloning of deep, complex data structures to be shared (as I understand it, now it works) and also that you may not be able to find help.
For me Corion's Re: Parallel processing on Windows suggestion served me well for all my parallel needs. I have used How to create thread pool of ithreads (the posts by BrowserUK in there) as my starting point.
There is also marioroy's MCE which I have never used. It looks solid. See Reusable threads demo on how it is used as an alternative to the threads + Thread::Queue paradigm.
bw bliako
Edit: Another point in the long discussion mentioned above is performance of a thread-enabled perl and also the overheads of creating a new thread. The latter is mostly irrelevant when you follow the model of a pool of workers (the threads' queue) where a number of threads (workers) are created once and then keep processing your data queue. If you don't keep re-creating threads then this point is irrelevant mostly. Then you have the performance of a perl compiled to enable threads which can be really hindered by the various locks put in place to protect you against race conditions etc. in a potentially threaded environment. That penalty is irrespective of whether you use threads or not, it is whether you want Perl to be able to run threads.
|
Reading (diagonally) the long discussion cited by choroba as to why the word "discouraged" was used, I did not find real arguments except perhaps that
All the long discussions now are just fork users trolling threads users.
|
I don't really like the other explanations here, so let me try too:
Most people think of "threads" as additional execution points running around in the same code and same data as each other. Perl does not offer that option. And, actually I'm glad it doesn't, because in the Java and C++ I've written that does true "threading" it is extremely easy to introduce bugs when touching the same data structures. Getting "threading" right is massively complicated and requires rigorous design principles and IMHO has no place in a quick-and-easy scripting language.
What Perl does offer as "ithreads" is a lot more like fork/wait. When you start an ithread, it clones the current perl program (but within the existing address space, creating a new parallel interpreter for the clone), executes in parallel, and then passes data back to the main program. You can do the same thing by creating a pipe, forking, running things, and writing the result through a pipe to the parent. ithreads make this convenient; but there are also perl modules that make fork/serialize/wait convenient.
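The pipe/fork/write-back approach described above can be sketched in a few lines. The "work" (summing 1 to 1000) and the one-line text serialization are assumptions chosen for brevity; real results would more likely be serialized with a module such as Storable or a JSON encoder.

```perl
use strict;
use warnings;

pipe(my $reader, my $writer) or die "pipe failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: do the work and serialize the result as one line of text.
    close $reader;
    my $sum = 0;
    $sum += $_ for 1 .. 1000;
    print {$writer} "$sum\n";
    close $writer;
    exit 0;
}

# Parent: read the serialized result, then reap the child.
close $writer;
my $result = <$reader>;
close $reader;
waitpid($pid, 0);
chomp $result;
print "child computed $result\n";    # prints "child computed 500500"
```

Each side closes the end of the pipe it does not use, so the parent sees end-of-file as soon as the child has finished writing.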
So, what are the decision points for choosing ithreads vs. fork/wait?
- Perl ithreads clone the entire interpreter. Not just the parts you need in the thread, but the whole interpreter, which can be massive if you use big frameworks. On Linux, when you fork, the operation of cloning memory happens lazily on demand.
- Using Perl ithreads keeps all the data in the same memory address space, so it is theoretically faster to move results back to the main interpreter. On Linux, with fork, you have to serialize results to bytes, through a pipe, and de-serialize. However, usually in my limited experience, result data is small compared to input data, so Linux fork() probably still wins.
- Enabling Perl ithreads in a build of Perl makes the whole interpreter run slower, even when threads aren't used. (for technical reasons) Performance-focused Linux users prefer ithreads to be disabled for the speed boost.
- On Windows, fork() *is* an ithread, because Windows doesn't have fork. In fact, this is the only reason ithreads were added to perl, because they already existed to support fake Windows forking.
- Windows fork() has bad side-effects that you would not expect if you were familiar with fork from Linux. For instance, file handles are shared between parent and child. If the forked child closes a file handle, the parent loses it too.
Summing it up,
- If you are on Windows and your program will only ever run on Windows, you might as well use ithreads because they are simpler than fork() and give the same result.
- If you are on Linux, compile your perl without ithreads so that it runs faster, and use perl modules to make forking/collecting data easier.
- If you want your program to run in multiple environments, use the forking perl modules, because not all perls have ithreads enabled, and the special modules usually do something more efficient than "clone everything" when starting a new worker.
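Because not every perl is built with ithreads, a portable program can check at run time before choosing. A minimal sketch using the core Config module follows. The selection rule simply mirrors the summary above, and the $strategy variable is a hypothetical name used only for illustration.

```perl
use strict;
use warnings;
use Config;

# $Config{useithreads} is set only when this perl was built with ithreads.
my $has_ithreads = $Config{useithreads} ? 1 : 0;
my $on_windows   = $^O eq 'MSWin32' ? 1 : 0;

# On Windows fork() is emulated with ithreads anyway, so use them
# directly; elsewhere prefer a real fork().
my $strategy = $on_windows ? 'ithreads' : 'fork';

print "ithreads compiled in: ", ($has_ithreads ? "yes" : "no"), "\n";
print "suggested strategy:   $strategy\n";
```

The same information is available from the command line with perl -V:useithreads.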
|
That's a nice overview (++). There are a couple of statements presented as facts which on inspection seem not to be.
However, usually result data is small compared to input data
I would certainly agree with "sometimes", but "usually" without any citation seems just to be an opinion. Perhaps the problem space in which you most work has such a feature but it would be surprising to find it to be universally (or even broadly) true.
Linux users prefer ithreads to be disabled for the speed boost.
Linux users who care about the speed boost at the expense of flexibility prefer ithreads to be disabled for the speed boost. The rest of us don't.
I'm a Linux user and am quite happy to use threads. The interface is pretty slick and for some scenarios, threads are a perfect fit. In others, forked processes are more appropriate and in those scenarios I'm happy enough to use fork instead. Horses for courses.
|
Ignore that shit. It literally came about because unix users got tired of not helping with threading questions. People who never tried threads agreed on a scary warning: https://www.nntp.perl.org/group/perl.perl5.porters/2014/03/msg213382.html
It's self-admitted FEARmongering from irc burnouts -- we're tired of trying to discourage folks from using threads and how to use threads on the irc .... let's scare them in the documentation .... it doesn't belong
Well see Re^2: Splitting large array for threads. and follow deep, basically using threads has caveats so we'll misuse the "discouraged" label ... dumb
Subject: PATCH add discouragement warning to perl threads documentation
The common reactions to someone asking for help with threads even in #p5p
being: "You're doing it wrong!" or "You have brain damage!" This commit
attempts to reduce the number of such incidences by putting a huge warning
on the threads documentation that should discourage all but the most
determined.