perlmeditation
liz
Recently I've received a lot more ithreads-related questions, so I figured some background information might be useful. Some of these issues were already addressed about a year ago in [id://181655], but I think a recap is in order.
<P>
This is not a tutorial about how to use threads. It's more a tutorial about how to use threads well, once you have figured out that they may hold a solution to your particular need.
<P>
First of all, if you want to do anything for production use with Perl ithreads, you should get Perl 5.8.1 (or until then, one of the recent maintenance snapshots). There were several bugs in 5.8.0, one of which was a serious memory-eating bug when using shift() on a shared array. These are fixed in 5.8.1.
<P>
However, there are still a number of caveats that you should be aware of when you want to use Perl ithreads. It's better to know about these limitations before you put in a lot of work, only to find in the end that you don't have a machine big enough or fast enough to run your code in a production environment.
<P>
So what are these caveats? Basically it boils down to one statement.
<P>
<CENTER><H2>Perl ithreads are <I>not</I> lightweight!</H2></CENTER>
<readmore>
<P>
Unlike most other threads implementations that exist in the world, including the older perl 5.005 threads implementation, variables are by default <B>not</B> shared between Perl ithreads. So what does that mean? It means that every time you start a thread <B>all data structures are copied</B> to the new thread. And when I say all, I mean <B>all</B>. This includes, for example, package stashes, global variables and lexicals in scope. Everything! An example:
<code>
use threads ();
my $foo;
threads->new( sub {print "thread: scalar ref = ".\$foo."\n"} )->join;
print " main: scalar ref = ".\$foo."\n";
</code>
which prints this on my system:
<pre>
thread: scalar ref = SCALAR(0x1eefb4)
 main: scalar ref = SCALAR(0x107c90)
</pre>
This shows that the lexical scalar $foo was copied to the thread. Inside the thread the "same" lexical now lives at another address and can be changed at will inside the thread without affecting the lexical in the main program. But this copying takes place when the thread is started! Not, as you might expect, at the moment the value of the lexical is changed inside the thread (which is usually referred to as COW, or Copy On Write). So, even if you never use <code>$foo</code> inside the thread, it is copied, costing both CPU and memory. But it gets worse: the same applies to all other forms of data. One of them is code references, as shown in this example:
<code>
use threads ();
sub foo {1}
threads->new( sub {print "thread: coderef = ".\&foo."\n"} )->join;
print " main: coderef = ".\&foo."\n";
</code>
which prints on my system:
<pre>
thread: coderef = CODE(0x1deae4)
main: coderef = CODE(0x107c9c)
</pre>
The code references are different! So, did it copy the whole subroutine? I've been led to understand that the actual opcodes of subroutines are not copied (but I've been hesitant to check the Perl source code to actually confirm this, so I'll have to take the p5pers' word for it). But all the data around it, in this case the code reference in the package stash, <B>is</B> copied. Even if we never call <code>foo()</code> inside the thread!
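<P>
To see what the copying means in practice, here is a minimal sketch (the variable names are made up for illustration). The thread changes its own copy of an unshared array, and the main program never notices:
<code>
use strict;
use warnings;
use threads ();

my @data = (1 .. 5);    # ordinary (unshared) lexical

# @data is copied into the thread at the moment the thread is started
my $thread = threads->new( sub {
    push @data, 6;          # changes the thread's private copy only
    return scalar @data;    # number of elements as seen by the thread
} );
my ($seen_in_thread) = $thread->join;

print "thread saw $seen_in_thread elements\n";
print "main still has ".scalar(@data)." elements\n";
</code>
which should print:
<pre>
thread saw 6 elements
main still has 5 elements
</pre>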
<P>
<B>Shared variables?</B><BR>
But wait, you might say, shared variables may be a lot better. So why don't I make all variables in my application shared, so I won't suffer from this? Well, that is <B>wrong</B>. Why? Because shared variables aren't really shared at all. Shared variables are in fact ordinary tied variables (with all the caveats and performance issues associated with tied variables) that have some "magic" applied to them. So, not only do shared variables take up the same amount of memory as "normal" variables, they take up <B>extra</B> memory because of all the tie magic associated with them. This also means that you cannot have shared variables with your own tie-magic associated with them (unless you want to use my [cpan://Thread::Tie] module).
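<P>
For reference, this is roughly what explicit sharing looks like with the core <code>threads::shared</code> module. It is only a sketch, and the variable names and counts are made up for illustration. The unshared <code>$private</code> is copied into each thread, so the increments there never reach the main program. The shared <code>$counter</code> is visible to all threads and has to be locked while it is updated:
<code>
use strict;
use warnings;
use threads ();
use threads::shared;

my $private = 0;            # copied into each thread
my $counter : shared = 0;   # explicitly shared between threads

my @workers = map {
    threads->new( sub {
        for ( 1 .. 1000 ) {
            $private++;         # private copy of this thread only
            lock($counter);     # lock is released at the end of this block
            $counter++;
        }
    } );
} 1 .. 4;
$_->join for @workers;

print "private = $private\n";
print "counter = $counter\n";
</code>
which should print:
<pre>
private = 0
counter = 4000
</pre>
So share only what really needs to be seen by more than one thread.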
<P>
<B>Implications</B><BR>
So what does this mean if you want to use Perl ithreads in your application? Well, you want to prevent a lot of data from being copied when you start a thread. One way to achieve this would be to load modules only inside the threads, after the threads have started. But that's easier said than done. Observe the following code sample:
<code>
use threads ();
threads->new( sub {
    use Benchmark; # just an example module
    # do your Benchmark stuff
} )->join;
print "Benchmark has been loaded!\n" if defined $Benchmark::VERSION;
</code>
On casual observation, you might think that would do the trick. But alas, this prints:
<pre>
Benchmark has been loaded!
</pre>
even though the <code>use</code> statement is inside the subroutine with which the thread is started! That's because <code>use</code> is executed at compile time. And at compile time, Perl doesn't know anything about threads yet. Of course, there is a run-time equivalent to <code>use</code>. This example shows that the Benchmark module has indeed been loaded inside the thread only:
<code>
use threads ();
threads->new( sub {
    require Benchmark; Benchmark->import;
    # do your Benchmark stuff
} )->join;
print "Benchmark has not been loaded!\n" unless defined $Benchmark::VERSION;
</code>
which prints:
<pre>
Benchmark has not been loaded!
</pre>
Since I don't particularly like the <code>require module; module->import</code> idiom, I actually created the [cpan://Thread::Use] module that allows you to use the <code>useit module;</code> idiom.
<P>
However, the compile time issue of <code>use</code> also works the other way around. Observe this example:
<code>
use threads ();
threads->new( sub {
    print "Benchmark has been loaded!\n" if defined $Benchmark::VERSION;
    # do your Benchmark stuff
} )->join;
use Benchmark;
</code>
which prints:
<pre>
Benchmark has been loaded!
</pre>
Again, this is caused by <code>use</code> being executed at compile time, <B>before</B> the thread is started at execution time (even though it is listed later in the code). So even putting the <code>use</code> statements after the code that starts your threads is not going to help. More drastic measures are needed. If you do not want to have all the copying of data, you need to start your threads <B>before</B> modules are loaded. That is possible, thanks to <code>BEGIN {}</code>. Observe this example:
<code>
use threads ();
my $thread;
BEGIN { # execute this at compile time
    $thread = threads->new( sub {
        print "Benchmark has not been loaded!\n" unless defined $Benchmark::VERSION;
        # do your Benchmark stuff
    } );
}
use Benchmark;
$thread->join;
</code>
which prints:
<pre>
Benchmark has not been loaded!
Scalars leaked: 1
</pre>
Yikes! What is that? "Scalars leaked: 1". Well, yes, that's one of the remaining problems/features/bugs of the Perl ithreads implementation. It particularly seems to happen when you start threads at compile time. From practical experience, I must say it seems to be pretty harmless. And compared to all of the other "leaking" of memory that happens because data structures are copied, a single leaked scalar is presumably not a lot. And the message is probably in error in this case anyway.
<P>
<B>Tools for ithreads</B><BR>
So, is programming with Perl ithreads that bad? Well, if you expect lightweight threads as you would in other programming languages: yes. If you expect everything Perl to still be everything Perl even when you're using threads, Perl ithreads will do the trick. Just pay a little attention to when you start the threads and what gets loaded when and where, and you should in general be just fine. And there are some modules on CPAN to help you with the various approaches to threaded programming:
<DL>
<DT><B>[cpan://Thread::Pool]</B>
<DD>Start up a number of worker threads to which jobs can be assigned. Job results can be obtained individually if necessary, using the given job-ID. Parallel resolving of IP-numbers is a typical application for this approach.
<P>
<DT><B>[cpan://Thread::Queue::Monitored]</B>
<DD>Process values added to a queue. Related modules are [cpan://Thread::Queue::Any::Monitored] which allows any Storable data-structure processed through a queue, and an alternate implementation based on [cpan://Thread::Conveyor]: [cpan://Thread::Conveyor::Monitored]. Real-time logging of events is a typical application of this approach.
</DL>
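<P>
To give an idea of the worker approach without depending on the exact interface of the modules above, here is a minimal sketch using only the core <code>threads</code> and <code>Thread::Queue</code> modules. It is <I>not</I> the Thread::Pool API. The "job" (squaring a number) and the number of workers are made up for illustration. Note that the workers are started first, before anything else gets loaded:
<code>
use strict;
use warnings;
use threads ();
use Thread::Queue;

my $jobs    = Thread::Queue->new;
my $results = Thread::Queue->new;

# start the workers: each one blocks in dequeue() until a job arrives
my @workers = map {
    threads->new( sub {
        while ( defined( my $n = $jobs->dequeue ) ) {   # undef means "stop"
            $results->enqueue( $n * $n );               # the made-up job
        }
    } );
} 1 .. 3;

$jobs->enqueue( 1 .. 10 );              # ten jobs
$jobs->enqueue( (undef) x @workers );   # one stop marker per worker
$_->join for @workers;

# all workers are done, so collect whatever is in the result queue
my $sum = 0;
while ( defined( my $r = $results->dequeue_nb ) ) { $sum += $r }
print "sum of squares = $sum\n";
</code>
which should print:
<pre>
sum of squares = 385
</pre>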
<P>
<B>fork?</B><BR>
Now you may wonder why Perl ithreads didn't use fork()? Wouldn't that have made a lot more sense? Well, I wasn't involved in the thread design process at the time, so I have no idea what the exact reasons were. I can think of one particular reason, and that's the communication between threads, particularly for shared variables. It is especially hard to get the blocking right, where one thread is waiting for one or more other threads.
<P>
Not being hindered by the reasons for not using fork(), I developed a threads drop-in replacement called [cpan://forks]. Initially started as a pet project to see whether it would work at all, it became a bit more serious than that. forks.pm has the distinct advantage of being able to start a thread quickly. But that's just because it does a fork(), which in modern *nixes is very fast. The communication, blocking and shared variables are handled by a TCP connection between the threads, in which the process holding the shared variable values is the server, and all the other threads (including the "main" thread) are clients. What you win in a quickly starting thread, you lose in delays with communication. So if you're not passing around a lot of data between threads, forks.pm might be for you. And additionally, forks.pm has the advantage of not needing a thread-enabled Perl. In fact, it even runs on Perl 5.6.0!
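<P>
The trade-off can be illustrated with plain core Perl. The sketch below is <I>not</I> how forks.pm is implemented (forks.pm uses a TCP connection, this uses a simple pipe, and the message text is made up). It only shows the general idea: the child starts cheaply with fork(), but every bit of data that has to come back must travel over a channel between the processes:
<code>
use strict;
use warnings;

pipe( my $reader, my $writer ) or die "pipe failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ( $pid == 0 ) {      # child process
    close $reader;
    print {$writer} "result from child $$\n";
    close $writer;
    exit 0;
}

close $writer;          # parent process
my $line = <$reader>;   # blocks until the child has written its result
close $reader;
waitpid( $pid, 0 );     # reap the child
print "parent received: $line";
</code>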
<P>
<B>The future?</B><BR>
So what can we expect in the future for Perl 5 ithreads? Well, a COWed approach to shared variables is being considered for [http://www.poniecode.org|Ponie], but that's still at least a year or so in the future. And that doesn't seem to fix the non-shared data copying problem when a thread is started. And Perl 6, you may ask? It's not clear how threads are going to be accessed from the Perl 6 language, but [http://www.parrotcode.org|Parrot] seems to consider everything a [http://www.sidhe.org/~dan/blog/archives/000185.html|continuation]. And a thread is a special case of a continuation. So I think in Perl 6, things will be good from the start.
<P>
Liz
</readmore>