Recently I've received a lot more ithreads-related questions, so I figured some background information might be in order. Some of these issues were already addressed about a year ago in Status and usefulness of ithreads in 5.8.0, but I think a recap is useful.

This is not a tutorial about how to use threads. It's more a tutorial about how to use threads in a good way once you've figured out they may hold a solution to your particular need.

First of all, if you want to do anything for production use with Perl ithreads, you should get Perl 5.8.1 (or until then, one of the recent maintenance snapshots). There were several bugs in 5.8.0, including a serious memory-eating bug when using shift() on a shared array, which are now fixed in 5.8.1.

However, there are still a number of caveats that you should be aware of when you want to use Perl ithreads. It's better to realize these limitations before you put in a lot of work, only to find in the end that you don't have a machine big enough or fast enough to run your code in a production environment.

So what are these caveats? Basically it boils down to one statement.

Perl ithreads are not lightweight!

Unlike most other thread implementations, including the older Perl 5.005 threads implementation, variables are by default not shared between Perl ithreads. So what does that mean? It means that every time you start a thread, all data structures are copied to the new thread. And when I say all, I mean all. This includes, for example, package stashes, global variables, and lexicals in scope. Everything! An example:

    use threads ();
    my $foo;
    threads->new( sub {print "thread: coderef = ".\$foo."\n"} )->join;
    print "  main: coderef = ".\$foo."\n";

which prints this on my system:

    thread: coderef = SCALAR(0x1eefb4)
      main: coderef = SCALAR(0x107c90)

This shows that the lexical scalar $foo was copied to the thread. Inside the thread the "same" lexical now lives at another address and can be changed at will inside the thread without affecting the lexical in the main program. But this copying takes place when a thread is started! Not, as you might expect, at the moment the value of the lexical is changed inside the thread (which is usually referred to as COW, or Copy On Write). So, even if you never use $foo inside the thread, it is copied, taking up both CPU and memory. But it gets worse: the same applies to all other forms of data, code references being one of them (as shown in this example):

    use threads ();
    sub foo {1}
    threads->new( sub {print "thread: coderef = ".\&foo."\n"} )->join;
    print "  main: coderef = ".\&foo."\n";

which prints on my system:

    thread: coderef = CODE(0x1deae4)
      main: coderef = CODE(0x107c9c)

The code references are different! So, did it copy the whole subroutine? I've been led to understand that the actual opcodes of subroutines are not copied (but I've been hesitant to check the Perl source code to actually confirm this, so I'll have to take the p5pers' word for it). But all the data around it, in this case the code reference in the package stash, is copied. Even if we never call foo() inside the thread!
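
The copy-at-start behaviour described above can also be seen from the values themselves, not just the addresses. A minimal sketch, assuming a thread-enabled perl; the %config hash is a made-up example variable, not something from a real application:

```perl
use strict;
use warnings;
use threads;

# Hypothetical data, just to have something that gets copied.
my %config = ( mode => 'main' );

# The thread receives its own private copy of %config when it is started.
my $thread = threads->new( sub {
    $config{mode} = 'thread';    # changes only the thread's copy
    return $config{mode};
} );
my $seen_in_thread = $thread->join;

print "thread saw: $seen_in_thread\n";    # thread saw: thread
print "main sees:  $config{mode}\n";      # main sees:  main
```

The change made inside the thread never reaches the main program, because the two hashes have been separate data structures since the thread was created.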

Shared variables?
But wait, you might say, shared variables may be a lot better. So why don't I make all variables shared in my application so I won't suffer from this? Well, that is wrong. Why? Because shared variables in fact aren't shared at all. Shared variables are in fact ordinary tied variables (with all the caveats and performance issues associated with tied variables) that have some "magic" applied to them. So, not only do shared variables take up the same amount of memory as "normal" variables, they take up extra memory because of all the tie magic associated with them. This also means that you cannot have shared variables with your own tie magic associated with them (unless you want to use my Thread::Tie module).
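
For reference, this is roughly what declaring and using a shared variable looks like next to an ordinary one. A minimal sketch, assuming a thread-enabled perl with the core threads::shared module; the variable names and the numbers of threads and iterations are arbitrary choices for illustration:

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $counter :shared = 0;    # one value, visible to all threads
my $private         = 0;    # ordinary lexical, copied into each thread

my @workers = map {
    threads->new( sub {
        for ( 1 .. 100 ) {
            lock($counter);    # lock is released at the end of this block
            $counter++;
        }
        $private++;            # touches only this thread's own copy
    } );
} 1 .. 4;

$_->join for @workers;

print "shared counter: $counter\n";    # shared counter: 400
print "private copy:   $private\n";    # private copy:   0
```

Every access to $counter goes through the sharing machinery (and here also through a lock), which is the extra cost mentioned above, so it pays to share only what really has to be shared.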

Implications
So what does this mean if you want to use Perl ithreads in your application? Well, you want to avoid copying a lot of data when you start a thread. One way to achieve this would be to load modules only inside the threads, after the threads have started. But that's easier said than done. Observe the following code sample:

    use threads ();
    threads->new( sub {
        use Benchmark;  # just an example module
        # do your Benchmark stuff
    } )->join;
    print "Benchmark has been loaded!\n" if defined $Benchmark::VERSION;

On casual observation, you might think that would do the trick. But alas, this prints:

    Benchmark has been loaded!

even though the use statement sits inside the subroutine with which the thread is started! That's because use is executed at compile time. And at compile time, Perl doesn't know anything about threads yet. Of course, there is a run-time equivalent to use. This example shows that the Benchmark module has indeed been loaded inside the thread only:

    use threads ();
    threads->new( sub {
        require Benchmark; Benchmark->import;
        # do your Benchmark stuff
    } )->join;
    print "Benchmark has not been loaded!\n" unless defined $Benchmark::VERSION;

which prints:

    Benchmark has not been loaded!

Since I don't particularly like the require Module; Module->import idiom, I actually created the Thread::Use module, which allows you to use the useit Module; idiom instead.
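
The compile-time versus run-time difference has nothing to do with threads as such, and can be checked on any perl by looking at %INC. A minimal sketch; Text::ParseWords and Time::Local are just two core modules picked for illustration, and neither subroutine is ever called:

```perl
use strict;
use warnings;

# "use" is executed while this file is compiled, even inside an uncalled sub.
sub with_use     { use Text::ParseWords; }

# "require" only runs if and when the sub is actually called.
sub with_require { require Time::Local; Time::Local->import; }

my $use_loaded     = exists $INC{'Text/ParseWords.pm'} ? 1 : 0;
my $require_loaded = exists $INC{'Time/Local.pm'}      ? 1 : 0;

print "Text::ParseWords loaded: $use_loaded\n";        # 1
print "Time::Local loaded:      $require_loaded\n";    # 0
```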

However, the compile time issue of use also works the other way around. Observe this example:

    use threads ();
    threads->new( sub {
        print "Benchmark has been loaded!\n" if defined $Benchmark::VERSION;
        # do your Benchmark stuff
    } )->join;
    use Benchmark;

which prints:

    Benchmark has been loaded!

Again, this is caused by use being executed at compile time, before the thread is started at execution time (even though it is listed later in the code). So even putting the use statements after starting your threads is not going to help. More drastic measures are needed. If you do not want to have all the copying of data, you need to start your threads before modules are loaded. That is possible, thanks to BEGIN {}. Observe this example:

    use threads ();
    my $thread;
    BEGIN { # execute this at compile time
        $thread = threads->new( sub {
            print "Benchmark has not been loaded!\n" unless defined $Benchmark::VERSION;
            # do your Benchmark stuff
        } );
    }
    use Benchmark;
    $thread->join;

which prints:

    Benchmark has not been loaded!
    Scalars leaked: 1

Yikes! What is that? "Scalars leaked: 1". Well, yes, that's one of the remaining problems/features/bugs of the Perl ithreads implementation. This particularly seems to happen when you start threads at compile time. From practical experience, I must say it seems to be pretty harmless. And compared to all of the other "leaking" of memory that happens because data structures are copied, a single leaked scalar is presumably not a lot. And the error message is probably in error in this case anyway.

Tools for ithreads
So, is programming with Perl ithreads that bad? Well, if you expect lightweight threads as you would in other programming languages: yes. If you expect everything Perl to still be everything Perl even when you're using threads, Perl ithreads will do the trick. Just pay a little attention to when you start the threads and what gets loaded when and where, and you should in general be just fine. And there are some modules on CPAN to help you with the various approaches to threaded programming:

Thread::Pool
Start up a number of worker threads to which jobs can be assigned. Job results can be obtained individually if necessary, using the given job-ID. Parallel resolving of IP-numbers is a typical application for this approach.

Thread::Queue::Monitored
Process values added to a queue. Related modules are Thread::Queue::Any::Monitored, which allows any Storable data structure to be processed through a queue, and an alternate implementation based on Thread::Conveyor: Thread::Conveyor::Monitored. Real-time logging of events is a typical application of this approach.
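
To give an idea of the worker approach these modules take care of, here is a minimal sketch that uses only the core Thread::Queue module. It is not the Thread::Pool or Thread::Queue::Monitored API, it has no job-IDs, and the job itself (squaring a number) and the number of workers are made up for illustration. It assumes a thread-enabled perl:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $jobs      = Thread::Queue->new;    # work to be done
my $results   = Thread::Queue->new;    # answers coming back
my $n_workers = 3;

my @workers = map {
    threads->new( sub {
        # undef is used here as a "no more work" marker
        while ( defined( my $n = $jobs->dequeue ) ) {
            $results->enqueue( $n * $n );
        }
    } );
} 1 .. $n_workers;

$jobs->enqueue( 1 .. 10 );
$jobs->enqueue( (undef) x $n_workers );    # one stop marker per worker
$_->join for @workers;

# All workers are done, so everything is in the results queue by now.
my $sum = 0;
while ( defined( my $r = $results->dequeue_nb ) ) {
    $sum += $r;
}
print "sum of squares: $sum\n";    # sum of squares: 385
```

Note that the workers are started before any jobs are queued, so the data being worked on is never part of what gets copied at thread start.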

fork?
Now you may wonder why Perl ithreads didn't use fork(). Wouldn't that have made a lot more sense? Well, I wasn't involved in the thread design process at the time, so I have no idea what the exact reasons were. I can think of one particular reason, and that's the communication between threads, particularly for shared variables, and especially getting the blocking right, where one thread is waiting for one or more other threads.

Not being hindered by the reasons for not using fork(), I developed a threads drop-in replacement called forks. Initially started as a pet project to see whether it would work at all, it became a bit more serious than that. forks.pm has the distinct advantage of being able to start a thread quickly. But that's just because it does a fork(), which in modern *nixes is very fast. The communication, blocking and shared variables are handled by a TCP connection between the threads, in which the process holding the shared variable values is the server, and all the other threads (including the "main" thread) are clients. What you win in a quickly starting thread, you lose in delays with communication. So if you're not passing around a lot of data between threads, forks.pm might be for you. And additionally, forks.pm has the advantage of not needing a thread-enabled Perl. In fact, it even runs on Perl 5.6.0!
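
The trade-off can be illustrated with plain fork(). This is only a sketch of the general idea and not how forks.pm is implemented: it uses a pipe instead of a TCP connection to keep things short, it assumes a platform with a real fork(), and the "work" done by the child is made up:

```perl
use strict;
use warnings;

# Communication channel between the two processes.
pipe( my $reader, my $writer ) or die "pipe failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ( $pid == 0 ) {
    # Child process: plays the role of the "thread".
    close $reader;
    print {$writer} 6 * 7, "\n";    # result has to be serialized and sent
    close $writer;
    exit 0;
}

# Parent process: plays the role of the "main thread".
close $writer;
my $answer = <$reader>;             # blocks until the child has written
close $reader;
waitpid( $pid, 0 );                 # the equivalent of join
chomp $answer;

print "child computed: $answer\n";    # child computed: 42
```

Starting the child is cheap, but every value that crosses the process boundary has to be written to and read from a channel, which is where the communication delay comes from.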

The future?
So what can we expect in the future for Perl 5 ithreads? Well, a COWed approach to shared variables is being considered for Ponie, but that's still at least a year or so in the future. And that doesn't seem to fix the non-shared data copying problem when a thread is started. And Perl 6, you may ask? It's not clear how threads are going to be accessed from the Perl 6 language, but Parrot seems to consider everything a continuation. And a thread is a special case of a continuation. So I think in Perl 6, things will be good from the start.

Liz


In reply to Things you need to know before programming Perl ithreads by liz