Re: Sharing large data structures between threads
by anonymized user 468275 (Curate) on Mar 07, 2011 at 17:23 UTC
An idea I have often toyed with in such situations is to daemonise the data management. So one process does nothing but insert, update, delete and fetch from the structure. The main program does an IPC::Open2 or IPC::Open3 on the data management process and the threads or forks from the main program share the filehandles to communicate with the data management process but not its memory. For example
# data manager (short separate program)
use strict;
use warnings;

my %deepthing01;    # the managed structure lives only in this process

while ( my $line = <STDIN> ) {
    chomp $line;
    my ( $command, $value, $address ) = split /\t/, $line;
    if ( $command eq 'insert' or $command eq 'update' ) {
        # e.g. $deepthing01{ fred }{ bert }{ pete }[99] = "fred"
        eval "$address = $value;";
    }
    elsif ( $command eq 'delete' ) {
        eval "delete $address;";
    }
    else {    # fetch: print the value back to the parent
        eval "print $address, \"\\n\";";
    }
}
# and the main program:
use IPC::Open2;

# open2 takes the read handle first, then the write handle
my $pid = open2( my $rh, my $wh, 'DataManager.pl' );
...
print $wh "insert\t\"fred\"\t"
    . '$deepthing01{ fred }{ bert }{ pete }[99]' . "\n";
...
Thank you for your reply!
Your idea made me think: what if the data management was "daemonised" to a single thread instead. I would have my data structure alive in one thread only and post requests for read/write to it just as you suggested. I could even make the requests Perl code that is eval()'d by the management thread.
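A minimal sketch of that single-manager-thread idea, using only the core threads and Thread::Queue modules (the queue names and the command format here are invented for illustration): one thread owns the hash, and everyone else posts requests through queues instead of touching it.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $requests  = Thread::Queue->new;    # workers -> manager
my $responses = Thread::Queue->new;    # manager -> client (one client here)

# The manager thread is the ONLY code that touches %store.
my $manager = threads->create( sub {
    my %store;
    while ( defined( my $req = $requests->dequeue ) ) {
        my ( $op, $key, $value ) = @$req;
        if    ( $op eq 'set' )    { $store{$key} = $value }
        elsif ( $op eq 'get' )    { $responses->enqueue( $store{$key} ) }
        elsif ( $op eq 'delete' ) { delete $store{$key} }
    }
} );

# Client side: post requests instead of touching the structure directly.
$requests->enqueue( [ set => fred => 99 ] );
$requests->enqueue( [ get => 'fred' ] );
my $got = $responses->dequeue;    # blocks until the manager answers
$requests->enqueue(undef);        # tell the manager to shut down
$manager->join;
print "fred=$got\n";
```

Thread::Queue clones enqueued structures into shared memory, so the worker threads never hold a full copy of the managed hash, only the small request and response records.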
I would suggest that the requests, if you pursue that idea, “should merely be descriptive.” If the manner in which they are (presently...) implemented is by very clever use of eval, then so be it ... but don’t scatter that implementation detail throughout all of the client code. One day, you might decide that this approach is not so Tom Swifty after all, and if this be so, it should be easily and (ahem...) “Swiftly” replaceable without impacting much source code that is hither, thither and yon.
OT: suggestion (Re: Sharing large data structures between threads)
by roboticus (Chancellor) on Mar 07, 2011 at 17:57 UTC
This post is somewhat off topic, as you specifically requested responses discussing sharing large data structures between multiple threads.
Generally when things get this big, I turn to a database to do the locking and caching. If you use an SQL database, use transactions to handle the locking, and let the database+OS figure out which items to keep in memory. The DBM style databases probably work well for this, too. (I wouldn't know, as I don't normally use them.)
Of course, you'd need to rewrite chunks of your program to do it this way, which is a disadvantage. If you get stuck, though, and wind up rewriting bits of it, I'd suggest making some functions that abstract away the data structure so you can put in a choice of back ends to handle the storage (memory, database, flat file, etc.).
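A minimal sketch of that abstraction (the function names and the in-memory back end are made up for illustration): client code only ever calls the accessor functions, so the hash behind them could later be swapped for a DBM file or DBI calls without touching the callers.

```perl
use strict;
use warnings;

# The back end du jour: a plain hash. Replacing this with a tied DBM
# hash or a database handle would not change any client code below.
my %memory_backend;

sub store_set { my ( $k, $v ) = @_; $memory_backend{$k} = $v; return }
sub store_get { my ($k) = @_; return $memory_backend{$k} }
sub store_del { my ($k) = @_; delete $memory_backend{$k}; return }

# Client code sees only the abstract interface.
store_set( 'job42', 'pending' );
print store_get('job42'), "\n";    # pending
store_del('job42');
```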
...roboticus
When your only tool is a hammer, all problems look like your thumb.
Re: Sharing large data structures between threads
by hermida (Scribe) on Mar 07, 2011 at 17:57 UTC
Just to throw out a possible option; I don't know if it's right for your particular needs, but I want to help out. In the past I had a somewhat similar situation where I was doing parallel processing using forks and wanted to share large data structures across them that were too big to fit in my available RAM.
After experimenting with various IPC libraries on CPAN I went with Kyoto Cabinet: it's the fastest DBM out there to my knowledge, written in C++, and it fully supports a multithreaded environment. From the docs:
Functions of API are reentrant and available in multi-thread environment. Different database objects can be operated in parallel entirely. For simultaneous operations against the same database object, rwlock (reader-writer lock) is used for exclusion control. That is, while a writing thread is operating an object, other reading threads and writing threads are blocked. However, while a reading thread is operating an object, reading threads are not blocked. Locking granularity depends on data structures. The hash database uses record locking. The B+ tree database uses page locking.
In order to improve performance and concurrency, Kyoto Cabinet uses such atomic operations built in popular CPUs as atomic-increment and CAS (compare-and-swap). Lock primitives provided by the native environment such as the POSIX thread package are alternated by own primitives using CAS.
Download, build and install the source core library and then the Perl API bindings. There are quite a few options on how you want it built so please check out ./configure --help.
Hope it is fast enough for your needs; it's not as fast as RAM, but it solves a lot of other problems.
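A hedged sketch of the Kyoto Cabinet Perl binding as its documentation describes it (the file name is arbitrary; the module is non-core, so this sketch falls through quietly when it isn't installed rather than die):

```perl
use strict;
use warnings;

my $value;
if ( eval { require KyotoCabinet; 1 } ) {
    my $db = KyotoCabinet::DB->new;
    # OWRITER|OCREATE: open read-write, creating the file if needed
    $db->open( 'casket.kch', $db->OWRITER | $db->OCREATE )
        or die $db->error;
    $db->set( 'fred', 99 ) or die $db->error;
    $value = $db->get('fred');
    $db->close;
}
else {
    $value = 99;    # placeholder path keeps the sketch self-contained
}
print "fred=$value\n";
```

Because the database lives in a file rather than in each interpreter, every thread or forked child sees the same data without each holding its own copy in RAM.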
I read that KyotoCabinet is not thread-safe with ithreads (threads, threads::shared), since ithreads are not really threads but just process emulation.
Maybe try Coro instead of ithreads. Coro threads are real threads and the fastest, most reliable threading model in Perl, and this should work with KyotoCabinet. You might get a performance boost as well :)
since ithreads are not really threads but just process emulation.
Utter bollocks.
Re: Sharing large data structures between threads
by Anonymous Monk on Mar 08, 2011 at 11:04 UTC
Thank you all for your replies and discussion!
I went ahead and implemented the one-thread data management queue. Much to my surprise, however, it didn't result in any perceived change in memory usage! It looks like Perl does some intelligent things in copying shared structures, probably something like Copy-On-Read.
Because of this, I then expected my long-term memory usage to go down. But after 10 hours of run-time that didn't seem to happen either...
I assume my usage pattern of the structure made it resistant to memory bloat. The structure is a kind of work queue that threads take work items from. They go through the structure periodically to check for valid elements, but this accesses only a small part of the data. When the rest of the data is read, it is discarded shortly after.
Now it seems the worst cause of bloat are the modules copied for each thread. I have minimized the number of use'd modules, but they still amount to about 5 MB per thread. With 200 threads this uses half the available 32-bit address space...
Optimizing modules goes out of the scope of my question, so I'm not going to dwell on that further. Suffice it to say that I have options there as I am using LWP::UserAgent which can be replaced by other modules like LWP::Parallel or HTTP::Async or even my own socket code.
Now it seems the worst cause of bloat are the modules copied for each thread. I have minimized the number of use'd modules, but they still amount to about 5 MB per thread. With 200 threads this uses half the available 32-bit address space...
5MB per kernel thread is hardly huge for an interpreted language.
You made it clear in your OP that you don't want to hear this which is why I haven't responded directly earlier, but your memory problems are a direct result of the bad architecture of your code.
Simply stated, there is no point in running 200 concurrent threads. You mention LWP, so I infer from that your program is doing some sort of downloading. If you have a 100mb/s connection, by running 200 concurrent threads, you've effectively given each thread the equivalent of a dial-up connection. So, in addition to chewing up more memory, throwing threads at your download task simply means that each thread takes longer and longer to complete its individual download.
In addition to your listed alternative possibilities of LWP::Parallel and HTTP::Async, there is another: using a sensible number of threads in a thread pool arrangement. For this type of IO-bound processing, that usually works out to be somewhere in the order of 2x to 5x the number of cores your box has. For today's typical hardware that means 8 to 20 threads will give you the best throughput.
Using your own numbers, that will need < 100MB of memory, leaving at least 1.9GB available for your data.
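The pool arrangement described above can be sketched with the core threads and Thread::Queue modules (the pool size and the fake job list are invented for illustration; real workers would do an LWP fetch where the comment indicates):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $POOL_SIZE = 4;    # "2x to 5x cores", not 200
my $jobs      = Thread::Queue->new;
my $results   = Thread::Queue->new;

# A small, fixed pool of workers draining an arbitrarily long job queue.
my @workers = map {
    threads->create( sub {
        while ( defined( my $url = $jobs->dequeue ) ) {
            # real code would fetch $url with LWP here; we fake it
            $results->enqueue("fetched $url");
        }
    } );
} 1 .. $POOL_SIZE;

$jobs->enqueue("http://example.com/page$_") for 1 .. 20;
$jobs->enqueue(undef) for @workers;    # one shutdown marker per worker
$_->join for @workers;

print $results->pending, " results\n";    # 20 results
```

The memory cost is fixed by the pool size, not by the length of the work list, which is the point being made about 8 to 20 threads versus 200.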
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
If you have a 100mb/s connection, by running 200 concurrent threads, you've effectively given each thread the equivalent of a dial-up connection.
If all the threads were doing active work (like downloading), I would have no problems. The problems arose when some servers time out on me. I must use long timeouts (at least 5 minutes), and I have to retry the operations with not too much of a delay. This is what I have tried to counter by running a larger number of threads (100 is the maximum for now).
That of using a sensible number of threads in a thread pool arrangement.
But then I cannot use LWP. In fact, I do use a thread pool at the moment; my worker threads can pick up any kind of work (an LWP fetch or a database fetch). The current worker thread code is very straightforward because it blocks on all I/O. To do non-blocking I/O, I will have to process several work items simultaneously, making the code quite messy. But I understand I have to go that route.
It still disturbs me that you have two hundred threads... but, since you have specifically excluded consideration of this aspect of the system in your OP, I suppose that it cannot be reconsidered. But if it could...
The Windows memory-manager is known to be quite “lazy” and sometimes this characteristic produces memory-consumption that is quite larger than you intuitively suppose it should be. I also suspect that it over-states the “amount used.”
My instincts still tell me – nay, they fairly scream at me – that you are going to have to find a way to reduce the total number of threads that are active at one time. It has consistently been my experience over many years that a limited (and fairly small) number of threads should be put to work against an arbitrary (and perhaps quite large) number of things-to-do. In the Windows environments, this seems to be particularly important.
Re: Sharing large data structures between threads
by locked_user sundialsvc4 (Abbot) on Mar 09, 2011 at 14:38 UTC
Okay then, if the number of threads is excluded as a matter of discussion, what about the roles of these various threads? Are they all simply clones of one another? Would it be possible, for example, to delegate one thread that actually owns the data-structure, and that provides information about it to the other processes on-request?
I really get nervous about “highly pervasive” code changes, like rewriting the code to use one-dimensional hashes and so forth. Even though they might “solve” the problem, they tend to introduce a lot of instability into the code. Per contra, the “arkane magick” of accessing the data-structure could perhaps be spun off into a single object, which encapsulates all of the necessary incantations for actually getting to the thing (or perhaps to a relevant slice of it). That would allow you to utter some pretty darned outrageous “spells,” and since you are only uttering them in one place, to still keep the code maintainable.
The notion of using tie is a good one, because the tie mechanism is pretty darned magickal already. I would definitely explore that idea further.
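A toy illustration of the tie idea (the class name and its plain-hash storage are invented): callers use an ordinary-looking hash, while the tied class is the single place that hides the "arkane magick" of the real storage, which could equally forward to a manager thread or a database.

```perl
use strict;
use warnings;

package HiddenStore;

# The incantations live here and only here.
sub TIEHASH { my ($class) = @_; return bless { data => {} }, $class }
sub FETCH   { my ( $self, $k ) = @_; return $self->{data}{$k} }
sub STORE   { my ( $self, $k, $v ) = @_; $self->{data}{$k} = $v }
sub DELETE  { my ( $self, $k ) = @_; delete $self->{data}{$k} }
sub EXISTS  { my ( $self, $k ) = @_; exists $self->{data}{$k} }

package main;

tie my %deep, 'HiddenStore';
$deep{fred} = 99;                  # looks like a plain hash to callers
print "fred=$deep{fred}\n";
```

Swapping the back end later means rewriting HiddenStore, not the code scattered hither, thither and yon.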
The Windows memory-manager is known to be quite “lazy” and sometimes this characteristic produces memory-consumption that is quite larger than you intuitively suppose it should be.
My original problems started with "Out of memory" errors caused by Perl having eaten its 32-bit address space... Yes, I did set the thread stack_size ages ago.
Would it be possible, for example, to delegate one thread that actually owns the data-structure, and that provides information about it to the other processes on-request?
Yes, I tried this. The results were not as good as I expected. See a few posts up.
I have now been working on using HTTP::Async to reduce the number of threads. After several fixes to HTTP::Async itself, I am starting to see improvement. While I am pretty happy with the results, I do regret the additional layer of complexity I had to incorporate into my script.
To BrowserUk: I fully appreciate your comments but I don't think it would be fruitful to discuss the application design here. For starters, if starting ground-up now, I would most certainly do many things differently. :-)
Strictly speaking, by “lazy” I simply meant that it does not hurry-up to reclaim memory until it is truly pressured to do so.
“It is,” as a friend of mine would once diplomatically put it, “suboptimal ...”
(I think he was secretly a Vulcan.)
Nevertheless, and (of course) knowing nothing at all about the particulars of this application, I do think that you are compelled to find a practical way to “throttle” the maximum number of threads that exist, and to use some kind of a queue to absorb the moment-to-moment fluctuations in the incoming workload. Not an ideal situation, perhaps, but I see little available choice in the matter.