Re: Sharing large data structures between threads

Just to throw out a possible option: I don't know if it's right for your particular needs, but I want to help out. In the past I had a somewhat similar situation where I was doing parallel processing using forks and wanted to share large data structures across them that were too big to fit in my available RAM.

After experimenting with various IPC libraries on CPAN, I went with KyotoCabinet. To my knowledge it's the fastest DBM out there; it is written in C++ and fully supports a multithreaded environment. From the docs:

Functions of API are reentrant and available in multi-thread environment. Different database objects can be operated in parallel entirely. For simultaneous operations against the same database object, rwlock (reader-writer lock) is used for exclusion control. That is, while a writing thread is operating an object, other reading threads and writing threads are blocked. However, while a reading thread is operating an object, reading threads are not blocked. Locking granularity depends on data structures. The hash database uses record locking. The B+ tree database uses page locking.

In order to improve performance and concurrency, Kyoto Cabinet uses such atomic operations built in popular CPUs as atomic-increment and CAS (compare-and-swap). Lock primitives provided by the native environment such as the POSIX thread package are alternated by own primitives using CAS.
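
To make the reader-writer exclusion described in the quoted docs concrete, here is a minimal sketch in Python (used only for illustration). It is not Kyoto Cabinet's implementation or API: the `RWLock` class and its method names are hypothetical, and it uses a plain condition variable rather than CAS-based primitives. It only shows the rule that readers may overlap while a writer excludes everyone else.

```python
import threading


class RWLock:
    """Hypothetical reader-writer lock; not Kyoto Cabinet's implementation."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of threads currently reading
        self._writing = False  # True while one thread holds the write lock

    def acquire_read(self):
        with self._cond:
            # Readers wait only for an active writer, never for other readers.
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            # A writer waits until there is no other writer and no reader.
            while self._writing or self._readers:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()


lock = RWLock()
store = {"hits": 0}   # stands in for the shared database object
seen = []             # values observed by the readers


def writer(n):
    for _ in range(n):
        lock.acquire_write()
        try:
            store["hits"] += 1
        finally:
            lock.release_write()


def reader(n):
    for _ in range(n):
        lock.acquire_read()
        try:
            seen.append(store["hits"])
        finally:
            lock.release_read()


threads = [threading.Thread(target=writer, args=(1000,)) for _ in range(4)]
threads += [threading.Thread(target=reader, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(store["hits"])
```

Every increment happens under the write lock, so the final count is 4 x 1000 = 4000. This simple version lets a steady stream of readers starve writers; a production lock would need a fairness policy.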

Download, build, and install the core library from source, then the Perl API bindings. There are quite a few build options, so please check out ./configure --help.

I hope it is fast enough for your needs. It's not as fast as RAM, but it solves a lot of other problems.

Re^2: Sharing large data structures between threads by hermida (Scribe) on Mar 07, 2011 at 18:48 UTC
I read that KyotoCabinet is not thread-safe for ithreads (threads, threads::shared), since ithreads are not really threads but just process emulation. Maybe try Coro instead of ithreads. Coro threads are real threads and the fastest, most reliable threading model in Perl, and this should work with KyotoCabinet. You might also get a performance boost :)
Re^3: Sharing large data structures between threads by Corion (Patriarch) on Mar 07, 2011 at 19:00 UTC
Coro does not bring you any parallelism. You can at most use one CPU when using Coro (discounting external processes).
Re^4: Sharing large data structures between threads by hermida (Scribe) on Mar 07, 2011 at 19:19 UTC
My apologies, I haven't used Coro myself, but if you are right then the documentation is very misleading, as it does appear to provide multithreading and to be a replacement for threads and threads::shared. Consider Audrey Tang's review of Coro: "Wow, I can't believe no-one had reviewed this module. In short, this is what Perl Threads should work like. After the fragile-but-fast Perl 5.5 threading thesis, and the slow-but-reliable Perl 5.6 ithreading antithesis, this is the perfect synthesis that gives you a fast and reliable threading model. Highly recommended." One would think so, no? Sorry for my ignorance.
Re^5: Sharing large data structures between threads by Corion (Patriarch) on Mar 07, 2011 at 19:47 UTC
Re^6: Sharing large data structures between threads by hermida (Scribe) on Mar 07, 2011 at 19:54 UTC
Re^3: Sharing large data structures between threads by BrowserUk (Patriarch) on Mar 07, 2011 at 19:14 UTC
"since ithreads are not really threads but just process emulation." Utter bollocks.
Re^4: Sharing large data structures between threads by hermida (Scribe) on Mar 07, 2011 at 19:35 UTC
Again, further down in the Coro docs:
WINDOWS PROCESS EMULATION
A great many people seem to be confused about ithreads (for example, Chip Salzenberg called me unintelligent, incapable, stupid and gullible, while in the same mail making rather confused statements about perl ithreads (for example, that memory or files would be shared), showing his lack of understanding of this area - if it is hard to understand for Chip, it is probably not obvious to everybody).
What follows is an ultra-condensed version of my talk about threads in scripting languages given on the perl workshop 2009:
The so-called "ithreads" were originally implemented for two reasons: first, to (badly) emulate unix processes on native win32 perls, and secondly, to replace the older, real thread model ("5.005-threads").
It does that by using threads instead of OS processes. The difference between processes and threads is that threads share memory (and other state, such as files) between threads within a single process, while processes do not share anything (at least not semantically). That means that modifications done by one thread are seen by others, while modifications by one process are not seen by other processes.
The "ithreads" work exactly like that: when creating a new ithreads process, all state is copied (memory is copied physically, files and code is copied logically). Afterwards, it isolates all modifications. On UNIX, the same behaviour can be achieved by using operating system processes, except that UNIX typically uses hardware built into the system to do this efficiently, while the windows process emulation emulates this hardware in software (rather efficiently, but of course it is still much slower than dedicated hardware).
As mentioned before, loading code, modifying code, modifying data structures and so on is only visible in the ithreads process doing the modification, not in other ithread processes within the same OS process.
This is why "ithreads" do not implement threads for perl at all, only processes. What makes it so bad is that on non-windows platforms, you can actually take advantage of custom hardware for this purpose (as evidenced by the forks module, which gives you the (i-) threads API, just much faster).
Sharing data is in the i-threads model is done by transfering data structures between threads using copying semantics, which is very slow - shared data simply does not exist. Benchmarks using i-threads which are communication-intensive show extremely bad behaviour with i-threads (in fact, so bad that Coro, which cannot take direct advantage of multiple CPUs, is often orders of magnitude faster because it shares data using real threads, refer to my talk for details).
As summary, i-threads use threads to implement processes, while the compatible forks module uses processes to emulate, uhm, processes. I-threads slow down every perl program when enabled, and outside of windows, serve no (or little) practical purpose, but disadvantages every single-threaded Perl program.
This is the reason that I try to avoid the name "ithreads", as it is misleading as it implies that it implements some kind of thread model for perl, and prefer the name "windows process emulation", which describes the actual use and behaviour of it much better.
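
The distinction the quoted passage draws (a change made by a thread is seen by the rest of the program, a change made by a separate process is not) can be sketched outside Perl. The following is a minimal Python illustration, assuming a Unix system where `os.fork` is available. It says nothing about how ithreads are implemented; the names `state` and `bump` are made up for the example.

```python
import os
import threading

state = {"count": 0}


def bump():
    state["count"] += 1


# A thread shares the parent's memory, so its change is visible here.
t = threading.Thread(target=bump)
t.start()
t.join()
after_thread = state["count"]

# A forked child process works on its own copy of memory (Unix only).
pid = os.fork()
if pid == 0:
    bump()        # changes only the child's copy
    os._exit(0)   # leave the child without running the parent's code
os.waitpid(pid, 0)
after_process = state["count"]

print(after_thread, after_process)
```

This prints `1 1`: the thread's increment is visible to the main program, while the child process's increment is lost when the child exits.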
Re^4: Sharing large data structures between threads by hermida (Scribe) on Mar 07, 2011 at 19:26 UTC
In the Coro docs: "Unlike the so-called "Perl threads" (which are not actually real threads but only the windows process emulation (see section of same name for more details) ported to UNIX, and as such act as processes), Coro provides a full shared address space, which makes communication between threads very easy. And coro threads are fast, too: disabling the Windows process emulation code in your perl and using Coro can easily result in a two to four times speed increase for your programs."
Re^5: Sharing large data structures between threads by BrowserUk (Patriarch) on Mar 07, 2011 at 19:36 UTC