How to share huge data structure between threads?

ph0enix has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: How to share huge data structure between threads?
by diotalevi (Canon) on Jan 10, 2003 at 14:43 UTC

BerkeleyDB (http://www.sleepycat.com) is well suited to such an application. Unfortunately the Perl module is really light on documentation. I'll provide a very quick example here, but you'll really want to read the documentation on the database web site. Since the Perl module is based on the C API, read the C API documentation and, wherever it shows C code, pretend it's Perl code.

The one caveat is that I never use the tie interface. All it does is call the object oriented interface anyway so I save a method call and just use the database as it's designed to be used. I have an example of object oriented (though not using the CDB features) BerkeleyDB up at http://www.greentechnologist.org/tiger/unpack.pl and http://www.greentechnologist.org/tiger/graph.pl. The CDB features just "happen" if you enable them.

 use strict;
 use warnings;
 use BerkeleyDB;

 my $env = get_environment();
 
 my $db = BerkeleyDB::Btree->new (
     -Filename => 'my_file.db',
     -Flags    => DB_CREATE,
     -Env      => $env
 ) or die
 "Couldn't open database at my_file.db: $BerkeleyDB::Error";
 
 # The database now supports concurrent access. You'd
 # just open it in each thread and use it. See
 # http://www.sleepycat.com/docs/ref/cam/intro.html
 # for info on the concurrent system.

 # You can also do nested transactions and logging. See
 # http://www.sleepycat.com/docs/ref/transapp/intro.html
 # and continue from there; otherwise just read the docs
 # from the table of contents.

 sub get_environment {
     BerkeleyDB::Env->new (
         -Flags => DB_CREATE     |
                   DB_INIT_MPOOL |
                   DB_INIT_CDB
     ) or die
     "Couldn't initialize BerkeleyDB environment: $BerkeleyDB::Error";
 }

Update: I should add that the Sleepycat documentation explicitly notes that BerkeleyDB's concurrent access modes work correctly across threads. I posted a code example for multi-process access; your multi-threaded version should read similarly, though there's no real reason you should need threading given your specified requirements.

Update: I didn't know the Perl module BerkeleyDB wasn't thread safe. The underlying library is. So if you follow my suggestion, you probably want multiple processes.

Fun Fun Fun in the Fluffy Chair


Re: How to share huge data structure between threads?
by djantzen (Priest) on Jan 10, 2003 at 15:04 UTC

Implicit sharing of nested structures is prohibited because it creates the potential for accidental sharing of private data. Since the ithreads model is predicated upon complete separation of all data by default, allowing references within shared parent structures to be shared implicitly would open the door to accidental corruption of data. From perlthrtut:

use threads;
use threads::shared;
my $var           = 1;
my $svar : shared = 2;
my %hash : shared;

# ... create some threads ...

$hash{a} = 1;       # all threads see exists($hash{a}) and $hash{a} == 1
$hash{a} = $var;    # okay - copy-by-value: same effect as previous
$hash{a} = $svar;   # okay - copy-by-value: same effect as previous
$hash{a} = \$svar;  # okay - a reference to a shared variable
$hash{a} = \$var;   # This will die
delete $hash{a};    # okay - all threads will see !exists($hash{a})

So the solution using threads is to take references to the things you wish to share at each level of a parent structure and to share them on a case by case basis. In other words, you must explicitly share not only the parent reference, but every reference contained therein.

Here's some example code of a basic object with shared members:

use strict;
use warnings;
package Foo;
sub new {
    my ($class, $arg) = @_;
    my $this = bless {}, $class;
    $this->{args} = undef;
    return $this;
}
sub set {
    my ($this, $arg) = @_;
    $this->{args}[0] = $arg; # setting an entry in a shared array reference
}
1;
# End of the module, and now a test script
use strict;
use warnings;
use Foo;
use threads;
use threads::shared;

my $foo = new Foo();
my $nested_array = [];
my $nested_string = 'bar';

share($foo);
share($nested_array);
share($nested_string);

$foo->{args} = $nested_array; # set the shared array reference
# pass in a reference to the shared scalar
my $thr1 = threads->create(sub { $foo->set(\$nested_string) });
# Update: if in Foo::set we manually set the argument passed, say, to 'quux',
# the object will contain that string rather than 'bar',
# proof that we do indeed have a shared nested reference.
$thr1->join();
print $foo->{args}[0];

It's a bother to do this, but it's better than accidental trampling of data. Hope this helps.
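
For comparison, here is a minimal sketch of the share-every-level rule described above. It is not taken from the posts in this thread: the names %data and add_item are made up for illustration, and it assumes a perl built with ithreads and a threads::shared recent enough to export shared_clone().

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Hypothetical top-level container for this sketch: a shared hash.
my %data :shared;

# Append an item to a named list, creating the nested array on demand.
sub add_item {
    my ($list, $item) = @_;
    lock(%data);    # serialize writers for the duration of this sub
    # A nested container must itself be shared before it is stored.
    # The & sigil bypasses share()'s prototype so it accepts a reference.
    $data{$list} = &share([]) unless exists $data{$list};
    push @{ $data{$list} }, $item;
}

# shared_clone() copies an ordinary nested structure and shares
# every level of the copy in a single call.
$data{config} = shared_clone({ retries => 3, hosts => ['a', 'b'] });

my @workers = map {
    my $id = $_;
    threads->create(sub { add_item('done', "worker $id") });
} 1 .. 3;
$_->join for @workers;

print scalar(@{ $data{done} }), " items; retries=$data{config}{retries}\n";
```

The first form shares each leaf by hand, which is what the explanation above asks for; shared_clone() is a convenience for the case where the structure already exists as ordinary private data.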


Re: How to share huge data structure between threads?
by PodMaster (Abbot) on Jan 10, 2003 at 14:53 UTC

The strategy with DB_File is to call tied(%hash)->sync after writing, and to retie before reading to ensure you get the latest data.

This will work fine, but only if you use a newer version of Berkeley DB (anything from about 2.5 up will work fine with this technique).
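
A rough sketch of that write-then-sync, retie-then-read pattern might look like the following. The file location and the helper names write_value and read_value are assumptions made for this example, and it does no locking of its own, so concurrent writers would still need something like flock on a separate lock file.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl;                        # O_RDWR, O_CREAT, O_RDONLY
use File::Temp qw(tempdir);

# Hypothetical scratch location for this sketch.
my $file = tempdir(CLEANUP => 1) . '/shared.db';

# Writer: tie, store, then push the data out to the file with sync().
sub write_value {
    my ($key, $value) = @_;
    tie my %h, 'DB_File', $file, O_RDWR | O_CREAT, 0666, $DB_HASH
        or die "Cannot tie $file: $!";
    $h{$key} = $value;
    tied(%h)->sync;               # flush cached pages to disk
    untie %h;
}

# Reader: tie afresh on every read so no stale cache is consulted.
sub read_value {
    my ($key) = @_;
    tie my %h, 'DB_File', $file, O_RDONLY, 0666, $DB_HASH
        or die "Cannot tie $file: $!";
    my $value = $h{$key};
    untie %h;
    return $value;
}

write_value(counter => 41);
write_value(counter => read_value('counter') + 1);
print read_value('counter'), "\n";
```

Retying on every read is slow for hot paths, which is part of why the alternatives listed next may be worth a look.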

If you want better transaction control, use BerkeleyDB.pm, which gives you access to the full API (just go buck wild).

Your other choice to consider is DBD::SQLite.

If any of this is too slow for you, you can always use Cache::Cache.

MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
** The Third rule of perl club is a statement of fact: pod is sexy.


Re: How to share huge data structure between threads?
by dragonchild (Archbishop) on Jan 10, 2003 at 15:12 UTC

Why are you using threads instead of processes? Apache's children are processes and it's extremely robust. Apache doesn't necessarily have to serve HTML, either. It's a CGI server which can serve anything you want. And Perl can be tightly integrated into it.
Why not set up the shared data structure as a SOAP process and have your children communicate with it? That way, you can even have your objects on another server and still be OK.

------
We are the carpenters and bricklayers of the Information Age.

Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.


Re: How to share huge data structure between threads?
by broquaint (Abbot) on Jan 10, 2003 at 15:30 UTC

My problem is that threads::shared can't share complex data structures and objects. How can this be solved?