This question involves more theory and design than a "show-me-some-code-snippets" question, so I figured Meditations would be the best place to post it. Here goes..

I have a script in development which forks off 'n' child processes, using Parallel::ForkManager. Each child process takes an item from a list, stores that item's name in a hash as a key, operates on the item, and stores the results of those operations (several other subroutines) in the hash under that key.

So far, so good, when running serially. When I use Parallel::ForkManager to fork these child processes, the hash is no longer "shared" across the children: each child works on its own copy, so writes made in one child never show up in the parent or in the other children. Enter IPC::Shareable. I can then do:

    my %options = (
        create    => 1,
        exclusive => 0,
        mode      => 0666,
        size      => 256000,
        destroy   => 1,
    );
    tie my %sh_hash, 'IPC::Shareable', 'content', { %options };

    # Fork and process items from the list

    IPC::Shareable->clean_up;
    IPC::Shareable->clean_up_all;

This allows me to write to the hash from all forked children. The problem here is that if the script dies prematurely, or a user cancels it with ^C, a huge number of stale shared memory segments is left behind and has to be cleaned up manually with ipcrm(8). Not a pretty situation. Enter END { }. That also didn't work quite as well as I expected. I thought about calling the IPC::Shareable clean_up() method from the run_on_finish() callback of Parallel::ForkManager's forked children, but that wasn't much more successful either.
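
For illustration, here is a minimal, untested sketch of the sort of guarded cleanup I have in mind: remember the parent's PID, and have an END block (plus a SIGINT handler) call the IPC::Shareable cleanup only in the original parent, since children inherit the END block across fork():

    use IPC::Shareable;

    my $parent_pid = $$;                     # remember who the original parent is

    my %options = ( create => 1, exclusive => 0, mode => 0666,
                    size => 256000, destroy => 1 );
    tie my %sh_hash, 'IPC::Shareable', 'content', { %options };

    $SIG{INT} = sub { exit 1 };              # on ^C, exit normally so END still runs

    END {
        # Only the original parent tears the segments down;
        # forked children inherit this block but skip the cleanup.
        if ( $$ == $parent_pid ) {
            IPC::Shareable->clean_up;
            IPC::Shareable->clean_up_all;
        }
    }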

I've started looking at other ways to accomplish this same goal, including using threads, threads::shared, IPC::ShareLite, and some other ideas. DBD::SQLite was also suggested by a few other monks, but it's quite heavyweight (though not as heavy as a full-blown RDBMS) to put on the client. I'd rather stay as close to core as possible, and if I have to use CPAN, I'd rather not rely on anything XS, because of client dependency issues.

What I need

What is the best way to do this? As it stands, each child process needs to populate a "master hash" which all other child processes read from, fork off their own children, and enter their own results into the "master hash". A real-world example of this is a web spider (and yes, this code involves some of that).

A parent page is fetched and its URL is put into the "master hash" as a key. The links are extracted from that page, and for each link found, a child process is forked to put that link's URL into the "master hash" as a key, fetch the page, process it, and store the content, content_length, status_line, and other information in the hash under that key. Seems simple, but it doesn't work so cleanly once you have to fork and share memory to make the "master hash" writable by all children.
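
To make the current approach concrete, here is roughly what that flow looks like with the tied hash (module choices, the URL, and the flat key scheme are mine, purely for illustration; the size option would almost certainly need raising for real pages):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parallel::ForkManager;
    use IPC::Shareable;
    use LWP::UserAgent;
    use HTML::LinkExtor;

    # The "master hash", shared across parent and children as above
    tie my %sh_hash, 'IPC::Shareable', 'content',
        { create => 1, exclusive => 0, mode => 0666, size => 256000, destroy => 1 };

    my $ua = LWP::UserAgent->new;
    my $pm = Parallel::ForkManager->new(10);          # at most 10 children at a time

    my $parent_url = 'http://www.example.com/';
    my $parent_res = $ua->get($parent_url);
    $sh_hash{$parent_url} = $parent_res->status_line; # parent page goes in first

    # Extract the links from the parent page
    my @links;
    my $extor = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            push @links, "$attr{href}" if $tag eq 'a' && $attr{href};
        },
        $parent_url,                                  # base URL, so links come back absolute
    );
    $extor->parse( $parent_res->content );

    for my $url (@links) {
        $pm->start and next;                          # parent keeps looping; child continues here

        my $res = $ua->get($url);

        # Values are kept flat (one scalar per key) so each write stays
        # inside the single tied segment.
        $sh_hash{"$url|status_line"}    = $res->status_line;
        $sh_hash{"$url|content_length"} = length $res->content;
        $sh_hash{"$url|content"}        = $res->content;

        $pm->finish;
    }
    $pm->wait_all_children;

    IPC::Shareable->clean_up_all;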

Without using IPC or shm/shared memory, what is the best way to accomplish this goal portably, scalably, and efficiently (i.e. fast)?

Re: Concurrent access to temporary and persistent storage
by crouchingpenguin (Priest) on Jun 03, 2003 at 18:38 UTC

    Just a few things...

    The problem here is that if the script prematurely dies, or the script is cancelled via a user's ^C of the script, there is a huge number of stale shared memory segments that have to be manually cleaned up by ipcrm(8). Not a pretty situation.

    Did you try $SIG{CHLD} = 'IGNORE'; ?

    Enter END { }. That also didn't do quite as well as I expected.

    Right, as each child inherits the END block across fork() and calls END() when it exits, not just the parent.

    I am looking forward to the responses you get, as I have run into this trying to build a load testing application that needs to fork off dozens of agents and report back the statistics. I used a table within a database for storing the agents' stats, and had the master controller process generate a graph from those statistics. It would be nicer to be able to pass that info back without having to rely on an external database/storage being set up beforehand.


    cp
    ----
    "Never be afraid to try something new. Remember, amateurs built the ark. Professionals built the Titanic."
Re: Concurrent access to temporary and persistent storage
by BrowserUk (Patriarch) on Jun 03, 2003 at 18:50 UTC

    Personally I would use threads for this, but if that is a problem, then you might take a look at the forks module.

    It is an emulation of threads and includes an emulation of the threads::shared functionality using sockets.

    I've no idea if this would play well with the forking you already have, but if not, you might be able to steal the sockets implementation it uses to allow processes to share hashes and arrays. Most of the relevant code is in the forks::shared module, I think.
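
    As a rough illustration of the shape of that approach, here is a minimal sketch using threads and threads::shared (the item list and the worker body are just placeholders); swapping the first two use lines for use forks; and use forks::shared; should give the forked equivalent:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use threads;
        use threads::shared;

        # One hash visible to every worker. Values are kept to plain scalars
        # here; sharing nested references takes extra care (e.g. shared_clone).
        my %master : shared;

        sub process_item {
            my ($item) = @_;
            my $status = "processed " . length($item) . " bytes";  # stand-in for real work
            lock %master;                                          # serialize writers
            $master{$item} = $status;
        }

        my @items   = qw(alpha beta gamma);
        my @workers = map { threads->create( \&process_item, $_ ) } @items;
        $_->join for @workers;

        print "$_ => $master{$_}\n" for sort keys %master;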

    I thought it was worth a mention anyway.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      forks is not portable.

        Of that I am not sure, but I'll take your word for it.

        However, the socket code that underlies the mechanism that allows two or more forked processes to appear to share common hashes and arrays almost certainly is portable to any platform that supports sockets. If the OP has to come up with his own IPC mechanism to allow multiple processes to share one or more hashes, the code in forks::shared would be a good place to buy wood, rather than having to grow a tree from seed.
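
        For the sake of illustration, a home-grown version of that mechanism might look something like this (untested): each child serializes its results with Storable and writes them back to the parent over a socketpair, and the parent merges them into the master hash. With many children or large payloads the parent would want to multiplex the reads with select() rather than draining the sockets one at a time.

            #!/usr/bin/perl
            use strict;
            use warnings;
            use Socket;
            use Storable qw(freeze thaw);

            my %master;     # only the parent ever touches this
            my %readers;    # pid => socket to read that child's results from

            for my $item (qw(alpha beta gamma)) {
                socketpair( my $reader, my $writer, AF_UNIX, SOCK_STREAM, PF_UNSPEC )
                    or die "socketpair: $!";
                binmode $_ for $reader, $writer;

                defined( my $pid = fork ) or die "fork: $!";

                if ( $pid == 0 ) {                          # child
                    close $reader;
                    my %result = ( item => $item, length => length $item );  # stand-in for real work
                    print {$writer} freeze( \%result );     # ship the results home
                    close $writer;
                    exit 0;
                }

                close $writer;                              # parent keeps the reading end
                $readers{$pid} = $reader;
            }

            # Collect each child's frozen hash and merge it into the master hash
            for my $pid ( keys %readers ) {
                my $fh     = $readers{$pid};
                my $frozen = do { local $/; <$fh> };        # read until the child closes its end
                close $fh;
                waitpid $pid, 0;

                my $result = thaw $frozen;
                $master{ $result->{item} } = $result;
            }

            print "$_ => $master{$_}{length}\n" for sort keys %master;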


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: Concurrent access to temporary and persistent storage
by perrin (Chancellor) on Jun 03, 2003 at 20:35 UTC
    If you didn't need persistence, I would point you at IPC::MM, which has unbeatable performance. Since you do need persistence, the best choice is a dbm-based one. You can use either BerkeleyDB, which has its own locking system, or MLDBM::Sync, which works with all the popular dbms. These are faster than SQLite.

    Of course you can also use MySQL for this and get performance that's good enough for most things.
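
    For concreteness, a minimal sketch of the MLDBM::Sync route (the file path, serializer, and keys are arbitrary here). Each read and write takes its own lock on the dbm file, so multiple forked children can update it safely; note that plain SDBM_File caps record sizes at roughly 1K, so large values want MLDBM::Sync::SDBM_File or DB_File underneath:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use MLDBM::Sync;                    # default underlying dbm is SDBM_File
        use MLDBM qw(SDBM_File Storable);   # serialize nested values with Storable
        use Fcntl qw(:DEFAULT);

        # One dbm file acts as the "master hash"; every process ties to the same path.
        my $sync = tie my %master, 'MLDBM::Sync', '/tmp/spider_results.dbm',
            O_CREAT | O_RDWR, 0640;

        # Each individual read/write is locked, so forked children can do this directly:
        $master{'http://www.example.com/'} = {
            status_line    => '200 OK',
            content_length => 1234,
        };

        # For a burst of updates, take the lock once instead of per operation:
        $sync->Lock;
        $master{'http://www.example.com/about'} = { status_line => '404 Not Found' };
        $sync->UnLock;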

      If you didn't need persistence, I would point you at IPC::MM, which has unbeatable performance.
      And that would disregard the portability requirement.
      ...These are faster than SQLite.
      But neither speaks SQL. You should only consider SQLite if your aim is to have an RDBMS.
        How do you know IPC::MM isn't portable? Ralf Engelschall's mm (which it is based on) was developed to provide portable shared memory for Apache.

        hacker is the one who brought up SQLite. I'm just telling him how it performs relative to these other choices.