I have a script in development which forks off 'n' child processes using Parallel::ForkManager. Each child takes an item from a list, stores that item's name as a key in a hash, operates on the item, and stores the results of those operations (several other subroutines) in the hash under that key.
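In serial form, the structure is roughly the following; do_this() and do_that() are hypothetical stand-ins for the real worker subroutines, and @items is a placeholder list:

    use strict;
    use warnings;

    # hypothetical stand-ins for the real worker subroutines
    sub do_this { "this result for $_[0]" }
    sub do_that { "that result for $_[0]" }

    my @items = qw(alpha beta gamma);   # placeholder list of items
    my %results;

    for my $item (@items) {
        # the item's name is the key; the results of the various
        # subroutines are stored beneath it
        $results{$item} = {
            this => do_this($item),
            that => do_that($item),
        };
    }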
So far, so good, when running serially. When I use Parallel::ForkManager to fork these child processes, however, the hash is no longer "shared" across the children: each child ends up working on its own copy, so writes made in one child never show up in the others or in the parent. Enter IPC::Shareable. I can then do:
    my %options = (
        create    => 1,
        exclusive => 0,
        mode      => 0666,
        size      => 256000,
        destroy   => 1,
    );
    tie my %sh_hash, 'IPC::Shareable', 'content', { %options };

    # Fork and process items from the list

    IPC::Shareable->clean_up;
    IPC::Shareable->clean_up_all;
This allows me to write to the hash from all forked children. The problem is that if the script dies prematurely, or a user cancels it with ^C, a large number of stale shared memory segments are left behind and have to be cleaned up manually with ipcrm(8). Not a pretty situation. Enter END { }. That also didn't do quite as well as I expected. I thought about calling IPC::Shareable's clean_up() from Parallel::ForkManager's run_on_finish() callback for the forked children, but that wasn't as successful either.
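For reference, the END-based cleanup I was aiming for looked roughly like this; the ^C handler and the parent-PID guard are reconstructions here, not the exact code I ran:

    use strict;
    use warnings;
    use IPC::Shareable;

    my $parent_pid = $$;           # remember which process created the segments

    $SIG{INT} = sub { exit 1 };    # let ^C unwind through END rather than dying raw

    END {
        # END blocks also run in forked children, so only the original
        # parent should tear the segments down
        IPC::Shareable->clean_up_all if $$ == $parent_pid;
    }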
I've started looking at other ways to accomplish the same goal, including threads, threads::shared, IPC::ShareLite, and some other ideas. DBD::SQLite was also suggested by a few other monks, but it's quite heavyweight (though not as heavy as a full-blown RDBMS) to put on the client. I'd rather stay as close to core as possible, and if I have to use CPAN, I'd rather not rely on anything XS, because of client dependency issues.
What I need
What is the best way to do this? As it stands, each child process needs to populate a "master hash" that all other child processes can read from; each of those children may then fork again, and their children enter their own results into the same "master hash". A real-world example of this is a web spider (and yes, this code involves some of that).
A parent page is fetched and its URL is put into the "master hash" as a key. The links are extracted from that page, and for each link found, a child process is forked to put that link's URL into the "master hash" as a key, fetch the page, process it, and store the content, content_length, status_line, and other information in the hash under that key. It seems simple, but it doesn't work so cleanly when you have to fork and share memory to ensure that the "master hash" is writable by all children.
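A rough sketch of that flow follows, repeating the tie setup from above for completeness; the example.com URL and the naive href regex are placeholders for the real fetching and parsing code:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use Parallel::ForkManager;
    use IPC::Shareable;

    my %options = (create => 1, exclusive => 0, mode => 0666,
                   size => 256000, destroy => 1);
    tie my %sh_hash, 'IPC::Shareable', 'content', { %options };

    my $ua = LWP::UserAgent->new;
    my $pm = Parallel::ForkManager->new(10);    # 'n' children

    my $parent_url = 'http://example.com/';     # placeholder URL
    my $parent     = $ua->get($parent_url);
    $sh_hash{$parent_url} = $parent->status_line;

    # crude link extraction; the real code uses a proper HTML parser
    my @links = $parent->content =~ /href="(http[^"]+)"/g;

    for my $link (@links) {
        $pm->start and next;                    # fork one child per link

        my $res = $ua->get($link);
        # the real code also records content, content_length, etc. for this key
        $sh_hash{$link} = $res->status_line;

        $pm->finish;
    }
    $pm->wait_all_children;

    IPC::Shareable->clean_up_all;

This is essentially what the current code does, and it works, until a crash or a ^C leaves the segments behind as described above.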
Without using IPC or shm/shared memory, what is the best way to accomplish this goal in a portable, scalable, and efficient (i.e. fast) manner?
Replies are listed 'Best First'.

Re: Concurrent access to temporary and persistant storage
    by crouchingpenguin (Priest) on Jun 03, 2003 at 18:38 UTC

Re: Concurrent access to temporary and persistant storage
    by BrowserUk (Patriarch) on Jun 03, 2003 at 18:50 UTC
        by Anonymous Monk on Jun 03, 2003 at 22:31 UTC
        by BrowserUk (Patriarch) on Jun 04, 2003 at 00:42 UTC

Re: Concurrent access to temporary and persistant storage
    by perrin (Chancellor) on Jun 03, 2003 at 20:35 UTC
        by Anonymous Monk on Jun 03, 2003 at 22:21 UTC
        by perrin (Chancellor) on Jun 03, 2003 at 22:29 UTC
        by Anonymous Monk on Jun 04, 2003 at 08:26 UTC