
This question involves more theory and design than a "show-me-some-code-snippets" question, so I figured Meditations would be the best place to post it. Here goes..

I have a script in development that forks off n child processes using Parallel::ForkManager. Each child operates on one item from a list: it stores the item's name as a key in a hash, runs several other subroutines on that item, and stores their results in the hash under the same key.

So far, so good, when running serially. When I use Parallel::ForkManager to fork these child processes, the hash is no longer "shared" across the children; each child writes to its own copy, so it looks as if only the first or the last one can write to it. Enter IPC::Shareable. I can then do:

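use strict;
use warnings;
use IPC::Shareable;

# Options for the shared memory segment backing the hash
my %options     = (create       => 1,
                   exclusive    => 0,
                   mode         => 0666,
                   size         => 256000,
                   destroy      => 1);

# 'content' is the glue (key) that identifies the segment
tie my %sh_hash, 'IPC::Shareable', 'content', { %options };

# Fork and process items from the list

# Remove the segments created by this process, then any others it has seen
IPC::Shareable->clean_up;
IPC::Shareable->clean_up_all;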

This allows me to write to the hash from all forked children. The problem here is that if the script dies prematurely, or a user cancels it with ^C, a huge number of stale shared memory segments are left behind and have to be cleaned up manually with ipcrm(8). Not a pretty situation. Enter END { }. That didn't work as well as I expected either. I thought about putting the IPC clean_up() method in the run_on_finish() method for Parallel::ForkManager's forked children, but that wasn't successful either.
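One possible reason END { } falls short, offered as an assumption rather than a diagnosis: Perl does not run END blocks when the process is killed by an untrapped signal such as the SIGINT sent by ^C, and forked children inherit the parent's END blocks and run them when they exit. The sketch below traps the signals and lets only the parent clean up. The names release_resources and @cleanup_log are hypothetical placeholders; the real cleanup call would go where the comment indicates.

```perl
use strict;
use warnings;

my $parent_pid = $$;     # remember which process owns the shared resources
my @cleanup_log;         # stands in for real cleanup work in this sketch

# Hypothetical cleanup routine. In the script above, this is where
# IPC::Shareable->clean_up would be called.
sub release_resources {
    my ($pid) = @_;
    return 0 unless $pid == $parent_pid;   # children must not tear down shared state
    push @cleanup_log, "cleaned up by $pid";
    return 1;
}

# Turn ^C and a plain kill into a normal exit so that END blocks still run.
$SIG{INT} = $SIG{TERM} = sub { exit 1 };

# Runs in the parent and in every forked child; the pid check inside
# release_resources makes it a no-op in the children.
END { release_resources($$) }
```

This does not help with signals that cannot be trapped (SIGKILL), so some out-of-band cleanup may still be needed after a hard kill.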

I've started looking at other ways to accomplish this same goal, including using threads, threads::shared, IPC::ShareLite, and some other ideas. DBD::SQLite was also suggested by a few other monks, but it's quite heavyweight (though not as heavy as a full-blown RDBMS) to put on the client. I'd rather stay as close to core as possible, and if I have to use CPAN, I'd rather not have to rely on anything XS, because of client dependency issues.

What I need

Access to a shared set of arrays and hashes from the parent and all forked child processes, so that any of them can read and write to the hashes at any time, unrestricted.
A way to maintain "persistence" (Storable?) across multiple invocations of the script. If I fork off 40 children, and 20 of them have completed successfully, and the user ^C's the script, I would like to pick up and run the remaining 20 that did not complete, at the next invocation of the script.
Proper non-IPC cleanup of shared memory segments, if used. It has to be portable, and this means it must run on POSIX systems and Windows systems equally.
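The persistence requirement could be sketched with the core Storable module as follows. This is a minimal illustration, not a finished design: the sub names, the state.sto file name, and the item names are all made up, and it assumes a single process (the parent) writes the checkpoint.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempdir);

# Load the previous run's results, or start empty if there is no checkpoint.
sub load_checkpoint {
    my ($file) = @_;
    return -e $file ? retrieve($file) : {};
}

# Write to a temporary file and rename it into place, so an interrupted
# write does not leave a truncated checkpoint behind.
sub save_checkpoint {
    my ($file, $done) = @_;
    nstore($done, "$file.tmp");
    rename "$file.tmp", $file or die "rename failed: $!";
}

# Items that still need processing on this invocation.
sub pending_items {
    my ($done, @items) = @_;
    return grep { !exists $done->{$_} } @items;
}

my $file  = tempdir(CLEANUP => 1) . '/state.sto';
my @items = map { "item$_" } 1 .. 4;

my $done = load_checkpoint($file);
$done->{$_} = { status => 'ok' } for @items[0, 1];   # pretend two items finished
save_checkpoint($file, $done);

# "Next invocation": only the unfinished items remain.
my @todo = pending_items(load_checkpoint($file), @items);
print "@todo\n";   # prints "item3 item4"
```

Saving after each completed item keeps the amount of repeated work after a ^C small, at the cost of rewriting the file each time.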

What is the best way to do this? As it stands, each child process needs to populate a "master hash", which all other child processes will read from, and then run their own fork, and enter their own results in the "master hash". A real-world example of this is a web-spider (and yes, this code involves some of that).

A parent page is fetched, the url is put into the "master hash" as a key. The links are extracted from that page, and then for each link found, fork a child process to put their url into the "master hash" as a key, fetch those pages and process them, returning the content, content_length, status_line, and other information into the hash for that key. Seems simple, but it doesn't seem to work so cleanly when you have to fork out and share memory to ensure that the "master hash" is writable by all children.

Without using IPC or shm/shared memory, what is the best way towards accomplishing this goal, in a portable, scalable (and efficient, i.e. fast) way?
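One possible direction, sketched under assumptions: keep the "master hash" in the parent only, and have each child send its result back over a pipe, serialized with the core Storable module. The process_item sub is a hypothetical stand-in for fetching and processing a page, and the sketch forks one child per item with no limit on concurrency, which a real spider would need to add.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Hypothetical stand-in for fetching and processing one URL.
sub process_item {
    my ($url) = @_;
    return { content_length => length $url, status_line => '200 OK' };
}

# Fork one child per item; each child sends its result back over a pipe.
# Only the parent ever writes to the master hash.
sub run_children {
    my (@items) = @_;
    my (%master, %child_for);

    for my $item (@items) {
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ($pid == 0) {                       # child
            close $reader;
            binmode $writer;
            print {$writer} freeze(process_item($item));
            close $writer;
            exit 0;
        }

        close $writer;                         # parent keeps only the read end
        binmode $reader;
        $child_for{$item} = [$pid, $reader];
    }

    for my $item (keys %child_for) {
        my ($pid, $reader) = @{ $child_for{$item} };
        my $frozen = do { local $/; <$reader> };   # read until the child closes its end
        close $reader;
        waitpid($pid, 0);
        $master{$item} = thaw($frozen);
    }
    return \%master;
}

my $master = run_children(qw(http://a.example/ http://b.example/x));
printf "%s => %d\n", $_, $master->{$_}{content_length} for sort keys %$master;
```

The trade-off is that children cannot see each other's results while running; they only see what the parent knew at fork time. Parallel::ForkManager offers a similar hand-off: finish() accepts an optional data structure reference, which is delivered to the run_on_finish() callback in the parent.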

In reply to Concurrent access to temporary and persistant storage by hacker
