This question involves more theory and design than a "show-me-some-code-snippets" question, so I figured Meditations would be the best place to post it. Here goes..

I have a script in development which forks off 'n' child processes, using Parallel::ForkManager. Each child process takes an item from a list, stores that item's name in a hash as a key, operates on the item, and stores the results of those operations (several other subroutines) in the hash under that key.

So far, so good, when running serially. When I use Parallel::ForkManager to fork these child processes, the hash is no longer "shared" across the children: each child works on its own copy, so writes made in one child never show up in the parent or in the other children. Enter IPC::Shareable. I can then do:

    my %options = (
        create    => 1,
        exclusive => 0,
        mode      => 0666,
        size      => 256000,
        destroy   => 1,
    );
    tie my %sh_hash, 'IPC::Shareable', 'content', { %options };

    # Fork and process items from the list

    IPC::Shareable->clean_up;
    IPC::Shareable->clean_up_all;

This allows me to write to the hash from all forked children. The problem here is that if the script dies prematurely, or a user cancels it with ^C, a huge number of stale shared memory segments is left behind and has to be cleaned up manually with ipcrm(8). Not a pretty situation. Enter END { }. That also didn't work quite as well as I expected. I thought about calling the IPC::Shareable clean_up() method from the run_on_finish() callback of Parallel::ForkManager's forked children, but that wasn't much more successful either.
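
For illustration, here is a minimal, untested sketch of the sort of guarded cleanup I have in mind: remember the parent's PID, and have an END block (plus a SIGINT handler) call the IPC::Shareable cleanup only in the original parent, since children inherit the END block across fork():

    use IPC::Shareable;

    my $parent_pid = $$;                     # remember who the original parent is

    my %options = ( create => 1, exclusive => 0, mode => 0666,
                    size => 256000, destroy => 1 );
    tie my %sh_hash, 'IPC::Shareable', 'content', { %options };

    $SIG{INT} = sub { exit 1 };              # on ^C, exit normally so END still runs

    END {
        # Only the original parent tears the segments down;
        # forked children inherit this block but skip the cleanup.
        if ( $$ == $parent_pid ) {
            IPC::Shareable->clean_up;
            IPC::Shareable->clean_up_all;
        }
    }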

I've started looking at other ways to accomplish this same goal, including using threads, threads::shared, IPC::ShareLite, and some other ideas. DBD::SQLite was also suggested by a few other monks, but it's quite heavyweight (though not as heavy as a full-blown RDBMS) to put on the client. I'd rather stay as close to core as possible, and if I have to use CPAN, I'd rather not rely on anything XS, because of client dependency issues.

What I need

What is the best way to do this? As it stands, each child process needs to populate a "master hash" which all other child processes read from, fork off their own children, and enter their own results into the "master hash". A real-world example of this is a web spider (and yes, this code involves some of that).

A parent page is fetched and its URL is put into the "master hash" as a key. The links are extracted from that page, and for each link found, a child process is forked to put that link's URL into the "master hash" as a key, fetch the page, process it, and store the content, content_length, status_line, and other information in the hash under that key. Seems simple, but it doesn't work so cleanly once you have to fork and share memory to make the "master hash" writable by all children.
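
To make the current approach concrete, here is roughly what that flow looks like with the tied hash (module choices, the URL, and the flat key scheme are mine, purely for illustration; the size option would almost certainly need raising for real pages):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parallel::ForkManager;
    use IPC::Shareable;
    use LWP::UserAgent;
    use HTML::LinkExtor;

    # The "master hash", shared across parent and children as above
    tie my %sh_hash, 'IPC::Shareable', 'content',
        { create => 1, exclusive => 0, mode => 0666, size => 256000, destroy => 1 };

    my $ua = LWP::UserAgent->new;
    my $pm = Parallel::ForkManager->new(10);          # at most 10 children at a time

    my $parent_url = 'http://www.example.com/';
    my $parent_res = $ua->get($parent_url);
    $sh_hash{$parent_url} = $parent_res->status_line; # parent page goes in first

    # Extract the links from the parent page
    my @links;
    my $extor = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            push @links, "$attr{href}" if $tag eq 'a' && $attr{href};
        },
        $parent_url,                                  # base URL, so links come back absolute
    );
    $extor->parse( $parent_res->content );

    for my $url (@links) {
        $pm->start and next;                          # parent keeps looping; child continues here

        my $res = $ua->get($url);

        # Values are kept flat (one scalar per key) so each write stays
        # inside the single tied segment.
        $sh_hash{"$url|status_line"}    = $res->status_line;
        $sh_hash{"$url|content_length"} = length $res->content;
        $sh_hash{"$url|content"}        = $res->content;

        $pm->finish;
    }
    $pm->wait_all_children;

    IPC::Shareable->clean_up_all;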

Without using IPC or shm/shared memory, what is the best way to accomplish this goal portably, scalably, and efficiently (i.e. fast)?

Re: Concurrent access to temporary and persistent storage
by crouchingpenguin (Priest) on Jun 03, 2003 at 18:38 UTC

    Just a few things...

    The problem here is that if the script prematurely dies, or the script is cancelled via a user's ^C of the script, there is a huge number of stale shared memory segments that have to be manually cleaned up by ipcrm(8). Not a pretty situation.

    Did you try $SIG{CHLD} = 'IGNORE'; ?

    Enter END { }. That also didn't do quite as well as I expected.

    Right, as each child inherits the END block across fork() and calls END() when it exits, not just the parent.

    I am looking forward to the responses you get, as I have run into this trying to build a load testing application that needs to fork off dozens of agents and report back the statistics. I used a table within a database for storing the agents' stats, and had the master controller process generate a graph from those statistics. It would be nicer to be able to pass that info back without having to rely on an external database/storage being set up beforehand.


    cp
    ----
    "Never be afraid to try something new. Remember, amateurs built the ark. Professionals built the Titanic."
Re: Concurrent access to temporary and persistent storage
by BrowserUk (Patriarch) on Jun 03, 2003 at 18:50 UTC

    Personally I would use threads for this, but if that is a problem, then you might take a look at the forks module.

    It is an emulation of threads and includes an emulation of the threads::shared functionality using sockets.

    I've no idea if this would play well with the forking you already have, but if not, you might be able to steal the sockets implementation it uses to allow processes to share hashes and arrays. Most of the relevant code is in the forks::shared module, I think.
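
    As a rough illustration of the shape of that approach, here is a minimal sketch using threads and threads::shared (the item list and the worker body are just placeholders); swapping the first two use lines for use forks; and use forks::shared; should give the forked equivalent:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use threads;
        use threads::shared;

        # One hash visible to every worker. Values are kept to plain scalars
        # here; sharing nested references takes extra care (e.g. shared_clone).
        my %master : shared;

        sub process_item {
            my ($item) = @_;
            my $status = "processed " . length($item) . " bytes";  # stand-in for real work
            lock %master;                                          # serialize writers
            $master{$item} = $status;
        }

        my @items   = qw(alpha beta gamma);
        my @workers = map { threads->create( \&process_item, $_ ) } @items;
        $_->join for @workers;

        print "$_ => $master{$_}\n" for sort keys %master;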

    I thought it was worth a mention anyway.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      forks is not portable.

        Of that I am not sure, but I'll take your word for it.

        However, the socket code that underlies the mechanism that allows two or more forked processes to appear to share common hashes and arrays almost certainly is portable to any platform that supports sockets. If the OP has to come up with his own IPC mechanism to allow multiple processes to share one or more hashes, the code in forks::shared would be a good place to buy wood, rather than having to grow a tree from seed.
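
        For the sake of illustration, a home-grown version of that mechanism might look something like this (untested): each child serializes its results with Storable and writes them back to the parent over a socketpair, and the parent merges them into the master hash. With many children or large payloads the parent would want to multiplex the reads with select() rather than draining the sockets one at a time.

            #!/usr/bin/perl
            use strict;
            use warnings;
            use Socket;
            use Storable qw(freeze thaw);

            my %master;     # only the parent ever touches this
            my %readers;    # pid => socket to read that child's results from

            for my $item (qw(alpha beta gamma)) {
                socketpair( my $reader, my $writer, AF_UNIX, SOCK_STREAM, PF_UNSPEC )
                    or die "socketpair: $!";
                binmode $_ for $reader, $writer;

                defined( my $pid = fork ) or die "fork: $!";

                if ( $pid == 0 ) {                          # child
                    close $reader;
                    my %result = ( item => $item, length => length $item );  # stand-in for real work
                    print {$writer} freeze( \%result );     # ship the results home
                    close $writer;
                    exit 0;
                }

                close $writer;                              # parent keeps the reading end
                $readers{$pid} = $reader;
            }

            # Collect each child's frozen hash and merge it into the master hash
            for my $pid ( keys %readers ) {
                my $fh     = $readers{$pid};
                my $frozen = do { local $/; <$fh> };        # read until the child closes its end
                close $fh;
                waitpid $pid, 0;

                my $result = thaw $frozen;
                $master{ $result->{item} } = $result;
            }

            print "$_ => $master{$_}{length}\n" for sort keys %master;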


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: Concurrent access to temporary and persistent storage
by perrin (Chancellor) on Jun 03, 2003 at 20:35 UTC
    If you didn't need persistence, I would point you at IPC::MM, which has unbeatable performance. Since you do need persistence, the best choice is a dbm-based one. You can use either BerkeleyDB, which has its own locking system, or MLDBM::Sync, which works with all the popular dbms. These are faster than SQLite.

    Of course you can also use MySQL for this and get performance that's good enough for most things.
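
    For concreteness, a minimal sketch of the MLDBM::Sync route (the file path, serializer, and keys are arbitrary here). Each read and write takes its own lock on the dbm file, so multiple forked children can update it safely; note that plain SDBM_File caps record sizes at roughly 1K, so large values want MLDBM::Sync::SDBM_File or DB_File underneath:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use MLDBM::Sync;                    # default underlying dbm is SDBM_File
        use MLDBM qw(SDBM_File Storable);   # serialize nested values with Storable
        use Fcntl qw(:DEFAULT);

        # One dbm file acts as the "master hash"; every process ties to the same path.
        my $sync = tie my %master, 'MLDBM::Sync', '/tmp/spider_results.dbm',
            O_CREAT | O_RDWR, 0640;

        # Each individual read/write is locked, so forked children can do this directly:
        $master{'http://www.example.com/'} = {
            status_line    => '200 OK',
            content_length => 1234,
        };

        # For a burst of updates, take the lock once instead of per operation:
        $sync->Lock;
        $master{'http://www.example.com/about'} = { status_line => '404 Not Found' };
        $sync->UnLock;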

      If you didn't need persistence, I would point you at IPC::MM, which has unbeatable performance.
      And that would disregard the portability requirement.
      ...These are faster than SQLite.
      But neither speaks SQL. You should only consider SQLite if your aim is to have an RDBMS.
        How do you know IPC::MM isn't portable? Ralf Engelschall's mm (which it is based on) was developed to provide portable shared memory for Apache.

        hacker is the one who brought up SQLite. I'm just telling him how it performs relative to these other choices.