learnedbyerror has asked for the wisdom of the Perl Monks concerning the following question:

Oh Monks,

Yet again, I am contemplating going where the wise fear to tread - should I override fork?

I am building a module to provide shared tied variable(s), using BerkeleyDB, across a fork call. I would like to make it as seamless as possible to use. However, the only way that I can see to make it seamless at this time is to override the perl fork function, a la forks::BerkeleyDB.

The thought of overriding a core function is, as it should be, raising the hair on the back of my neck and making me nervous.

My thought is to do something like:

sub _fork {
    ### safely sync/close databases, close environment ###
    _untie_shared_vars();
    _close_BerkeleyDB_env();

    ### do the fork ###
    my $pid = CORE::fork;
    croak("Unable to fork: $!") unless defined $pid;

    ### parent and child each re-open the environment and
    ### immediately re-tie the shared variables ###
    if ($pid) {    # in parent
        _open_BerkeleyDB_env();
        _tie_shared_vars();
    }
    else {         # in child
        _open_BerkeleyDB_env();
        _tie_shared_vars();
    }

    return $pid;
}
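Installing that as the global override would, as best I can tell from forks::BerkeleyDB, look something like this minimal sketch; the assignment has to be compiled before any code that calls fork:

    BEGIN {
        ## silence the "used only once" warning on the glob assignment
        no warnings 'once';
        *CORE::GLOBAL::fork = \&_fork;
    }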

My question to you is: am I missing a less drastic option, or am I worrying too much about the override?

Thanks in advance for your guidance.

lbe

The following is a draft readme that I plan on including with the distribution.

This distribution is a work in progress (started 12/11/11). There will be more to come.

The goal of this distribution is to provide an easy means to share data structures between processes and threads. It does so using objects with convenience methods that are tied to BerkeleyDB hashes or recnos. This functionality already exists for threads via threads::shared, which uses shared memory (RAM). This distribution may be useful with threads when the hash(es) and/or array(s) are too large to be stored in RAM.

Additionally, this distribution provides a queue module, similar to Thread::Queue, that can be used across processes.
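Intended usage is along these lines (the class and method names here are illustrative placeholders, not the final API):

    ## illustrative sketch only -- the queue class name is a placeholder
    my $q = BDB::Shared::Queue->new( -Filename => 'work-queue.db' );

    defined( my $pid = fork ) or die "fork failed: $!";
    if ($pid) {                        # parent produces
        $q->enqueue($_) for 1 .. 100;
        waitpid $pid, 0;
    }
    else {                             # child consumes
        for ( 1 .. 100 ) {
            my $item = $q->dequeue;    # blocks until an item is available
            ## ... process $item ...
        }
        exit 0;
    }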

The data stores of all objects are based upon the Berkeley DB Concurrent Data Store (CDS). The module handles all locking needed to ensure that only a single writer is allowed at any one time. CDS was selected to favor speed over absolute integrity. This means that if an error occurs while a change is being written to the database, the database will be left in an uncertain state. Given the overall stability of the BerkeleyDB code, this is unlikely, but still possible. If absolute reliability is required, one should use BerkeleyDB directly and make use of its Transactional Data Store (TDS) capability.
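For reference, the underlying CDS setup that this distribution wraps looks roughly like the following (standard BerkeleyDB.pm usage; the home directory and filename are placeholders):

    use BerkeleyDB;

    ## a Concurrent Data Store environment: many readers, one writer,
    ## no transactions or recovery -- hence the speed/integrity trade-off
    my $env = BerkeleyDB::Env->new(
        -Home  => '/tmp/shared-store',
        -Flags => DB_CREATE | DB_INIT_CDB | DB_INIT_MPOOL,
    ) or die "cannot open environment: $BerkeleyDB::Error";

    my %shared;
    tie %shared, 'BerkeleyDB::Hash',
        -Filename => 'shared.db',
        -Flags    => DB_CREATE,
        -Env      => $env
      or die "cannot tie hash: $BerkeleyDB::Error";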

As stated above, it is the author's intent that this module be used between processes/threads; hence "thread safe" and "fork safe" are goals that must be achieved in order to be successful. Care has been taken to ensure that this module achieves this functionality; however, given the lack of precisely clear definitions for either thread or fork safety, it is very possible that the author has not adequately contemplated situations that may cause deadlock or race problems. As such, the author welcomes any feedback, preferably with corrected code to address problems and tests to validate the fixes.

I started working on this distribution after trying many, if not every, module available on CPAN that supports shared data across processes. I found three things:

  1. Robust, fast, low-level, and largely undocumented support using BerkeleyDB.
  2. A fairly functional IPC::Lite, based on SQLite3, that contained a few easily correctable bugs but was not fast enough to meet my needs.
  3. A lot of pieces of functionality that didn't address the whole picture, or aging modules like IPC::Shareable that simply would not work with newer versions of perl.

So, I decided to pull my thoughts together and try to roll something of my own. I have most of the base functionality working, having cobbled together portions of code from forks::BerkeleyDB and Thread::Queue and sub-classed BerkeleyDB::Hash and BerkeleyDB::Recno. The last big question I have is how to make the implementation simple and easy to consume. My current approach, providing methods that are explicitly called to close the connections before forking and to re-open them in the parent and child afterward, is functional, as sketched below. But I feel like there is, or should be, a better, less intrusive method of implementation, hence the question above.
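In outline, that explicit approach looks like this (the helper names are from my working code and may change):

    ## close everything down cleanly before the fork ...
    _untie_shared_vars();
    _close_BerkeleyDB_env();

    defined( my $pid = fork ) or die "fork failed: $!";

    ## ... then both parent and child re-open and re-tie
    _open_BerkeleyDB_env();
    _tie_shared_vars();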

Re: To override fork, or not to override fork
by Anonymous Monk on Dec 21, 2011 at 19:07 UTC
      Very interesting! I haven't seen Sub::Exporter::Lexical before. I need to think about how it could be cleanly used without having the same impact as overriding fork in main. Any tips? Can I place all of the subroutines in a scope with it used at the top so that only forks from within those subs would be affected? If so, would the scope extend to a fork made by another module like Parallel::Forker?
Re: To override fork, or not to override fork
by Tanktalus (Canon) on Dec 24, 2011 at 15:09 UTC

    On one hand, overriding CORE::GLOBAL::fork is about the only way you're going to be able to do this. On the other hand, I somehow doubt you're the only person wanting to override fork. And if more than one person wants to override fork in the same process, then something's not going to work.

    My suggestion is simply go for it. But provide for alternatives.

    I would think of an interface kind of like this: when I use Tie::SharedDB qw(:fork);, I'm asking you to override global fork. That means I assume responsibility for not colliding with anyone else. However, when I don't pass in :fork, I'm then assuming responsibility to call Tie::SharedDB::prefork() and Tie::SharedDB::postfork($rc_from_fork) myself. (The $rc_from_fork gives you the ability to do something different in the child vs the parent in the future.) Your global override, should it be used, would use the prefork/postfork functions internally as well, just to maintain consistent code paths.
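    Roughly, the two styles would look like this (Tie::SharedDB is just my stand-in name):

        ## style 1: opt in to the global override; you accept
        ## responsibility for not colliding with other overrides
        use Tie::SharedDB qw(:fork);
        my $pid = fork;                    # transparently wrapped

        ## style 2: no override; you bracket fork yourself
        use Tie::SharedDB;
        Tie::SharedDB::prefork();
        my $pid2 = fork;
        Tie::SharedDB::postfork($pid2);    # parent/child may differ later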

    At this point, it becomes the user's choice as to which one works for them. For the majority of users, the global override will be sufficient. But for those who have competing global overrides, you provide an option.

    The problem with your thoughts on using Sub::Exporter::Lexical is that you end up having to patch every module currently available on CPAN, and every module that will ever be released on CPAN, that uses fork. And that's just nuts. The global override does that for you, more or less, for the vast majority of cases, while the more cumbersome prefork/postfork functions most likely let the user handle the rest.

    Another possibility is to create an external fork module that does the override. Say "Fork::Common" or something, where it registers pre- and post-fork duties. If this module is available (checking eval { require Fork::Common; 1 } would suffice), it would allow you to register your pre/post fork functions with it, and not override the global fork. And then, theoretically, other modules that want to override fork could do likewise, and we'd end up with a single override of CORE::GLOBAL::fork that everyone could share. Of course, this would require other modules that need overrides to switch, but you'd still have your pre/post fork functions available in the meantime.
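    Registration might be as simple as something like this (Fork::Common and its register function are hypothetical, of course):

        ## hypothetical -- Fork::Common does not exist (yet); this is
        ## just the shape the registration might take
        if ( eval { require Fork::Common; 1 } ) {
            Fork::Common::register(
                prefork  => \&Tie::SharedDB::prefork,
                postfork => \&Tie::SharedDB::postfork,
            );
        }
        else {
            ## fall back to our own global override; in practice this
            ## has to happen at compile time, e.g. in import()
            no warnings 'once';
            *CORE::GLOBAL::fork = \&Tie::SharedDB::_fork;
        }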