http://qs1969.pair.com?node_id=92078

sutch has asked for the wisdom of the Perl Monks concerning the following question:

I'm about to design a web application in Perl, to run on an Apache server (Linux), which will require a large (possibly hundreds of megabytes) in-memory data structure. I would like to initialize the data structure once from a database or file, then allow updates to affect the structure as well as the persistent datastore, so that all future HTTP processes can access and update the most recent data.

I've tried using mod_perl for this type of persistent, in-memory data, but any data structures that are initially shared get copied once per process and become unshared as soon as the data is changed by a process.

Do any other Monks have experience with sharing Perl data structures among multiple Perl (Apache) processes? What Perl, Apache, and/or Linux options are available for handling this?

Replies are listed 'Best First'.
Re: Sharing data structures among http processes?
by tomhukins (Curate) on Jun 28, 2001 at 02:38 UTC

    I've already partly answered your question in Re: (ar0n: pnotes) Re: Sharing data structures among http processes?, to clarify another response.

    To directly answer your question, I should explain that you are encountering a feature of Unix called copy-on-write memory.

    Apache 1.x on Unix uses a pre-forking model, whereby a parent httpd process forks off a number of child httpds, each of which handles one request at a time. When a Unix process forks, another identical process is created. To save memory, this identical process shares all of its data with the parent process. However, as soon as either process changes a certain area of memory, separate copies of that memory are created for each process. Hence the name copy-on-write. This is what you are observing as the data becoming unshared.
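    A minimal sketch (not from the original post) that makes this visible: the child process inherits the parent's data, but as soon as it writes to it, it is working on its own private copy.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %data = ( counter => 0 );       # built before the fork, shared via copy-on-write

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {                 # child, like one httpd child process
        $data{counter}++;              # this write gives the child its own private copy
        print "child sees  counter = $data{counter}\n";   # 1
        exit 0;
    }

    waitpid( $pid, 0 );
    print "parent sees counter = $data{counter}\n";       # still 0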

    Aside from memory requirements, the technique you are using suffers from another problem. Imagine you have 3 httpd processes (called A, B and C) serving requests, each with a name and phone number data structure. Initially, each process holds the following information:

    NAME          PHONE
    ----          -----
    Fingermouse   528
    Bergen        392
    Lobster       771
    
    If process A receives a request to change Bergen's number to 398, the data structures in processes B and C will not be affected. So, if a request to retrieve Bergen's number reaches process C, it will report that Bergen's number is still 392. Thus, it is important to share data sets between processes if the data may be written to.

      Thanks tomhukins, this looks to be what is needed. And your explanation of copy-on-write memory sheds light on why I was experiencing the unsharing of memory.

      I do have another related question: are there any methods for sharing a process (or a Perl object) among processes? For example, I want one shared object to update the data structure, to ensure integrity and to write the changes back to the persistent storage. Or is there a better method for handling this than one shared object attempting to service many requests?

        There are many ways to share data between processes. You can use a local dbm file. You can use IPC::Shareable. And so on. But all of the efficient ones have the rather significant problem that all of the requests in your series have to come back to the same physical machine. This does not play well with load balancing.
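        For reference, a minimal IPC::Shareable sketch (the glue key and options here are illustrative, not from the thread): variables tied to the same key live in SysV shared memory, so every process on the machine sees the same data - though the load-balancing caveat above still applies.

        use strict;
        use warnings;
        use IPC::Shareable;

        # 'phon' is an arbitrary glue key; all processes must agree on it
        tie my %phone, 'IPC::Shareable', 'phon', { create => 1, mode => 0666 };

        $phone{Bergen} = 398;   # now visible to every process tied to 'phon'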

        However one crazy way of doing it is like this. One machine gets the request and forks off a local server. Other CGI requests are passed the necessary information to be able to access this temporary server, which is run in a persistent process, and then when this server decides the time is right, it de-instantiates itself. This would be a lot of work though.

        Personally I would just see if you can keep the temporary state in the database, and just have each individual request deal with the bit of the state that they need to handle. But I cannot, of course, offer any guesses on whether this would work without knowing more details than you have given us.

        Having tried myself to implement a semaphore-based locking mechanism for SysV IPC shared memory, I'd recommend IPC::ShareLite for general purposes. It comes with a powerful locking mechanism that is remarkably similar to flock()!
        Apache::SharedMem also depends on this module.
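        A short IPC::ShareLite sketch along those lines (the key is arbitrary and the data illustrative); ShareLite stores plain strings, so complex structures are usually serialised with something like Storable:

        use strict;
        use warnings;
        use IPC::ShareLite;
        use Storable qw(freeze thaw);

        my $share = IPC::ShareLite->new(
            -key     => 1971,     # any integer key the processes agree on
            -create  => 'yes',
            -destroy => 'no',
        ) or die "cannot attach shared memory segment: $!";

        # one process writes a structure...
        $share->store( freeze( { Bergen => 398 } ) );

        # ...and any other process can read it back
        my $phone = thaw( $share->fetch );
        print $phone->{Bergen}, "\n";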
(ar0n: pnotes) Re: Sharing data structures among http processes?
by ar0n (Priest) on Jun 28, 2001 at 02:08 UTC
    You may have already looked at this, but both Perl and PHP support Apache notes, which allow you to pass data (structures) between handlers. From the mod_perl guide:
    Let's say that you wrote a few handlers to process a request, and they all need to share some custom Perl data structure. The pnotes() method comes to your rescue.
    # a handler that gets executed first
    my %my_data = (foo => 'mod_perl', bar => 'rules');
    $r->pnotes('my_data' => \%my_data);
    The handler prepares the data in the hash %my_data and calls the pnotes() method to store the data internally for other handlers to re-use. All the subsequently called handlers can retrieve the stored data in this way:
    my $info = $r->pnotes('my_data');
    print $info->{foo};
    prints:
    mod_perl
    The stored information will be destroyed at the end of the request.
    I've been meaning to play with it for a while, but haven't gotten around to it yet. It sure looks cool ;)

    [ ar0n ]


    update: Hrm, I guess I misunderstood the question. Sorry 'bout that.

      This is fine for passing data between handlers or to subrequests within an individual httpd process, but it doesn't share data between separate processes, as sutch asked.

      Apache::SharedMem can be used to share data between Apache processes.

Re: Sharing data structures among http processes?
by jbert (Priest) on Jun 28, 2001 at 12:40 UTC

    Other answers have described copy-on-write memory in Unix and the fact that you need to use shared memory (SysV SHM or mmap'd files); a good approach is to use a wrapper around these things, such as Apache::SharedMem.

    Shared memory is tricky stuff in a similar way that threads are tricky things, since you open yourself up to race conditions where two processes alter the shared memory and violate each other's assumptions.

    As a simple example, a process might increment a variable held in shared memory by 1 and assume that it has that value later on in the same routine, whereas another process might have incremented it in the meantime. These are hard-to-find bugs, which are cured by adding locks (semaphores, mutexes, whatever) to define critical sections of code that only one process at a time may execute. Ugh.
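    A sketch of that kind of critical section using the IPC::ShareLite locking mentioned elsewhere in this thread (the key and counter are illustrative, and the :lock import tag is assumed to provide the LOCK_EX constant): the exclusive lock makes the read-modify-write atomic.

    use IPC::ShareLite qw( :lock );

    my $share = IPC::ShareLite->new( -key => 1971, -create => 'yes' )
        or die "cannot attach shared memory segment: $!";

    $share->lock( LOCK_EX );         # start of the critical section
    my $count = $share->fetch || 0;
    $share->store( $count + 1 );     # no other process can interleave here
    $share->unlock;                  # end of the critical section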

    There might be a simpler solution for you though. You mention that you want changes to your data to go directly to persistent store (i.e. on disk) but you also want your data to live in memory.

    I'll assume that you want the data in memory for performance reasons - i.e. you don't want to suffer a disk access per request. But... operating systems are smart, and if you have sufficient RAM on your box (say, for example, enough to hold the data structure you were talking of) and you are repeatedly accessing this data, then the OS should keep it all nicely in cache for you. So whilst you might be accessing hash values in a GDBM tie'd hash, the OS doesn't bother to touch disk. When you change data, the OS has the job of getting it to disk. If your data store is a relational database, similar arguments apply.
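    To make that concrete, a minimal GDBM_File tie (the file name is illustrative); the hash is read and written through the file system, so frequently used pages stay in the OS file cache:

    use strict;
    use warnings;
    use GDBM_File;

    tie my %phone, 'GDBM_File', '/var/data/phone.gdbm', &GDBM_WRCREAT, 0640
        or die "cannot tie GDBM file: $!";

    $phone{Bergen} = 398;         # the OS takes care of getting this to disk
    print $phone{Bergen}, "\n";   # warm reads are served from the OS cache, not disk
    untie %phone;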

    The nice thing about this is that you get it for free. You still need to be careful in that different processes may change the underlying data store at a time which might be inconvenient for the other processes - this is where atomic transactions on databases come into play...

    There might be other reasons why you want the memory structure, but I thought it was worth a thought.

      You are correct, the shared memory structure is for performance reasons. It is for an application that I expect to be accessed often. The queries against the database are complex and will probably overload the database server so much that the required performance will not be met with a database alone.

      Your GDBM idea sounds good enough, as long as the OS can be made to share the cache among all of the processes. Will the GDBM tied hash be automatically shared (through the OS), or does that need to be shared using shared memory? Or does this method require that each process have a separate tied hash?

        The OS-level caching I mentioned was simply good old file-level caching. If your data store is held in files accessed through the file system (as is the case for simple databases like GDBM, flat files, etc.) then often-used data is kept around in RAM - shared between processes.

        You still need to spend some CPU cycles in doing lookups, etc but you don't spend any I/O - which is a win.

        OK - so your back-end data store is in a database which you wish to protect from the load which your hits are likely to generate. Do you know for certain this is going to be a problem? If not, can you simulate some load to see?

        Presumably you don't want to cache writes - you want them to go straight to the DB.

        So you want a caching layer in front of your data which is shared amongst processes and invalidated correctly as data is written.

        I don't know which DB you are using, but I would imagine most/many would have such a caching layer. If this isn't possible, or it doesn't stand up to your load, then the route I would probably choose is to funnel all the DB access through one daemon process which can implement your caching strategy and hold one copy of the cached info.
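        One possible shape for such a daemon (entirely illustrative: the socket path, the one-key-per-line protocol, and get_from_db() are assumptions, not anything from the thread): a single process listens on a Unix socket, keeps the only in-memory cache, and is the only thing that talks to the database.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use IO::Socket::UNIX;

        my $path = '/tmp/cache-daemon.sock';    # illustrative socket path
        unlink $path;

        my $server = IO::Socket::UNIX->new(
            Type   => SOCK_STREAM(),
            Local  => $path,
            Listen => 5,
        ) or die "cannot listen on $path: $!";

        my %cache;    # the single shared copy of the cached data

        while ( my $client = $server->accept ) {
            chomp( my $key = <$client> );        # protocol: one key per line
            $cache{$key} = get_from_db($key)     # cache miss: hit the DB once
                unless exists $cache{$key};
            print {$client} $cache{$key}, "\n";
            close $client;
        }

        sub get_from_db {
            my ($key) = @_;
            # placeholder for the real (expensive) database query
            return "value for $key";
        }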

        But I wouldn't do that until it was clear that I couldn't scale my DB to meet my load or do something else...say regularly build read-only summaries of the complex queries.

        I guess it all kind of depends on the specifics of your data structures...sorry to be vague. There is a good article describing a scaling-up operation at webtechniques which seems informative to me.

Re: Sharing data structures among http processes?
by TravelByRoad (Acolyte) on Jun 28, 2001 at 18:45 UTC
    One technique that comes to mind is to stamp each record in the on-disk database with the time it was most recently changed, and make that timestamp an indexed field.

    When the server responds to a page request, it would first query the database for recently changed records (those changed since this process's last page request) and weave them into the in-memory data structure before running the requested query in memory.

    This approach would task the database server with maintaining a consistent state among processes, with each process synchronizing its in-memory state with that consistent state at the start of each page request.
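    A hedged sketch of that synchronization step using DBI (the DSN, table, and column names are illustrative, and the timestamp is assumed to be a plain epoch number): each process remembers the timestamp of its last sync and asks only for rows changed since then.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:Pg:dbname=phonebook', 'user', 'pass',
                            { RaiseError => 1 } );

    my %phone;            # this process's in-memory copy
    my $last_sync = 0;    # timestamp of the previous page request

    sub sync_in_memory_data {
        my $sth = $dbh->prepare(
            'SELECT name, phone, changed_at FROM numbers WHERE changed_at > ?'
        );
        $sth->execute($last_sync);
        while ( my ($name, $number, $changed) = $sth->fetchrow_array ) {
            $phone{$name} = $number;     # weave the change into memory
            $last_sync = $changed if $changed > $last_sync;
        }
    }

    # call sync_in_memory_data() at the start of every page request,
    # then answer the request from %phone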

    TbR