in reply to Re^4: PerlMonks Caching (still racy)
in thread PerlMonks Caching

So, we're talking about a race condition which might display outdated content in certain cases, but doesn't break any data in the database.

Btw., I think I like my version of using memcached more, probably because it acts like a cache: request the thread from the cache, then (in cache ? display : get from db). Actively writing all updated things to the cache is more like pushing.
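
A minimal sketch of that read path with Cache::Memcached; the server address, key names, and render_thread_from_db() are placeholders for illustration, not my actual code:

    use strict;
    use warnings;
    use Cache::Memcached;

    # Hypothetical cache handle; server and namespace are placeholders.
    my $memd = Cache::Memcached->new({
        servers   => ["127.0.0.1:11211"],
        namespace => "board:",
    });

    sub get_thread_html {
        my ($thread_id) = @_;
        my $key = "thread:$thread_id";

        # In cache? Display it.
        my $html = $memd->get($key);
        return $html if defined $html;

        # Not in cache: build it from the DB, store it for the next reader.
        $html = render_thread_from_db($thread_id);  # hypothetical render step
        $memd->set($key, $html, 180);               # e.g. a three-minute expiry
        return $html;
    }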

For example, I have a cache of the overview page (which displays the newest node of each of the sub-boards), of the recent-24h page (which displays a list of threads updated in the last 24 hours), and of the threads themselves. When updating a thread, I delete just these three cache entries and I'm done. Following your strategy of recreating all three entries right at update time seems like too much work to me, especially since you don't know whether any of them will actually be requested in the next few minutes. Only creating them when needed seems more natural to me.
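
In code, that update path is just the three deletes, reusing the $memd handle from the sketch above (key names again made up):

    # After a thread is updated, drop the affected cache entries;
    # each page is rebuilt lazily the next time somebody requests it.
    sub invalidate_thread_caches {
        my ($thread_id) = @_;
        $memd->delete("thread:$thread_id");
        $memd->delete("overview");    # newest node of each sub-board
        $memd->delete("recent24h");   # threads updated in the last 24 hours
    }
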
And about what to cache in general: my first candidates here on perlmonks would be the Newest Nodes page and then the RAT page. For those pages the race condition matters even less, especially if the cache entry only lasts for three minutes.

Re^6: PerlMonks Caching (push)
by tye (Sage) on Apr 21, 2010 at 20:52 UTC

    Of course it is only about things being displayed badly. You don't use memcached for storing the real data, just for caching it (it doesn't try to be reliable enough to be a primary data source).

    I don't like updates to a shared cache being done primarily when reading. When a cache entry expires, every read attempt does extra work to construct a new entry until one of them finally gets the update pushed to the cache. Since the cache is fast, it is easy for a bunch of requests to notice that the cache entry doesn't exist, and then they all start the slower work of building the entry before the first one can finish that slow step. That can certainly make your approach less efficient.
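
    Roughly, the pile-up looks like this; one common way to blunt it, shown in the sketch, is to use memcached's add() as a cheap rebuild lock (the names here are made up, and this is an illustration, not what we run):

        use Cache::Memcached;
        my $memd = Cache::Memcached->new({ servers => ["127.0.0.1:11211"] });

        sub get_with_rebuild_lock {
            my ($key, $builder) = @_;

            my $val = $memd->get($key);
            return $val if defined $val;

            # Entry missing: many processes can reach this point at once.
            # add() only succeeds if the key is absent, so exactly one
            # process wins the lock and does the slow rebuild.
            if ($memd->add("lock:$key", 1, 30)) {
                $val = $builder->();            # the slow DB work
                $memd->set($key, $val, 180);
                $memd->delete("lock:$key");
                return $val;
            }

            # Lost the race: re-check shortly instead of hitting the DB.
            sleep 1;
            return $memd->get($key);    # may still be undef; caller must cope
        }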

    I much prefer readers to get the slightly old version of things until the update has completed and been pushed to the cache. It is silly to cause extra read activity to a subset of data exactly at the time you are trying to make updates to that subset of data. That just exacerbates concurrency problems in the DB.
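
    In code, the push style is the mirror image, reusing the $memd handle from the sketch above (function and key names are again placeholders):

        sub update_thread {
            my ($thread_id, $new_body) = @_;

            # Do the DB update first; readers keep serving the slightly
            # old cached copy meanwhile, so there is no burst of misses.
            apply_update_to_db($thread_id, $new_body);     # hypothetical

            # Then push the fresh rendering into the cache in one step.
            my $html = render_thread_from_db($thread_id);  # hypothetical
            $memd->set("thread:$thread_id", $html, 180);
        }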

    Under your scheme, it would actually make more sense to not delete from the cache until after the updates to the DB have been completed.

    - tye        

      Under your scheme, it would actually make more sense to not delete from the cache until after the updates to the DB have been completed.
      Uhm, that's what I'm doing. After the transaction is committed, I do the cache delete and then the redirect.
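
      Roughly, with DBI, the ordering is (connection details, schema, and values are placeholders, not my actual code):

          use DBI;
          use Cache::Memcached;

          my $dbh  = DBI->connect("dbi:mysql:board", "user", "pass",
                                  { RaiseError => 1 });
          my $memd = Cache::Memcached->new({ servers => ["127.0.0.1:11211"] });

          my ($thread_id, $new_body) = (42, "updated text");  # example values

          $dbh->begin_work;                     # start the transaction
          $dbh->do("UPDATE threads SET body = ? WHERE id = ?",
                   undef, $new_body, $thread_id);
          $dbh->commit;                         # 1. DB change is durable

          $memd->delete("thread:$thread_id");   # 2. drop the stale entry

          # 3. only then redirect the client back to the thread
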
      Anyway, let's say that's a matter of taste. I personally think it will be a great help to cache. How precisely it is done is probably not important. Good luck =)

        Ah, thanks. I see now how I misparsed your description.

        That is a better approach than the one I incorrectly thought you had originally described. Just to be clear, this (better) method also doesn't remove the race condition.

        And I still would rather not have the rush of multiple readers trying to repopulate the cache for a short period after each update. But then, I also don't have a different data structure for "display thread" separate from "update node" like you do.

        In your situation, I would prefer to have the redirect after the update be flagged as "please refresh the cache" so other readers aren't forced to hit the DB. But that presents two problems since the redirect is surely external. So, in the end, the simplest approach in your situation is your approach and I would end up using that or something very close to it.

        I personally think it will be a great help to cache. How precisely it is done is probably not important.

        We already cache and in more than one way. Just not in the way you propose.

        It seems like you might think that memcached will be a big performance win because it removes the need for the versions table. Well, I don't have to guess wildly, since I've looked into how resources are actually being used. I wouldn't remove the versions table: if memcached failed, the site would be left with no node cache at all and would slow to a crawl (there is little point in the site being up in that configuration). And I haven't observed the versions table to be much of a bottleneck.

        The "win" I see from memcached is, firstly, reducing the memory consumption of Apache children because they can more freely discard nodes from their per-process cache (we still need a per-process node cache as the site fundamentally works via nodes and one should try to not fetch the same node twice within a single page rendering). And I know that one of the biggest sources of "slow periods" is when one of the web servers runs too low on available memory.

        Secondly, memcached provides a much more efficient mechanism for each Apache process to get an updated node after an update is done. In (unrealistic) theory, a node might not need to be read from the DB more than once after each update.
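
        A rough sketch of that two-level lookup, per-process hash first, shared memcached second, DB last (all names here are hypothetical, not our real code):

            use Cache::Memcached;
            my $memd = Cache::Memcached->new({ servers => ["127.0.0.1:11211"] });

            # Per-process node cache, trimmed between page renders.
            my %node_cache;

            sub getnode {
                my ($node_id) = @_;

                # 1. Per-process: never fetch the same node twice in a render.
                return $node_cache{$node_id} if exists $node_cache{$node_id};

                # 2. Shared memcached: one DB read per update can serve
                #    every Apache child on every web server.
                my $node = $memd->get("node:$node_id");

                # 3. Fall back to the DB and repopulate both levels.
                if (!defined $node) {
                    $node = fetch_node_from_db($node_id);   # hypothetical
                    $memd->set("node:$node_id", $node, 180);
                }
                return $node_cache{$node_id} = $node;
            }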

        And memcached should make a big improvement in how well the site "scales". As is, the memory and DB throughput requirements incur a multiplier effect, which probably means that twice the traffic requires more than twice the memory and DB throughput. Memcached should make that closer to linear.

        Thanks for the discussion.

        - tye