demerphq has asked for the wisdom of the Perl Monks concerning the following question:

I've been thinking about an aspect of PerlMonks internals and I'm wondering what people think. I'm not posting to PMD because I consider this more of an abstract programming question than just an internal PM issue.

We use a caching mechanism on PM. Instead of refetching nodes every time, if we have a node cached we just check its version number in a version table. Any time a node gets updated, its version entry is bumped up by one. This means two things: we have to fetch the version number every time we use a node, and over time the size of the version table is pretty well 1:1 with the node table.
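In outline, the read path looks something like this (a simplified sketch with made-up table and sub names, not the actual PM code):

    # Sketch of the version-checked cache read (hypothetical names).
    our %CACHE;    # per-process node cache: node_id => { version, node }

    sub get_node {
        my ($dbh, $node_id) = @_;
        # this SELECT runs on *every* node access -- the hot query
        my ($version) = $dbh->selectrow_array(
            'SELECT version FROM version WHERE node_id = ?', undef, $node_id);
        my $hit = $CACHE{$node_id};
        return $hit->{node} if $hit && $hit->{version} == $version;  # still fresh
        my $node = fetch_node($dbh, $node_id);       # stale or missing: refetch
        $CACHE{$node_id} = { version => $version, node => $node };
        return $node;
    }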

My point is this: version table fetches are the single most common query we do, so it would seem that optimising them as much as possible would be a well-rewarded task.

What I'm thinking of is a strategy to keep the table size small. My idea is that every time an update is done to the version table, a random check would be made, say rand(10000)<1, which if it passed would result in some portion of the version records, maybe 1% or so, being deleted. Ideally these would be the oldest records in the table, which would involve adding a datestamp to the table and an index on the dates. Essentially my contention is that we should see fetch times drop quite a bit if we could keep the version table an order of magnitude or two smaller, especially if it only contained nodes that are regularly used and updated (which in fact is typically User nodes).
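In pseudo-Perl the whole idea is only a few lines (a sketch with illustrative names; assumes the new stamp column, and MySQL 4.0's DELETE ... ORDER BY ... LIMIT for the "oldest first" part):

    # Sketch of the randomized purge, piggybacked on the version bump.
    sub bump_version {
        my ($dbh, $node_id) = @_;
        $dbh->do('UPDATE version SET version = version + 1, stamp = NOW()
                  WHERE node_id = ?', undef, $node_id);
        return unless rand(10_000) < 1;            # ~1 bump in 10,000 purges
        my ($rows) = $dbh->selectrow_array('SELECT COUNT(*) FROM version');
        my $doomed = int($rows / 100) or return;   # the oldest ~1% of records
        $dbh->do("DELETE FROM version ORDER BY stamp LIMIT $doomed");
    }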

Of course it's also possible that an idea like this would slow things down overall. Index maintenance on the added field, and on the deletes, may be more expensive than the time gained by keeping the indexes small. Cache misses would probably increase marginally, so possibly the price there would outweigh the gain. The problem is that without actually implementing the solution I have no way to know.

What I'm wondering is: do those of you with better database/MySQL experience and knowledge than me think this is a worthy area of investigation, or do you think that overall it's a blind alley?

---
demerphq


Re: Randomization as a cache clearing mechanism
by dragonchild (Archbishop) on Nov 19, 2004 at 17:48 UTC
    Since you're using MySQL, why don't you use MySQL's query caching mechanism? That's my first question.

    The second question is: why do you have a separate version table from the nodes table? You have to do two queries anyway ... why not just query the nodes table twice? With an index on (node_id, version) and a query of "SELECT version FROM nodes WHERE node_id = ?", you don't even hit the table, because all the info needed is in the index. This is an extremely fast query, made even faster with a query cache.
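    A sketch of what that looks like (assuming the nodes table really does carry node_id and version columns):

        -- Sketch: with a covering index the version lookup is satisfied
        -- entirely from the index pages, never touching the table rows.
        ALTER TABLE nodes ADD INDEX node_version (node_id, version);
        SELECT version FROM nodes WHERE node_id = ?;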

    Third - would it be possible to cache usernodes separately from thread nodes? I understand that the Everything engine likes to be agnostic about what a node is and does, but if you can cache usernodes separately from thread nodes, then you take a very slight performance hit from having to find the right cache, but each cache thrashes less.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      MySQL's query cache works differently than you expect, or you did not think that through. It keys the cache off the select string. On queries that do not repeat often there is up to a 13% overhead for the cache code (housekeeping). Because they do hundreds of queries per page load, the cache would fill with low-hit-rate data and just be a liability. MySQL's query cache starts to make sense when you have a complex query that thrashes and only a few thousand (up to roughly forty thousand) select variants that are executed often. It can also be noted that they even state that the theoretical speedup (given ~100% cache hits) is only around 220%. You can tweak the cache size with query_cache_size (and the per-result cap with query_cache_limit), but mind you, as you increase the size the overhead goes up -- the worst case gets worse, and even the best case gets worse.


      -Waswas
        I'm very well aware of how MySQL's query caching works - I built a reporting system around it. With a good 10-50M cache, you can do a whole heck of a lot of good. The important things about it are:
        1. You can just turn it on - there are no code or schema changes. This makes it easy to benchmark.
        2. It handles all the aging for you in very optimized C. You say there's a 13% overhead. I'm guessing that demerphq's plans may end up with a 50-150% overhead in the average case.

        In addition, I'm going to guess that there's going to be a much higher gain than you might think. Many of the hits, I'm guessing, have to do with invariants - monk data, nodelet data, and the like. I know I do at least 400 pageviews/day on this site, and I'm a low-hit regular. If half the queries for just the regulars get to be cached, then that's at least a 13% savings right there.

        So, yes, it does work as I think and I did think it through. It's not the ideal solution, but it's definitely a quick-hit easy one, as well as easy to verify - just turn it on for a week and see how performance plays. If it doesn't work, then turn it off. No harm, no foul.
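        For reference, "just turn it on" amounts to a my.cnf change along these lines (a sketch using the MySQL 4.x variable names; worth checking against the server's docs):

            # my.cnf sketch: enable the query cache server-wide
            [mysqld]
            query_cache_type  = 1      # cache all cacheable SELECTs
            query_cache_size  = 32M    # total memory reserved for cached results
            query_cache_limit = 1M     # skip any single result bigger than this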

        Being right, does not endow the right to be rude; politeness costs nothing.
        Being unknowing, is not the same as being stupid.
        Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
        Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Randomization as a cache clearing mechanism
by radiantmatrix (Parson) on Nov 19, 2004 at 17:53 UTC

    Speaking from a general DB design point of view, I think you have the right idea, but I would go about it in a different manner. Of course, I'm not a master of DB design, so I'm happy to be corrected.

    What I would do is add a datetime stamp and a "node type" column (if the latter doesn't already exist). I would then create a separate table of node_types and what their maximum ages should be. Then, I'd run a cron job that deletes any cache record older than the maximum age for its node type.
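    As a sketch (hypothetical schema: a stamp datetime on the version table, and a node_types table carrying a max_age in seconds; uses MySQL 4.0's multi-table DELETE), the cron job boils down to one statement:

        #!/usr/bin/perl
        # Hypothetical cron script: expire cache rows per their node type's max age.
        use strict;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:perlmonks', 'user', 'pass',
                               { RaiseError => 1 });
        $dbh->do(q{
            DELETE v FROM version AS v
            JOIN node_types AS t ON t.node_type = v.node_type
            WHERE v.stamp < NOW() - INTERVAL t.max_age SECOND
        });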

    Alternatively, you could disable caching on User nodes altogether, and those viewing User nodes would just have to wait a bit -- but that's assuming User nodes are visited rarely enough that there wouldn't be a significant load savings in caching them over generating them on-the-fly.

    Of course, all of this might be complete bunk because I haven't examined the PM code, and so don't know how it all internally works. But, generally, one would aim for predictable reduction in the best places rather than random reduction of oldest records.


    radiantmatrix
    require General::Disclaimer;

Re: Randomization as a cache clearing mechanism
by mpeppler (Vicar) on Nov 19, 2004 at 17:47 UTC
    I don't know MySQL well at all, so my comments may be off-base.

    First off, things really depend on the table schema and the way the table is used. From my experience with Sybase I'd say that purging the table might not gain you much in terms of fetch times, and the deletes might actually take up more resources than you're likely to save.

    I did a check recently for a fairly large table (15+ million rows), and when a fetch is done based on a unique index the number of IO operations is quite small (4 IOs or so, IIRC). I don't know if you can get this sort of statistic from MySQL, or how large this table is, but I'd check that sort of thing before doing anything else.
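    For what it's worth, MySQL's EXPLAIN gives at least part of that picture -- it shows whether the lookup is resolved through the index (a sketch; the table name is assumed):

        mysql> EXPLAIN SELECT version FROM version WHERE node_id = 12345;
        -- a healthy unique-index lookup shows up as type=const (or eq_ref), rows=1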

    I also don't know if MySQL generates any sort of locking contention when you start deleting a (comparatively) large number of rows - that's another thing that you should consider.

    Michael (who has to rush off to dinner now... :-)

      I also don't know if MySQL generates any sort of locking contention when you start deleting a (comparatively) large number of rows - that's another thing that you should consider.

      Depending on the table type used, it can be a big or small issue. MyISAM is designed for sequential adds, few deletes, and a bazillion reads. If you start deleting bunches of stuff, you can actually prevent reads from happening. InnoDB is better about that, but has its own overhead issues due to ACID compliance and row-level locking.
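      If delete-induced blocking does show up, testing the InnoDB hypothesis is a one-statement experiment (MySQL 4.0 spells it TYPE; ENGINE is the newer synonym):

          ALTER TABLE version TYPE = InnoDB;   -- and TYPE = MyISAM to switch back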

      Being right, does not endow the right to be rude; politeness costs nothing.
      Being unknowing, is not the same as being stupid.
      Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
      Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Randomization as a cache clearing mechanism
by kappa (Chaplain) on Nov 19, 2004 at 19:03 UTC

    You see, the fetch operation is O(log n). What that means is that for sufficiently big tables it (almost) DOES NOT DEPEND on size :) log(2000000) - log(1000000) = 1

    That single fetch should be very fast on indexed tables. I'd suggest profiling before looking for places to optimize.

      I'm more concerned with 400k versus say 20k records. :-) And yes, I know the fetch is O(log n). Also I suspect you will find that the log involved is not base 10 but rather base 2.

      E:\.cpan\build\Module-Build-0.2602>perl -e"print log 2_000_000 - log 1_000_000"
      14.5086508307451

      So I'm still wondering if I should expect to see a speedup on this query. I mean, we execute it hundreds of times per page fetch.

      ---
      demerphq

        Oh, sorry for the confusion. Actually, the log involved in your code is a natural log (base e), and the log involved in the discussion is of course base 2. Log base 10 is rarely relevant in anything but counting decimal digits :) Sorry once again; that was the reason I did not write the formula in Perl.

        400k vs. 20k means 19 vs. 15 seeks in the index (or steps down the binary tree). Do you think it will be worth it?

        What would be more interesting is to hear about the whole caching infrastructure here on PM. Hundreds of cache checks per web page sounds like a symptom of something rather strange happening. I'd be very tempted to employ memcached, implement cache-check pipelining, and even try to eliminate the checks altogether. If we invalidate the cache for a single node when the node is updated, we won't need versions or checks at all.

        P.S. Your code does not calculate what you think it does. The output equals log(2000000 - log 1000000).
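        Parenthesized, and with the logs converted to base 2, it comes out the way the argument needs (this should print 1, give or take floating-point rounding):

            perl -e"print( (log(2_000_000) - log(1_000_000)) / log(2) )"
            1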

Re: Randomization as a cache clearing mechanism
by perrin (Chancellor) on Nov 19, 2004 at 19:18 UTC
    As others mentioned, table size is not very relevant if there is an index. I would also add that in situations where the size of your data does matter, random purges done from a web app are not a very good approach. The right way to do it is a cron job, preferably set to run once per day at an off-peak time.
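    For example, a crontab entry along these lines (script name hypothetical):

        # crontab sketch: purge stale version rows at 04:30, off-peak
        30 4 * * * /usr/local/bin/purge_version_table.pl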
Re: Randomization as a cache clearing mechanism
by ryantate (Friar) on Nov 20, 2004 at 23:52 UTC

    Your question is interesting to me because I'm thinking of implementing a similar caching system. As such, I'm curious why you decided not to just clear a node out of the cache when doing any operation that would make the cache out of date, for example updates to the node. Then you wouldn't have to keep any special record in the database.

    My guess is that you have so many potentially cache-invalidating operations that you didn't want to clear the cache on each of these operations because of the maintenance headache. So you've built a sort of event/trigger mechanism using a database table. Now the code that updates the database doesn't have to know about the code that maintains the cache.

    My plan is to do all the cache management at the control layer -- e.g. the script that processes a new message will clear the cached message index -- but I'm worried this won't scale beyond a few scripts, and I'll need to go with a more formal event-driven system like yours.
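    In outline, that control-layer approach is just (a sketch; the cache API is hypothetical, in the style of Cache::Cache, and save_message is made up):

        # Sketch: the code path that writes the data also clears what it stales.
        sub post_new_message {
            my ($dbh, $cache, %msg) = @_;
            save_message($dbh, %msg);                 # the actual write
            $cache->remove('message_index');          # views invalidated by the write
            $cache->remove("thread:$msg{thread_id}");
        }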

    Interesting problem, sorry I don't have any direct feedback.


    Update: Answer is here
Re: Randomization as a cache clearing mechanism
by Anonymous Monk on Nov 22, 2004 at 17:24 UTC
    Your question is really about where to introduce the mechanism that starts the cache-cleaning process. That is different from what you can do to make fewer fetches to the version table.

    I don't think the random call is good, since it won't guarantee that you clean the cache when it's big and when you have CPU to spare. I think you should take a look at garbage collection mechanisms and memory defragmentation processes, which run when memory usage is high and CPU usage is low. So, as advice, I think you should keep some track of the size of your table and of the requests to your site, and choose a parameter that triggers your cleaning process.

    Now, about the fetches for updates: on the Perl side, where you hold a cache of nodes, you can reduce the fetches (the test for a new version) using the idea that a node gets updated less often as it gets older. So, you could add a time window to reduce the checks for new versions. For example, if a node is more than 30 minutes old, check for updates only if the last check was more than 3 minutes ago; if it's less than 30 minutes old, always check; if it's a day old, check every 5 minutes. You need to study this and find the best parameters.
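    A sketch of that age-scaled check, using the thresholds suggested above (they would need tuning):

        # Sketch: skip the version check when a node is old and was checked recently.
        sub needs_version_check {
            my ($node_age, $since_last_check) = @_;   # both in seconds
            return 1 if $node_age < 30 * 60;          # under 30 min old: always check
            my $interval = $node_age >= 24 * 60 * 60
                         ? 5 * 60                     # a day old: every 5 minutes
                         : 3 * 60;                    # in between: every 3 minutes
            return $since_last_check >= $interval;
        }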

    You could also invert the logic of your cache and version system. Right now you check each item in your cache for updates when you use it. You could do the inverse by adding a mechanism that runs every 10 seconds or so and deletes the items that are out of date: the idea is to build a list of what needs to be removed from the cache and propagate that to all your processes.
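    In outline (hypothetical table and structures), the sweeper looks like:

        # Sketch of the inverted scheme: a periodic sweep evicts outdated items,
        # so individual reads never check versions at all.
        sub sweep_cache {
            my ($dbh, $cache) = @_;   # $cache: the per-process node cache hashref
            my $stale = $dbh->selectcol_arrayref(q{
                SELECT node_id FROM version
                WHERE stamp > NOW() - INTERVAL 10 SECOND
            });
            delete @{$cache}{@$stale};   # evict; the next use refetches fresh copies
        }
        # ... scheduled every 10 seconds in each server process, e.g. via $SIG{ALRM}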

    By gmpassos