demerphq has asked for the wisdom of the Perl Monks concerning the following question:

I've been thinking about an aspect of PerlMonks internals and I'm wondering what people think. I'm not posting to PMD because I consider this more of an abstract programming question than just an internal PM issue.

We use a caching mechanism on PM. Instead of refetching nodes every time, if we have a node cached we just check its version number in a version table. Any time a node gets updated, its version entry is bumped up by one. This means two things: we have to fetch the version number every time we use a node, and over time the size of the version table is pretty well 1:1 with the node table.
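In outline, the read path looks something like this (a simplified sketch with made-up table and sub names, not the actual PM code):

    # Sketch of the version-checked cache read (hypothetical names).
    our %CACHE;    # per-process node cache: node_id => { version, node }

    sub get_node {
        my ($dbh, $node_id) = @_;
        # this SELECT runs on *every* node access -- the hot query
        my ($version) = $dbh->selectrow_array(
            'SELECT version FROM version WHERE node_id = ?', undef, $node_id);
        my $hit = $CACHE{$node_id};
        return $hit->{node} if $hit && $hit->{version} == $version;  # still fresh
        my $node = fetch_node($dbh, $node_id);       # stale or missing: refetch
        $CACHE{$node_id} = { version => $version, node => $node };
        return $node;
    }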

My point is this: version table fetches are the single most common query we do, so it would seem that optimising them as much as possible would be a well-rewarded task.

What I'm thinking of is a strategy to keep the table size small. My idea is that every time an update is done to the version table, a random check would be made, say rand(10000)<1, which if it passed would result in some portion of the version records, maybe 1% or so, being deleted. Ideally these would be the oldest records in the table, which would involve adding a datestamp to the table and an index on the dates. Essentially my contention is that we should see fetch times drop quite a bit if we could keep the version table an order of magnitude or two smaller, especially if it only contained nodes that are regularly used and updated (which in fact is typically User nodes).
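In pseudo-Perl the whole idea is only a few lines (a sketch with illustrative names; assumes the new stamp column, and MySQL 4.0's DELETE ... ORDER BY ... LIMIT for the "oldest first" part):

    # Sketch of the randomized purge, piggybacked on the version bump.
    sub bump_version {
        my ($dbh, $node_id) = @_;
        $dbh->do('UPDATE version SET version = version + 1, stamp = NOW()
                  WHERE node_id = ?', undef, $node_id);
        return unless rand(10_000) < 1;            # ~1 bump in 10,000 purges
        my ($rows) = $dbh->selectrow_array('SELECT COUNT(*) FROM version');
        my $doomed = int($rows / 100) or return;   # the oldest ~1% of records
        $dbh->do("DELETE FROM version ORDER BY stamp LIMIT $doomed");
    }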

Of course it's also possible that an idea like this would slow things down overall. Index maintenance on the added field, and on the deletes, may be more expensive than the time gained by keeping the indexes small. Cache misses would probably increase marginally, so possibly the price there would outweigh the gain. The problem is that without actually implementing the solution I have no way to know.

What I'm wondering is: do those of you with better database/MySQL experience and knowledge than me think this is a worthy area of investigation, or do you think that overall it's a blind alley?

---
demerphq


Re: Randomization as a cache clearing mechanism
by dragonchild (Archbishop) on Nov 19, 2004 at 17:48 UTC
    Since you're using MySQL, why don't you use MySQL's query caching mechanism? That's my first question.

    The second question is: why do you have a separate version table from the nodes table? You have to do two queries anyway ... why not just query the nodes table twice? With an index on (node_id, version) and a query of "SELECT version FROM nodes WHERE node_id = ?", you don't even hit the table, because all the info needed is in the index. This is an extremely fast query, made even faster with a query cache.
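    A sketch of what that looks like (assuming the nodes table really does carry node_id and version columns):

        -- Sketch: with a covering index the version lookup is satisfied
        -- entirely from the index pages, never touching the table rows.
        ALTER TABLE nodes ADD INDEX node_version (node_id, version);
        SELECT version FROM nodes WHERE node_id = ?;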

    Third - would it be possible to cache usernodes separately from thread nodes? I understand that the Everything engine likes to be agnostic about what a node is and does, but if you can cache usernodes separately from thread nodes, then you take a very slight performance hit from having to find the right cache, but each cache thrashes less.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      MySQL's query cache works differently than you expect, or you did not think that through. It keys the cache off the select string. On queries that do not repeat often there is up to a 13% overhead for the cache code (housekeeping). Because they do hundreds of queries per page load, the cache would fill with low-hit-rate data and just be a liability. MySQL's query cache starts to make sense when you have a complex query that thrashes and only a few thousand (up to roughly forty thousand) select variants that are executed often. It can also be noted that they even state that the theoretical speedup (given ~100% cache hits) is only around 220%. You can tweak the cache size with query_cache_size (and the per-result cap with query_cache_limit), but mind you, as you increase the size the overhead goes up -- the worst case gets worse, and even the best case gets worse.


      -Waswas
        I'm very well aware of how MySQL's query caching works - I built a reporting system around it. With a good 10-50M cache, you can do a whole heck of a lot of good. The important things about it are:
        1. You can just turn it on - there are no code or schema changes. This makes it easy to benchmark.
        2. It handles all the aging for you in very optimized C. You say there's a 13% overhead. I'm guessing that demerphq's plans may end up with a 50-150% overhead in the average case.

        In addition, I'm going to guess that there's going to be a much higher gain than you might think. Many of the hits, I'm guessing, have to do with invariants - monk data, nodelet data, and the like. I know I do at least 400 pageviews/day on this site, and I'm a low-hit regular. If half the queries for just the regulars get to be cached, then that's at least a 13% savings right there.

        So, yes, it does work as I think and I did think it through. It's not the ideal solution, but it's definitely a quick-hit easy one, as well as easy to verify - just turn it on for a week and see how performance plays. If it doesn't work, then turn it off. No harm, no foul.
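        For reference, "just turn it on" amounts to a my.cnf change along these lines (a sketch using the MySQL 4.x variable names; worth checking against the server's docs):

            # my.cnf sketch: enable the query cache server-wide
            [mysqld]
            query_cache_type  = 1      # cache all cacheable SELECTs
            query_cache_size  = 32M    # total memory reserved for cached results
            query_cache_limit = 1M     # skip any single result bigger than this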

        Being right, does not endow the right to be rude; politeness costs nothing.
        Being unknowing, is not the same as being stupid.
        Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
        Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Randomization as a cache clearing mechanism
by radiantmatrix (Parson) on Nov 19, 2004 at 17:53 UTC

    Speaking from a general DB design point of view, I think you have the right idea, but I would go about it in a different manner. Of course, I'm not a master of DB design, so I'm happy to be corrected.

    What I would do is add a datetime stamp and a "node type" column (if the latter doesn't already exist). I would then create a separate table of node_types and what their maximum ages should be. Then, I'd run a cron job that deletes any cache record older than the maximum age for its node type.
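    As a sketch (hypothetical schema: a stamp datetime on the version table, and a node_types table carrying a max_age in seconds; uses MySQL 4.0's multi-table DELETE), the cron job boils down to one statement:

        #!/usr/bin/perl
        # Hypothetical cron script: expire cache rows per their node type's max age.
        use strict;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:perlmonks', 'user', 'pass',
                               { RaiseError => 1 });
        $dbh->do(q{
            DELETE v FROM version AS v
            JOIN node_types AS t ON t.node_type = v.node_type
            WHERE v.stamp < NOW() - INTERVAL t.max_age SECOND
        });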

    Alternatively, you could disable caching on User nodes altogether, and those viewing User nodes would just have to wait a bit -- but that's assuming User nodes are visited rarely enough that there wouldn't be a significant load savings in caching them over generating them on-the-fly.

    Of course, all of this might be complete bunk because I haven't examined the PM code, and so don't know how it all internally works. But, generally, one would aim for predictable reduction in the best places rather than random reduction of oldest records.


    radiantmatrix
    require General::Disclaimer;

Re: Randomization as a cache clearing mechanism
by mpeppler (Vicar) on Nov 19, 2004 at 17:47 UTC
    I don't know MySQL well at all, so my comments may be off-base.

    First off, things really depend on the table schema and the way the table is used. From my experience with Sybase I'd say that purging the table might not gain you much in terms of fetch times, and the deletes might actually take up more resources than you're likely to save.

    I did a check recently for a fairly large table (15+ million rows), and when a fetch is done based on a unique index the number of IO operations is quite small (4 IOs or so, IIRC). I don't know if you can get this sort of statistic from MySQL, or how large this table is, but I'd check that sort of thing before doing anything else.
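    For what it's worth, MySQL's EXPLAIN gives at least part of that picture -- it shows whether the lookup is resolved through the index (a sketch; the table name is assumed):

        mysql> EXPLAIN SELECT version FROM version WHERE node_id = 12345;
        -- a healthy unique-index lookup shows up as type=const (or eq_ref), rows=1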

    I also don't know if MySQL generates any sort of locking contention when you start deleting a (comparatively) large number of rows - that's another thing that you should consider.

    Michael (who has to rush off to dinner now... :-)

      I also don't know if MySQL generates any sort of locking contention when you start deleting a (comparatively) large number of rows - that's another thing that you should consider.

      Depending on the table type used, it can be a big or small issue. MyISAM is designed for sequential adds, few deletes, and a bazillion reads. If you start deleting bunches of stuff, you can actually prevent reads from happening. InnoDB is better about that, but has its own overhead issues due to ACID compliance and row-level locking.
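      If delete-induced blocking does show up, testing the InnoDB hypothesis is a one-statement experiment (MySQL 4.0 spells it TYPE; ENGINE is the newer synonym):

          ALTER TABLE version TYPE = InnoDB;   -- and TYPE = MyISAM to switch back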

      Being right, does not endow the right to be rude; politeness costs nothing.
      Being unknowing, is not the same as being stupid.
      Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
      Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Randomization as a cache clearing mechanism
by kappa (Chaplain) on Nov 19, 2004 at 19:03 UTC

    You see, the fetch operation is O(log n). What that means is that for sufficiently big tables it (almost) DOES NOT DEPEND on size :) log(2000000) - log(1000000) = 1

    That single fetch should be very fast on indexed tables. I'd suggest profiling before looking for places to optimize.

      I'm more concerned with 400k versus say 20k records. :-) And yes, I know the fetch is O(log n). Also I suspect you will find that the log involved is not base 10 but rather base 2.

      E:\.cpan\build\Module-Build-0.2602>perl -e"print log 2_000_000 - log 1_000_000"
      14.5086508307451

      So I'm still wondering if I should expect to see a speedup on this query. I mean, we execute it hundreds of times per page fetch.

      ---
      demerphq

        Oh, sorry for the confusion. Actually, the log involved in your code is a natural log (base e), and the log involved in the discussion is of course base 2. Log base 10 is rarely relevant in anything but counting decimal digits :) Sorry once again; that was the reason I did not write the formula in Perl.

        400k vs. 20k means 19 vs. 15 seeks in the index (or steps down the binary tree). Do you think it will be worth it?

        What would be more interesting is to hear about the whole caching infrastructure here on PM. Hundreds of cache checks per web page sounds like a symptom of something rather strange happening. I'd be very tempted to employ memcached, implement cache-check pipelining, and even try to eliminate the checks altogether. If we invalidate the cache for a single node when the node is updated, we won't need versions or checks at all.

        P.S. Your code does not calculate what you think it does. The output equals log(2000000 - log 1000000).
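        Parenthesized, and with the logs converted to base 2, it comes out the way the argument needs (this should print 1, give or take floating-point rounding):

            perl -e"print( (log(2_000_000) - log(1_000_000)) / log(2) )"
            1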

Re: Randomization as a cache clearing mechanism
by perrin (Chancellor) on Nov 19, 2004 at 19:18 UTC
    As others mentioned, table size is not very relevant if there is an index. I would also add that in situations where the size of your data does matter, random purges done from a web app are not a very good approach. The right way to do it is a cron job, preferably set to run once per day at an off-peak time.
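    For example, a crontab entry along these lines (script name hypothetical):

        # crontab sketch: purge stale version rows at 04:30, off-peak
        30 4 * * * /usr/local/bin/purge_version_table.pl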
Re: Randomization as a cache clearing mechanism
by ryantate (Friar) on Nov 20, 2004 at 23:52 UTC

    Your question is interesting to me because I'm thinking of implementing a similar caching system. As such, I'm curious why you decided not to just clear a node out of the cache when doing any operation that would make the cache out of date, for example updates to the node. Then you wouldn't have to keep any special record in the database.

    My guess is that you have so many potentially cache-invalidating operations that you didn't want to clear the cache on each of these operations because of the maintenance headache. So you've built a sort of event/trigger mechanism using a database table. Now the code that updates the database doesn't have to know about the code that maintains the cache.

    My plan is to do all the cache management at the control layer -- e.g. the script that processes a new message will clear the cached message index -- but I'm worried this won't scale beyond a few scripts, and I'll need to go with a more formal event-driven system like yours.
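    In outline, that control-layer approach is just (a sketch; the cache API is hypothetical, in the style of Cache::Cache, and save_message is made up):

        # Sketch: the code path that writes the data also clears what it stales.
        sub post_new_message {
            my ($dbh, $cache, %msg) = @_;
            save_message($dbh, %msg);                 # the actual write
            $cache->remove('message_index');          # views invalidated by the write
            $cache->remove("thread:$msg{thread_id}");
        }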

    Interesting problem, sorry I don't have any direct feedback.


    Update: Answer is here
Re: Randomization as a cache clearing mechanism
by Anonymous Monk on Nov 22, 2004 at 17:24 UTC
    Your question is really about where to introduce the mechanism that starts the cache-cleaning process. That is different from what you can do to make fewer fetches to the version table.

    I don't think the random call is good, since it won't guarantee that you clean the cache when it's big and when you have CPU to spare. I think you should take a look at garbage collection mechanisms and memory defragmentation processes, which run when memory usage is high and CPU usage is low. So, as advice, I think you should keep some track of the size of your table and of the requests to your site, and choose a parameter that triggers your cleaning process.

    Now, about the fetches for updates: on the Perl side, where you hold a cache of nodes, you can reduce the fetches (the test for a new version) using the idea that a node gets updated less often as it gets older. So, you could add a time window to reduce the checks for new versions. For example, if a node is more than 30 minutes old, check for updates only if the last check was more than 3 minutes ago; if it's less than 30 minutes old, always check; if it's a day old, check every 5 minutes. You need to study this and find the best parameters.
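    A sketch of that age-scaled check, using the thresholds suggested above (they would need tuning):

        # Sketch: skip the version check when a node is old and was checked recently.
        sub needs_version_check {
            my ($node_age, $since_last_check) = @_;   # both in seconds
            return 1 if $node_age < 30 * 60;          # under 30 min old: always check
            my $interval = $node_age >= 24 * 60 * 60
                         ? 5 * 60                     # a day old: every 5 minutes
                         : 3 * 60;                    # in between: every 3 minutes
            return $since_last_check >= $interval;
        }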

    You could also invert the logic of your cache and version system. Right now you check each item in your cache for updates when you use it. You could do the inverse by adding a mechanism that runs every 10 seconds or so and deletes the items that are out of date: the idea is to build a list of what needs to be removed from the cache and propagate that to all your processes.
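    In outline (hypothetical table and structures), the sweeper looks like:

        # Sketch of the inverted scheme: a periodic sweep evicts outdated items,
        # so individual reads never check versions at all.
        sub sweep_cache {
            my ($dbh, $cache) = @_;   # $cache: the per-process node cache hashref
            my $stale = $dbh->selectcol_arrayref(q{
                SELECT node_id FROM version
                WHERE stamp > NOW() - INTERVAL 10 SECOND
            });
            delete @{$cache}{@$stale};   # evict; the next use refetches fresh copies
        }
        # ... scheduled every 10 seconds in each server process, e.g. via $SIG{ALRM}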

    By gmpassos