spurperl has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,

I want to design a cache data structure. My requirements for it are:

* O(1) insertion and retrieval
* a limited size, with least-recently-used (LRU) removal when full

Note: this is for purely academic purposes - in order to learn. Thus I'm not interested in ready-made modules; I want to think out the design myself.

A simple hash would do for a cache of unlimited size. Limiting the size with LRU removal takes some more sweat.

Alongside the hash, I should probably also keep an array sorted by recency of use, which removes items (at the head) and inserts them (at the tail) in O(1). The only problem is when an item is referenced - I'd like to mark it as "recently used" by moving it to the tail of the list. First, this means splicing (taking an item out of the middle of an array); second, how do I find its position quickly?

The hash can hold a reference into the recency structure for quick lookup - it costs space, but caches are size-limited, so that's OK. As for the splicing, I can't think of an easy solution; maybe a linked list should be used instead of an array.
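
A rough sketch of the array-based idea, to make the problem concrete (a minimal sketch - the capacity and sub names are just illustrative, and the scan-plus-splice on every hit is exactly the O(N) step in question):

    my %cache;      # key => value
    my @recency;    # keys, least recently used first
    my $MAX = 4;    # illustrative capacity

    sub cache_get {
        my ($key) = @_;
        return undef unless exists $cache{$key};
        # Mark as recently used: scan, splice out, re-push at the tail.
        for my $i (0 .. $#recency) {
            if ($recency[$i] eq $key) {
                splice @recency, $i, 1;    # the O(N) step
                last;
            }
        }
        push @recency, $key;
        return $cache{$key};
    }

    sub cache_set {
        my ($key, $value) = @_;
        if (exists $cache{$key}) {
            cache_get($key);               # just refresh recency
        }
        else {
            if (@recency >= $MAX) {
                delete $cache{ shift @recency };   # evict LRU at the head
            }
            push @recency, $key;
        }
        $cache{$key} = $value;
    }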

What are your opinions?

Re: Designing a cache data structure
by Tanktalus (Canon) on Feb 20, 2005 at 20:15 UTC

    O(1) time for the first problem - insertion and retrieval - is no sweat in Perl. Pushing (at the tail) and shifting (at the head) of a normal array is likely as fast as it can be if you let Perl do it. However, marking something as "recently used" and moving it to the back of the line - that may be a bit more problematic with Perl's arrays.

    I think your idea of a linked list instead of an array for the LRU order is likely as good as it gets. In this scenario, the main cache hash would hold a reference to each node in the LRU list, so you can find a given node in constant time. Each node would link to the previous and next items in the list, so you can unlink the current node (by setting $obj->{prev}->{next} = $obj->{next}; $obj->{next}->{prev} = $obj->{prev}) and then tack it onto the end (you'll also need to keep references to the first and last nodes in the list to do this in constant time).

    Extending this just a bit: the value you store in the cache hash could itself be the LRU list node (which reduces the number of references by one).
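
    A minimal sketch of that layout, under some assumptions of mine (field names prev/next/key/val, head/tail kept in plain variables) - the hash values are the list nodes themselves:

        my %cache;          # key => node; the node doubles as the stored value
        my ($head, $tail);  # $head is least recently used, $tail most recent
        my $MAX = 4;        # illustrative capacity

        sub _unlink {
            my ($node) = @_;
            if ($node->{prev}) { $node->{prev}{next} = $node->{next} }
            else               { $head = $node->{next} }
            if ($node->{next}) { $node->{next}{prev} = $node->{prev} }
            else               { $tail = $node->{prev} }
            $node->{prev} = $node->{next} = undef;
        }

        sub _push_tail {
            my ($node) = @_;
            $node->{prev} = $tail;
            $tail->{next} = $node if $tail;
            $tail = $node;
            $head ||= $node;    # first node in an empty list
        }

        sub get {
            my ($key) = @_;
            my $node = $cache{$key} or return undef;
            _unlink($node);     # O(1) wherever it sits in the list
            _push_tail($node);  # now the most recently used
            return $node->{val};
        }

        sub set {
            my ($key, $val) = @_;
            if (my $node = $cache{$key}) {
                $node->{val} = $val;
                _unlink($node);
                _push_tail($node);
                return;
            }
            if (keys %cache >= $MAX) {
                my $lru = $head;                 # evict from the LRU end
                _unlink($lru);
                delete $cache{ $lru->{key} };
            }
            my $node = { key => $key, val => $val };
            _push_tail($node);
            $cache{$key} = $node;
        }

    Since _unlink severs a node's links before it is deleted from the hash, evicted nodes don't stay trapped in prev/next reference cycles.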

    Alternatively, you could take a cue from modern garbage-collection algorithms: discard all the linked-list overhead in memory and live with a slower LRU step. You get rid of a lot of data this way, but you also get rid of a lot of per-access overhead. Maintaining the list costs O(1) per access, but over Y accesses that adds up to O(Y) of bookkeeping. If instead you pay O(N) only when you need to find the LRU objects, and N is less than Y, you still come out ahead overall. That is, if you do more accessing than creating/inserting, you may actually be better off just having the cache store the last access time for each entry; then, when memory is low, sort on last access time to find the oldest entries, and discard until you have enough memory again.

    Update: Where exactly the balance between O(N) and O(Y) in the previous paragraph lies is very difficult to figure out in the borderline cases. In the extreme cases (almost all insert/create and no retrieval, vs. almost all retrieval and almost no insert/create), the direction to go is obvious. But I'm not going to pretend to be able to give advice on any case in between. I lean towards the slower LRU algorithm partly because simpler code usually means fewer defects, so it's probably better to have slower, correct code than faster, incorrect code. :-)
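
    A minimal sketch of this lazy variant, assuming Time::HiRes timestamps and an arbitrary policy of discarding the oldest quarter per eviction pass:

        use Time::HiRes qw(time);   # sub-second access times

        my %cache;       # key => [ $value, $last_access ]
        my $MAX = 1024;  # illustrative limit

        sub get {
            my ($key) = @_;
            my $slot = $cache{$key} or return undef;
            $slot->[1] = time;      # the only per-access bookkeeping
            return $slot->[0];
        }

        sub set {
            my ($key, $value) = @_;
            _evict() if !exists $cache{$key} && keys(%cache) >= $MAX;
            $cache{$key} = [ $value, time ];
        }

        sub _evict {
            # The O(N log N) step, paid only when the cache fills up:
            # sort oldest-first and discard the oldest quarter.
            my @oldest = sort { $cache{$a}[1] <=> $cache{$b}[1] } keys %cache;
            delete @cache{ @oldest[ 0 .. int($#oldest / 4) ] };
        }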

Re: Designing a cache data structure
by kvale (Monsignor) on Feb 20, 2005 at 20:00 UTC
    I think this will be a difficult enterprise, as access to both hashes and sorted arrays is O(log N). If the ordering is trivial, as in a stack or a queue, O(1) is possible if you ignore the time it takes to grow the arrays.

    -Mark

      As far as I know, Perl hashes provide O(1) access.
        This is untrue. As more elements get added to Perl hashes, more buckets get added and the chances of hash collisions increase. Both of these aspects lead to a logarithmic increase in the amount of work needed to access the average key-value pair. You could always predeclare a large hash or array to get rid of progressive bucket costs, but this only hides the logarithmic increase in a large constant multiplier; for sufficiently large N, it will start growing logarithmically again.

        The good news is that logarithmic growth is not bad at all, and people use large hashes and sort large arrays all the time without much problem.
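
        For what it's worth, the "predeclare" trick is done by assigning to keys() as an lvalue; a tiny sketch (the bucket count is arbitrary):

            my %cache;
            keys(%cache) = 10_000;          # preallocate buckets up front
            $cache{$_} = 1 for 1 .. 5_000;  # no bucket-doubling along the way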

        -Mark

Re: Designing a cache data structure (linked list)
by tye (Sage) on Feb 20, 2005 at 23:59 UTC

    You want a linked list. Lots of ways to do such. I'll be posting the one I wrote soon.

    - tye

      I dunno. I got pretty badly burned by splice when trying to compete using a heap, and that didn't even require managing references, just index math. How do you intend to set this up?

      Makeshifts last the longest.

        "O(1) Perl opcodes vs. splice" doesn't really compare to "O(log(N)) Perl opcodes vs. splice", especially when the constant multiplier for the O(1) is quite small.

        Splice might be efficient enough that the point where splice's O(N) nature finally overtakes the small constant factor comes at a fairly large N.

        But I consider trying to beat the simple O(1) linked list to be micro-optimization, and I don't plan to spend time futzing with benchmarks to see whether an O(N) solution is a bit faster for the cache sizes I predict (perhaps quite inaccurately, in the long run).
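
        A minimal Benchmark sketch of that trade-off (the list size and node layout are arbitrary assumptions, not the code I'll be posting):

            use Benchmark qw(cmpthese);

            my $N = 10_000;
            my @array = (1 .. $N);
            my @nodes = map { +{ id => $_ } } 0 .. $N - 1;
            for my $i (0 .. $N - 1) {
                $nodes[$i]{prev} = $nodes[$i - 1] if $i > 0;
                $nodes[$i]{next} = $nodes[$i + 1] if $i < $N - 1;
            }

            cmpthese(-2, {
                splice_mid => sub {
                    # O(N): remove from the middle, re-push at the tail
                    my ($x) = splice @array, $N / 2, 1;
                    push @array, $x;
                },
                unlink_mid => sub {
                    # O(1): a few pointer updates, then relink in place
                    my $node = $nodes[ $N / 2 ];
                    $node->{prev}{next} = $node->{next};
                    $node->{next}{prev} = $node->{prev};
                    $node->{prev}{next} = $node;
                    $node->{next}{prev} = $node;
                },
            });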

        - tye