Tanktalus has asked for the wisdom of the Perl Monks concerning the following question:

A co-worker and I are evaluating two different design paths for a certain part of our code. In an attempt to avoid bias (for or against me ;->), I'm going to present both styles without reference to whose idea each one is, and thus, who the advocate of each idea is. Any input from uninvolved parties is appreciated.

  1. Simple behaviour.

    In this style, which is already implemented and working, code to handle data from a single location (call this the "cache") is present in a number of places. This also includes code that handles duplication: e.g., when the names of two different pieces of data collide, we add an extra unique identifier while placing the data and strip the identifier on the way out. Note that a hash of arrays is inappropriate here - a hash of hashes is closer, since we only need a specific instance of this data at a time.

    Each use of this cache is implemented separately, and is only about 10-15 lines. So far, this is used in about 5 modules. Growth is possible. So far, only 2 uses of the cache potentially collide with themselves.

    Adding new uses is generally pretty simple. New possible collisions are not expected to collide with existing data, so they will only collide with themselves.

    # $name is somehow derived from the object value we're storing.
    # The details of that are (probably) unimportant.
    $name = $key_obj->derive_name($obj);

    # insert to cache:
    $cache{$name} = $obj->value();

    # insert to cache with possible collision:
    $cache{$key_obj->object_ID() . $name} = $obj->value();

    # remove from cache:
    $value = $cache{$name};

    # remove from cache with possible collision:
    $value = $cache{$key_obj->object_ID() . $name};

    Debugging is simple in that analysis of the %cache is trivial.

  2. Simple usage.

    In this style, code to handle the cache, including collisions, is abstracted to a central module. That module is very complex, and also very new. The uses of the module are cut down to one line (or 5-6 lines if you use generous amounts of whitespace).

    Data is abstracted, but the internal representation is made more complex, which, given how new the module is, may mean much more entertaining debugging. New possible collisions would be taken care of by this abstraction.

    This is implemented, too, but not committed to the version control system.

    # The name is keyed to the object we're using, so if we have two different
    # types of objects with the same ID, we can differentiate.

    # insert to cache:
    $cache->insert(tag => ref($key_obj) . $key_obj->object_ID(), data => $obj->value());

    # remove from cache:
    @values = $cache->retrieve(tag => ref($key_obj) . $key_obj->object_ID());

    Debugging is a bit messier here, since the $cache object's package abstracts everything away. Because of the backing storage (treat %cache above as a hash tie'd to the same backing storage, and pretend the tie is perfect), certain characters aren't allowed in the name (or tag here). The abstraction makes those characters usable by taking an MD5 of the tag and using that digest as the hash key, as in the sketch below.
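    Purely as illustration, a minimal sketch of that tag-hashing idea, assuming Digest::MD5 (core since Perl 5.8); the method and field names below are hypothetical stand-ins, not the real module's API:

    use Digest::MD5 qw(md5_hex);

    sub insert {
        my ($self, %args) = @_;              # tag => ..., data => ...
        # Hash the tag so characters the backing store forbids never reach it;
        # the hex digest is always [0-9a-f], which is safe.
        my $key = md5_hex($args{tag});
        $self->{store}{$key} = $args{data};
    }

    sub retrieve {
        my ($self, %args) = @_;
        my $key = md5_hex($args{tag});
        return $self->{store}{$key};
    }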

Note that the Cache::*Cache modules were evaluated and found lacking for this purpose. Also, a collision between unrelated pieces of code is pretty much impossible (if that were the case, we would have fundamental problems that are completely unrelated to Perl).

The question, then, is which way to go? What would other monks here do? Is there more information I can give (short of an actual implementation ;->) to help? The above code is not precisely the code, obviously, but a generic representation of both styles. This is why I ask that you treat the %cache in the first style as a perfectly implemented tie: in the real implementation, nothing can go wrong there, almost as if it were a real Perl hash.

Hopefully, whatever responses I get here will help convince one of us to go the other way ;-)

Replies are listed 'Best First'.
Re: Selecting one of two implementations
by merlyn (Sage) on Apr 25, 2005 at 15:05 UTC
    Note that the Cache::*Cache modules were evaluated and found lacking for this purpose.
    OK, I'll bite. What requirement did you leave out that disqualifies Cache::Cache and friends? That's immediately where I would have gone, given your requirements above. Well, maybe a thin wrapper that stashes an object and returns its stash handle if you were wanting to retrieve it uniquely.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      The underlying data are really files. Groups of files. The files do not contain interesting data; the data is the files. Files that could be hundreds of MB. Files that could be a couple KB. And both of these, together.

      To use Cache::FileCache, we would likely need to also use Archive::Tar to create the file that gets cached, and then to pull the files back out of the file cache. We have lots of memory - but not that much. Since Archive::Tar loads its data into memory, it's a bit cost-prohibitive.

      Would you use Cache::*Cache modules for, say, CPAN? Perhaps - because there are only ever two files per item of interest (module distribution): the tarball and the checksum, meaning you always have to query the cache for exactly two files, and one's name is dependent on the other. I have an unknown number of files associated together, some of which are themselves tarballs (but, again, I'm not interested in the contents). Some of these files are related in such a way that a mostly-simple regex could extrapolate one from the other. Others are not related in name, so in the first scenario each insert/retrieval must hardcode each name, while the second groups them together.

      In the first scenario, Cache::FileCache works nearly identically to the existing code (except that the existing code doesn't need to load 250MB tarballs into memory before writing them back out). The second scenario attempts to resolve this natural grouping inside the complexities of the module. That module accepts file handles as one of its data-input forms and writes them straight to the data store without, again, loading the entire file into memory, roughly as sketched below.
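      Purely as illustration, a minimal sketch of that streaming behaviour, assuming a hypothetical insert_fh() that copies a supplied file handle into a file-backed store in fixed-size chunks, so a 250MB tarball never has to sit in memory all at once:

      use Digest::MD5 qw(md5_hex);
      use File::Spec;

      sub insert_fh {
          my ($self, %args) = @_;            # tag => ..., fh => ...
          my $in   = $args{fh};
          my $dest = File::Spec->catfile($self->{dir}, md5_hex($args{tag}));
          open my $out, '>', $dest or die "open $dest: $!";
          binmode $in;
          binmode $out;
          my $buf;
          # Copy 64KB at a time instead of slurping the whole file.
          while (read($in, $buf, 64 * 1024)) {
              print {$out} $buf or die "write $dest: $!";
          }
          close $out or die "close $dest: $!";
      }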

        I don't understand. Are you making a cache, or a database?

        You keep using the word cache. What are you caching? A cache means you are saving one calculated result in an attempt to avoid recalculating it. A cache also means it can simply disappear, and that shouldn't affect any part of your design except to slow it down a bit. (In fact, your program should still work even if the cache were completely forgetful.)

        So, are you building a cache, or a database?

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

Re: Selecting one of two implementations
by dragonchild (Archbishop) on Apr 25, 2005 at 14:48 UTC
    The first scenario has one thing going for it - it works. It has one thing going against it - code duplication.

    The second scenario looks to be overcomplicated. It sounds like it's a glorified hash and the $key_obj is the guy that has to do all the work. Create a hierarchy that handles the various kinds of $key_obj's and you should be fine. I vote for Scenario 2.
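    A minimal sketch of that kind of hierarchy, using hypothetical class and field names; each key class knows how to render itself as a cache tag, so callers never build tags by hand:

    package Key::Base;
    sub new       { my ($class, %args) = @_; return bless {%args}, $class }
    sub cache_tag { my $self = shift; return ref($self) . $self->{id} }

    package Key::Build;
    our @ISA = ('Key::Base');
    # Builds may collide by name, so fold the name into the tag as well.
    sub cache_tag { my $self = shift; return ref($self) . $self->{id} . ':' . $self->{name} }

    package main;
    # $cache and $obj are assumed from the examples in the original post.
    my $key = Key::Build->new(id => 42, name => 'nightly');
    $cache->insert(tag => $key->cache_tag, data => $obj->value());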


    The Perfect is the Enemy of the Good.

Re: Selecting one of two implementations
by dave0 (Friar) on Apr 25, 2005 at 15:08 UTC
    In the second implementation, your API does seem a bit complex. Why not:
    $cache->insert( tag => $key_obj, data => $obj );
    @values = $cache->retrieve( tag => $key_obj );
    or even
    $cache->insert( $key_obj, $obj );
    @values = $cache->retrieve( $key_obj );
    and let the cache object deal with constructing your tags from $key_obj? Then the chances of making a mistake when constructing the tag are limited to your cache code internals, rather than all the callers. You'll also be able to change your tag structure in one place should it become necessary.
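    A minimal sketch of that interface, with hypothetical package and storage details; the point is simply that tag construction now lives in exactly one place:

    package MyCache;

    sub _tag_for {
        my ($self, $key_obj) = @_;
        # The only spot in the codebase that knows how tags are built.
        return ref($key_obj) . $key_obj->object_ID();
    }

    sub insert {
        my ($self, $key_obj, $obj) = @_;
        $self->{store}{ $self->_tag_for($key_obj) } = $obj->value();
    }

    sub retrieve {
        my ($self, $key_obj) = @_;
        return $self->{store}{ $self->_tag_for($key_obj) };
    }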

    Also, just curious here, but what did you find lacking in the Cache::* modules?

Re: Selecting one of two implementations
by davies (Monsignor) on Apr 25, 2005 at 15:23 UTC
    I'm far too new to Perl to know whether there is anything specific to drive you one way or the other, and generally I have worked with much higher level data (like taking spreadsheets and putting them into databases & getting them back again), so my comments should be viewed with this in mind. My experience is that repeating code is fine for throwaway projects, in that it can be far quicker to write for a few specific cases than for all general cases. But I have frequently found code that I thought was going to be thrown away called back into service for something slightly different, and in these situations, I have always found repetitious code to be the greatest evil. You always forget to modify one bit. Since your project sounds very much as though it will not be thrown away, I would advise going for the second approach.

    What really scares me is your comment that "growth is possible". If you write the code again in another module, the issue of collisions becomes much more serious, especially if two people end up writing modules at the same time that might cause collisions. Unless your project management guarantees that no more than one programmer can ever write code that reads the cache, you are almost certain to have an unhandled collision. Even if there will only ever be one programmer involved, the "layering" of collision management is likely to be fraught. You have five modules (let's call them the inner modules) at present, but say you write another two. Then another one. This eighth module has to negotiate the collision management of the two outer modules, then the inner modules. At the very least, I'd rather someone else wrote it! Even though new collisions are "not expected" to involve existing data, I wouldn't want to bet my little remaining sanity on it.

    The one exception I would make is if you are under time pressure. If you MUST have a solution by $DATE, then get it working as quickly as possible, and get it working WELL later.

    All the usual caveats, and probably a few unusual ones as well, should be taken as read.

    Regards,

    John Davies
Re: Selecting one of two implementations
by Transient (Hermit) on Apr 25, 2005 at 15:00 UTC
    The second solution is what I would go with, based mostly upon separation of concerns, and subject to the difficulty of implementation, the refactoring involved, and (most importantly) the budget.

    I tend to lean more towards OOP anyways, however, so be warned.
Re: Selecting one of two implementations
by salva (Canon) on Apr 25, 2005 at 15:19 UTC
    In the first option you are replicating code, which is a bad thing. On the other hand, with the second option things are getting too complex, so my question is: have you gone too far trying to abstract things?

    Maybe you can find an intermediate way that removes most of the code duplication of the first solution without trying to centralize all the intelligence in a single complex component. Could you separate the collision-handling code into an independent module and combine it with the cache one when required?
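    A minimal sketch of that split, with a hypothetical helper module; the plain-hash style of the first option stays as-is, and only the potentially colliding uses pull in the collision code:

    package CollisionKey;

    sub make_key {
        my ($class, $key_obj, $name) = @_;
        # Qualify the name with the owning object's ID only where collisions
        # are actually possible.
        return $key_obj->object_ID() . ':' . $name;
    }

    package main;
    # %cache, $obj, $key_obj and $name are assumed from the first example.

    # Non-colliding uses keep the simple style:
    $cache{$name} = $obj->value();

    # Potentially colliding uses route the key through the helper:
    $cache{ CollisionKey->make_key($key_obj, $name) } = $obj->value();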