bot403 has asked for the wisdom of the Perl Monks concerning the following question:

Just looking for a little advice before I try to re-implement the wheel.

I'm looking for an interface/module/method that will help me manage a few hundred potentially large files (1MB to 200GB) as a "cache". These files are computationally very expensive to create but can easily be re-made, so we're looking to cache results on the filesystem. If they get destroyed because the cache had to purge them, we'll just re-create them.

In a nutshell I want to tie a set of files to a caching algorithm.

I've looked into CHI, Cache::Cache and Cache::File but they seem to be missing a few of my requirements.
1. I want to tell the cache what the object size is and have it free up that space for me if possible.
2. I want to store the object myself with my own filename. Processes that can't talk to Perl need to be able to find the files by their filename.
3. I don't care about cache speed. Get/Set requests will be measured in the dozens per hour.

What I really want is a size-aware cache where I can say:

1. I want to store a 100GB file called Foofile.dat in the cache. Is there room?
2. No? Then please free up 100GB for me by removing files according to LRU or some other cache policy.
3. Is there room now? Then allocate a 100GB block of space in the cache where Foofile.dat will go.
This 100GB is gone and can't be used by another process unless it's purged.
4. Kick off another process that writes Foofile.dat.
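For concreteness, here's a minimal sketch of that wished-for API in plain Perl. Everything here is hypothetical: the cache directory, the byte budget, and the ensure_space()/reserve() helpers are assumptions, not an existing CPAN interface. The "reservation" is just the file pre-extended to its final size with truncate, so concurrent reservations see the space as taken when they total up file sizes.

```perl
use strict;
use warnings;
use File::stat;    # object-returning stat(), for atime

my $cache_dir = '/var/cache/bigfiles';    # assumed location
my $max_bytes = 500 * 1024**3;            # 500GB budget for the whole cache

# Total bytes currently claimed by files in the cache directory.
sub cache_usage {
    my $total = 0;
    opendir my $dh, $cache_dir or die "opendir $cache_dir: $!";
    for my $f ( grep { -f "$cache_dir/$_" } readdir $dh ) {
        $total += -s "$cache_dir/$f";
    }
    return $total;
}

# Steps 1+2: evict least-recently-used files until $need bytes fit.
# Returns true if the space is available afterwards.
sub ensure_space {
    my ($need) = @_;
    die "object larger than the whole cache" if $need > $max_bytes;
    opendir my $dh, $cache_dir or die "opendir $cache_dir: $!";
    my @files = grep { -f $_ } map { "$cache_dir/$_" } readdir $dh;
    # Oldest access time first = LRU order.
    my @lru = sort { stat($a)->atime <=> stat($b)->atime } @files;
    while ( cache_usage() + $need > $max_bytes and @lru ) {
        my $victim = shift @lru;
        unlink $victim or warn "unlink $victim: $!";
    }
    return cache_usage() + $need <= $max_bytes;
}

# Step 3: reserve the space by pre-extending the file to its final size,
# so later ensure_space() calls count it as used.
sub reserve {
    my ( $name, $size ) = @_;
    open my $fh, '>', "$cache_dir/$name" or die "open $cache_dir/$name: $!";
    truncate $fh, $size or die "truncate: $!";
    close $fh;
}
```

One caveat with this sketch: on most filesystems truncate produces a sparse file, so the reservation holds at the size-accounting level but doesn't pin actual disk blocks; a real implementation would want fallocate (or writing zeros) plus some file locking to make reservations safe across processes.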

Later, some other process wants the file:

1. Is Foofile.dat in the cache? Either ask in Perl or look in the cache directory with ls or stat, etc.
2. Yes? Ok great. I'll read Foofile.dat myself.
3. No? Awww....handle fail condition.
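The lookup side is correspondingly simple: because the cache is just a directory of real files, a hit/miss check is a plain filename test that non-Perl processes can do with stat(2) or ls just as well. The cached_path() helper below is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the path on a cache hit, undef on a miss.
sub cached_path {
    my ( $dir, $name ) = @_;
    my $path = "$dir/$name";
    return -f $path ? $path : undef;
}

# Example: read on hit, fall back on miss (paths are assumptions).
if ( my $path = cached_path( '/var/cache/bigfiles', 'Foofile.dat' ) ) {
    open my $fh, '<', $path or die "open $path: $!";
    # ... read Foofile.dat ourselves ...
}
else {
    # Miss: handle the fail condition, e.g. schedule re-creation.
}
```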

Does such a thing already exist, or shall I have to make one? I haven't poked too deep, but I may be able to extend one of the cache modules on CPAN.

Update: It looks like writing a CHI::Driver is the way to go. After looking at the source of CHI::Driver::File, it seems simple enough and doable, and I'll probably just take CHI::Driver::File as a base and fork it for my purposes.
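That fork might boil down to something like the following rough sketch. It assumes CHI::Driver::File exposes a path_to_key method and a root_dir attribute, and that cache keys are already safe filenames; the exact hook name and signature should be checked against the CHI::Driver::File source before relying on this.

```perl
package CHI::Driver::NamedFile;

# Rough sketch only: assumes CHI::Driver::File's path_to_key hook and
# root_dir attribute -- verify against the real source before use.
use strict;
use warnings;
use File::Spec;
use parent 'CHI::Driver::File';

# Map each cache key straight to a file of the same name, so other
# (non-Perl) processes can find cached objects with plain ls/stat
# instead of hashed key paths.
sub path_to_key {
    my ( $self, $key ) = @_;
    return File::Spec->catfile( $self->root_dir, $key );
}

1;
```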

Replies are listed 'Best First'.
Re: Managing a "cache" of files
by Tanktalus (Canon) on Jul 28, 2009 at 16:13 UTC

    I had a similar requirement a while back, though my difference is that I didn't care how big the cache got - nothing could get deleted. So there's no LRU algorithm - once I set something, I don't want it gone until I tell you to get it gone. And thus Cache::Repository. If you add LRU to it (with, say, a size of '0' meaning 'unlimited'), send the patches back :-) Of course, it may be too far off what you need to bother with, but it's at least something to look at.

Thanks! I hadn't seen that, but it's getting closer. It takes care of the "use my filenames" requirement and some of the size management and counting aspects. Obviously it's missing a lot of other pieces though... :(
Re: Managing a "cache" of files
by perrin (Chancellor) on Jul 28, 2009 at 16:32 UTC
    CHI does all of this except the arbitrary filenames. Why don't you just subclass the file cache for CHI and change it to allow passed in filenames?
Re: Managing a "cache" of files
by Illuminatus (Curate) on Jul 28, 2009 at 16:36 UTC
The phrase "very expensive computationally to create but can easily be re-made" seems like an oxymoron to me. Why do you need to cache them at all if they are very easy to re-make? Generally, the purpose of a 'cache' is to provide faster-than-normal access to data. You say you don't care about speed of access. Since you are talking about potentially ~1TB of data, it seems to me that you're really just talking about a separate filesystem.
      You're right. Perhaps I was unclear. They're expensive CPU-wise to make, but it's trivial from a user-interaction perspective to re-make them. It's bad that we have to spend the CPU, but it's not a huge inconvenience or error. As with all caching, it's not a big deal if we don't have a cached version, since we can get the "original" and cache it; the purpose of the cache is to not have to get the original every time.