The underlying data are really files. Groups of files. The files do not contain interesting data, the data is the files. Files that could be hundreds of MB. Files that could be a couple KB. And both of these, together.

To use Cache::FileCache, we would likely also need Archive::Tar, both to bundle each group of files into the single file that gets cached and to pull the files back out on retrieval. We have lots of memory, but not that much. Since Archive::Tar loads data into memory, it's a bit cost-prohibitive.

Would you use Cache::*Cache modules for, say, CPAN? Perhaps, because there are only ever two files per item of interest (module distribution): the tarball and the checksum, meaning you always have to query the cache for exactly two files, and one's name is dependent on the other. I have an unknown number of files associated together, some of which are themselves tarballs (but, again, I'm not interested in the contents). Some of these files are related in such a way that a mostly simple regex could extrapolate one name from the other. Others are not related by name. In the first scenario, each insert/retrieval must hardcode each name, while the second scenario puts them together.

In the first scenario, Cache::FileCache works nearly identically to the existing code (except that the existing code doesn't need to load 250MB tarballs into memory before writing them back out). The second scenario attempts to handle this natural grouping inside the module itself. The module accepts file handles as input and writes their contents directly to a data store without, again, loading the entire file into memory.
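
To make the file-handle idea concrete, here is a minimal sketch of what I mean, not the actual module. The name store_from_handle, the group/name directory layout, and the 1MB buffer size are all assumptions for illustration. It relies on File::Copy's copy, which accepts file handles and an optional buffer size, so only one buffer's worth of data is in memory at a time.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper (not part of any existing module): copies the contents
# of an already-open file handle into $store_dir/$group/$name.
sub store_from_handle {
    my ($store_dir, $group, $name, $in_fh) = @_;

    # Assumed layout: one directory per group of related files.
    my $dir = File::Spec->catdir($store_dir, $group);
    make_path($dir);

    my $dest = File::Spec->catfile($dir, $name);
    open my $out_fh, '>', $dest or die "Cannot write $dest: $!";
    binmode $_ for $in_fh, $out_fh;

    # Copy in 1MB chunks, so a 250MB tarball is never held in memory whole.
    copy($in_fh, $out_fh, 1024 * 1024) or die "Copy to $dest failed: $!";
    close $out_fh or die "Cannot close $dest: $!";

    return $dest;
}
```

The caller opens the source however it likes (local file, pipe, socket) and hands over the handle, so the store never needs to know where the data came from.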

Re^3: Selecting one of two implementations
by merlyn (Sage) on Apr 25, 2005 at 15:29 UTC
    I don't understand. Are you making a cache, or a database?

    You keep using the word cache. What are you caching? A cache means you are saving one calculated result in an attempt to avoid recalculating it again. A cache also means it can simply disappear, and it shouldn't affect any part of your design except to slow it down a bit. (In fact, your program should still work even if the cache was completely forgetful.)

    So, are you building a cache, or a database?

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      Forgive my improper terminology - I didn't take computer science, so I may be using terms incorrectly. I would definitely like to extend the module into a cache (which would include a callback to generate missing information), but I suppose that we're looking more at a database. My confused terminology actually comes from my investigation of Cache::Cache. The documentation uses the example:

      use Cache::FileCache;

      my $cache = new Cache::FileCache( );

      my $customer = $cache->get( $name );

      if ( not defined $customer )
      {
          $customer = get_customer_from_db( $name );
          $cache->set( $name, $customer, "10 minutes" );
      }

      return $customer;

      So, if our calling code is implemented similarly, then the database becomes a cache. (Generating 200MB tar.gz files off the network isn't free, but using already-built tarballs from the local drive is about as free as it's going to get.)
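
      Applied to files rather than in-memory values, the same get-or-generate pattern might look roughly like the sketch below. This is only an illustration under my own assumptions: fetch_or_build is a made-up name, not something from Cache::Cache or our module, and the builder callback is assumed to write the file itself at the path it is given.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(make_path);

# Hypothetical helper (not part of Cache::Cache): returns the path of a stored
# file, calling $builder->($path) to create it only when it is missing.
sub fetch_or_build {
    my ($path, $builder) = @_;

    # Already on the local drive: about as free as it gets.
    return $path if -e $path;

    make_path(dirname($path));

    # Build under a temporary name, then rename, so a failed or interrupted
    # build never leaves a half-written file under the final name.
    my $tmp = "$path.tmp.$$";
    $builder->($tmp);
    rename $tmp, $path or die "Cannot rename $tmp to $path: $!";

    return $path;
}
```

      With something like this, deleting any stored file only costs a rebuild on the next request, which is the "forgetful cache" behaviour described above.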

      The way our code works now is that we start with a preparation phase, where the cache/database is populated with things that are going to be needed 1-15 times. Then we enter the execution phase where we pull items from the cache/database as needed to build what we need to build.

      Note that we've also given thought to actually using an RDBMS rather than the filesystem, which would give us more flexibility, e.g., having multiple machines pull from the common cache via DBI. Or using FTP or HTTP as protocols to do something similar. The RDBMS idea is more of a coolness thing; FTP or HTTP are probably better (i.e., more practical) protocols for this.