dd-b has asked for the wisdom of the Perl Monks concerning the following question:

So, now that I know why my picpage module is actually running slowly, I need to do something about it. (This started in Startup cost, CGI performance, and continued in CGI to Apache::Registry, or to mod_perl.)

The module displays one photo from a photo gallery, with title, description, comments, navigation links to previous and next photos in the gallery, that kind of thing. It's to run in a non-database environment.

Currently it gets most of its information from a YAML file written out by the galpage script. That script generates the thumbnail page for the gallery (not dynamically) after reading in the previous YAML file, looking at all the image files in the directory, checking for .name and .info files associated with the image files, and generally thrashing around a lot.

Part of the complexity is dictated by galpage being intended to replace a previous dynamic solution; it needs to read the same files that solution used, and incorporate them into the YAML for the future. It also has to be practical for me to edit the YAML file by hand -- the new way of doing things is to run galpage on a directory containing just the image files, then edit the titles and such into the YAML file. This, I might add, is a solution intended for me, not something to be turned over to users later.

In large directories, it turns out that reading the YAML is by far the largest component of the time picpage spends (2.5 seconds, where the next largest chunk is under 0.01 seconds).

I'm trying to reject the solution of writing a binary file (which could be read back in with far less parsing), because that would require building an application for doing the manual editing of the binary file -- a huge amount of work, for a much less pleasant way of editing the data (I'm an emacs user; 'nuff said).

So I see two classes of solutions:

  1. Find a human-editable way to store Perl data structures with better read-in performance than YAML.
  2. Have picpage (which is now running in mod_perl) cache the YAML data once read in, so nearby references to other pictures on the same page won't be slow. Since this is for web work, the caching would have to be shared among processes.

So, are there modules to suggest? I remember that HTML::Mason has a very nice caching system, which isn't actually specific to Mason, for example.

•Re: Big config file reading, or cross-process caching mechanism
by merlyn (Sage) on Jan 17, 2004 at 18:49 UTC
    Don't use YAML. Use Storable, which is much faster.

    If you need manual edits, just write a quickie YAML-to-Storable and Storable-to-YAML tweaker.
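
    A minimal sketch of such a tweaker, assuming the YAML and Storable modules (the y2s/s2y command-line interface here is just an illustration, not anything from the original post):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use YAML qw(LoadFile DumpFile);
      use Storable qw(nstore retrieve);

      my ($mode, $in, $out) = @ARGV;
      die "usage: $0 y2s|s2y infile outfile\n" unless defined $out;

      if ($mode eq 'y2s') {
          # Parse the human-edited YAML once, freeze it as Storable.
          nstore(LoadFile($in), $out);
      }
      elsif ($mode eq 's2y') {
          # Thaw the Storable image back into editable YAML.
          DumpFile($out, retrieve($in));
      }
      else {
          die "usage: $0 y2s|s2y infile outfile\n";
      }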

    If that's still not fast enough, use DBD::SQLite, and then you can extract from the file only exactly what you need.
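
    The win with SQLite is that picpage can pull just the one row it needs instead of parsing everything. A sketch, with an assumed photos(filename, title, description) table that isn't part of the original setup:

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect("dbi:SQLite:dbname=gallery.db", "", "",
                             { RaiseError => 1 });

      # Fetch only the metadata for the one photo being displayed.
      my $row = $dbh->selectrow_hashref(
          "SELECT title, description FROM photos WHERE filename = ?",
          undef, "img_0042.jpg",
      );
      print "$row->{title}: $row->{description}\n" if $row;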

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      I see what you mean:

      Reading yaml took 1074368876.73842 to 1074368878.98043, 2.24200582504272
      Reading storable took 1074368879.0243 to 1074368879.04725, 0.0229549407958984
      

      So that's a possibility. I'm still interested in caching, though.

        It's easy to do a quick cache using MLDBM::Sync (which uses Storable for serialization and handles locking for you). This is faster than what Mason currently uses. You could keep a "last modified" time in your cache data and stat the YAML file to make sure your cache is up to date. If the YAML file is newer than the cache, read it and write the data to cache. However, a cache won't solve the problem if you actually have to write the YAML from your program as well as read it.
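
        A sketch of that scheme, assuming MLDBM::Sync is installed; the cache file path and the get_gallery_data() wrapper are my own placeholders:

          use strict;
          use warnings;
          use MLDBM::Sync;                                # locked, cross-process DBM
          use MLDBM qw(MLDBM::Sync::SDBM_File Storable);  # Storable serialization
          use Fcntl qw(:DEFAULT);
          use YAML qw(LoadFile);

          tie my %cache, 'MLDBM::Sync', '/tmp/picpage_cache', O_CREAT|O_RDWR, 0640
              or die "tie failed: $!";

          sub get_gallery_data {
              my $yaml_file = shift;
              my $mtime = (stat $yaml_file)[9];
              my $entry = $cache{$yaml_file};

              # Reparse only when the YAML on disk is newer than the cached copy.
              unless ($entry and $entry->{mtime} >= $mtime) {
                  $entry = { mtime => $mtime, data => LoadFile($yaml_file) };
                  $cache{$yaml_file} = $entry;
              }
              return $entry->{data};
          }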
Re: Big config file reading, or cross-process caching mechanism
by toma (Vicar) on Jan 17, 2004 at 21:57 UTC
    There is also a third possibility to solve your problem.

    It is very easy to just move the part of your code that reads the YAML file to a BEGIN {} block. Then the overhead only occurs on the first execution of your mod_perl program within each httpd process.
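
    A minimal sketch of the BEGIN trick (the file name and the package variable are placeholders):

      use strict;
      use warnings;
      use YAML qw(LoadFile);

      our $GALLERY;    # persists for the life of each httpd child
      BEGIN {
          # Runs once per process, at compile time, not on every request.
          $GALLERY = LoadFile('/var/www/photos/gallery.yaml');
      }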

    The only disadvantage is that all processes will share the same file, but it looks like this is fine, since you have hardcoded the YAML file name anyway!

    I use this trick to read files > 100MB into memory to enable fast searches of the data.

    If you make a module that does this, you can also load it in your Apache startup, and then all the Apache children can share the same (copy-on-write) memory.

    If the YAML file needs to be dynamic, just have the mod_perl program check the date on the file before using the data. If the file is newer than when you last read it, reread it.

    Update I noticed that you are actually reading many YAML files in different directories. It's still easy: just keep a hash of parsed YAML data, using the directory path as the hash key. Then check whether the key exists in the hash, and only read the YAML file if you need to; see the sketch below.
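
    A sketch of that per-directory cache, folded together with the date check from above (the gallery.yaml name inside each directory is a placeholder):

      use strict;
      use warnings;
      use YAML qw(LoadFile);

      our %yaml_cache;    # dir path => { mtime => ..., data => ... }

      sub yaml_for_dir {
          my $dir   = shift;
          my $file  = "$dir/gallery.yaml";
          my $mtime = (stat $file)[9];

          my $entry = $yaml_cache{$dir};
          unless ($entry and $entry->{mtime} >= $mtime) {
              # First visit to this directory, or the file changed: reread.
              $entry = $yaml_cache{$dir}
                     = { mtime => $mtime, data => LoadFile($file) };
          }
          return $entry->{data};
      }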

    It should work perfectly the first time! - toma
Re: Big config file reading, or cross-process caching mechanism
by waswas-fng (Curate) on Jan 18, 2004 at 01:12 UTC
    Do use some database such as DBD::SQLite. It makes your metadata more searchable and flexible in the long run. Also, if you find yourself with many, many files per directory, consider spreading them into a tree structure to avoid long stat()s; a sketch follows below. And if there are many files and you find yourself searching the tree a lot, and you're on unix, see if your filesystem supports the noatime mount option. That can reduce recursive stat times by a _LOT_.
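
    A sketch of the tree-structure idea, using a digest of the file name to pick subdirectories so no single directory grows huge (the two-level layout is just one reasonable choice):

      use strict;
      use warnings;
      use Digest::MD5 qw(md5_hex);
      use File::Basename qw(dirname);
      use File::Copy qw(move);
      use File::Path qw(mkpath);

      # "img_0042.jpg" -> "ab/cd/img_0042.jpg", for digest-derived ab/cd.
      sub tree_path {
          my $name = shift;
          my $h = md5_hex($name);
          return join '/', substr($h, 0, 2), substr($h, 2, 2), $name;
      }

      my $name = 'img_0042.jpg';
      my $dest = 'photos/' . tree_path($name);
      mkpath(dirname($dest));
      move("incoming/$name", $dest) or die "move failed: $!";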


    -Waswas