Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am working on a project which is running under a threaded mod_perl. All hits to the site go through the same script, much like PerlMonks with its index.pl. Now, each hit to this script needs to extract a certain amount of data from a file specified by the 'node' (to use the PerlMonks analogy) in order to display a completed page to the browser.

The way I had coded it at first was as follows (I just faked a regex for locating the needed extraction; the real one is quite a bit more complex):

# <SNIP> - handle some pre-processing
# $FILE contains a safe path to an existing file on
# the filesystem, based on browser input.
open( my $fh, '<', $FILE ) or die "open failed: $!";
my $file = do { local $/; <$fh> };
close( $fh );
my ($needed) = $file =~ /\A<!\-\-(.*?)\-\->/s;
# now output the document, using the $needed info.

Then it struck me that this will be running under mod_perl and I thought about caching the needed extractions from the files into memory. So I rewrote it something like this:

BEGIN { use vars qw(%NEEDED); }   # %NEEDED, since the cache is keyed by file
# <SNIP> as in first example.
unless ( exists $NEEDED{$FILE} ) {
    open( my $fh, '<', $FILE ) or die "open failed: $!";
    my $file = do { local $/; <$fh> };
    close( $fh );
    ($NEEDED{$FILE}) = $file =~ /\A<!\-\-(.*?)\-\->/s;
}
# now output the document, using the $NEEDED{$FILE} info.

The files from which the required information is being extracted aren't all that large, so it seems to me that slurping the entire file contents on each hit wouldn't be too much of a burden. My basic question is whether such simple file I/O on not-too-large files would add up to many CPU cycles under heavy load. By using the caching, I am also hitting the HDD far less often. Again, is it enough to deserve a caching mechanism? Am I worrying too much by caching the extracted pieces, or am I being smart?

NB: I figure that the hash method for caching is good enough in this case as it is a threaded mod_perl, so the cache won't be kept separately for each Apache process (as there is only one). As such, I wonder if perhaps I should use Cache::SharedMemoryCache so that there won't be an additional burden should it ever be moved to a non-threaded mod_perl.

Re: Repetitive File I/O vs. Memory Caching
by graff (Chancellor) on Mar 28, 2004 at 05:52 UTC
    I'm no expert on this, but I'll speculate. (If you're feeling generous, call it a "thought experiment"... :)

    The relative benefit may depend on the pattern of activity on the site. If a lot of clients hit the same page (same needed file) in a pretty short time, the server might be caching the file anyway -- 10 hits in a short time (with few or no intervening hits on other pages) won't be that different from a single hit in terms of HDD activity, and caching the little files in mod_perl doesn't buy you that much.

    On the other hand, if you have a broad dispersion of pages being selected somewhat randomly, keeping a cache in mod_perl will tend to cause its memory footprint to grow (rapidly at first, then more gradually), with the upper bound being the total size of all the little files; meanwhile, the recall rate for a given cached page in this pattern is relatively low, and again, caching in mod_perl doesn't buy you that much.

    In fact, if there's no "expiration" period for the cached files, you're likely going to end up with a lot of the cached page info being swapped out to virtual memory -- so when someone hits one of these pages, the server still needs to do disk i/o to serve the info, only now it's for the sake of memory swapping, rather than reading a small data file.

    There may be a scenario where the sort of caching you're suggesting could really be a boost for you, and it may be a realistic one for you, but personally, I'd opt for the "extra" overhead of reading small files, just to keep the whole thing simpler overall.

    To really speed things up, considering that the amount of data being fetched per page is fairly small, it would make more sense to store it all in a single MySQL or Postgres database; these things are built for speed (it's hard to improve on their approach to optimizing disk i/o), and mod_perl is built to take maximum advantage of the benefits they provide.
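
    Fetching the per-page snippet would then be a one-row query, something like this (untested; the table and column names are invented, and under mod_perl you would let Apache::DBI keep the connection persistent):

    use strict;
    use DBI;

    my $node = shift @ARGV;               # stand-in for the page identifier from the request
    my $dbh  = DBI->connect( 'dbi:mysql:database=site', 'user', 'pass',
                             { RaiseError => 1 } );
    my $sth  = $dbh->prepare('SELECT needed FROM pages WHERE node = ?');
    $sth->execute($node);
    my ($needed) = $sth->fetchrow_array;
    # now output the document, using the $needed info.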

      I think I'm agreeing with most of what you've said... it makes sense to me anyhow. The only part I don't like is the storing of these files in a database. Yes, I could do it, but I always hear people who do such things grumbling later on due to editing quirks. These files will be edited once in a while, and it is so much easier to open a file in your favorite editor and make changes than to update a database table. On the other hand, just for fun, I think I am going to create a simple script that will be a database editor... sounds like fun :)

        You are quite right about the "grumbling" -- putting any sort of free-form text content into a database will tend to create a barrier for people who need to maintain and update that content. If there isn't a simple procedure in place to do that, it's a killer.

        Even when there is a "simple" procedure in place, the problem can be that it's the only procedure available. Editing text files and storing/updating them on disk really has become analogous to writing on paper: any number of utensils can be used, from the pencil stub invariably found on the floor to the $250 Cartier Fountain Pen. But the typical approach to maintaining text fields in a database is more like the old days of Ma Bell: this is the telephone that you get, it's black, you don't actually own it, and there's nothing you can do to change how it works.

        Maybe a better approach would be to perfect a system for maintaining the database by "importing" from all these little files -- let the files be updated by whatever means are considered suitable, then just fold the new version into the database by some simple process, about which the content authors are blissfully ignorant.
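
        Such an import step might look something like this (an untested sketch; the directory, table, and column names are made up, and REPLACE INTO is MySQL-specific):

        #!/usr/bin/perl -w
        use strict;
        use DBI;

        my $dir = '/path/to/nodes';       # wherever the little files live
        my $dbh = DBI->connect( 'dbi:mysql:database=site', 'user', 'pass',
                                { RaiseError => 1 } );
        my $sth = $dbh->prepare('REPLACE INTO pages (node, needed) VALUES (?, ?)');

        opendir my $dh, $dir or die "opendir failed: $!";
        for my $node ( grep { -f "$dir/$_" } readdir $dh ) {
            open my $fh, '<', "$dir/$node" or die "open $node failed: $!";
            my $file = do { local $/; <$fh> };
            close $fh;
            my ($needed) = $file =~ /\A<!--(.*?)-->/s;
            $sth->execute( $node, $needed ) if defined $needed;
        }
        closedir $dh;
        $dbh->disconnect;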

        There's nothing preventing you from creating a small script that replaces a current page stored in a database with a newly edited version. That seems so trivial that it shouldn't be a factor discouraging you from using a database. All the "big" sites can't be completely off track.


        Dave

Re: Repetitive File I/O vs. Memory Caching
by davido (Cardinal) on Mar 28, 2004 at 06:05 UTC
    It scares me to think that with your current implementation, there is nothing preventing the cache from growing to the total size of your website (every page will become cached eventually). Multiply that by the potential for multiple instances of the same cache, and you could end up with a potentially huge memory footprint. This is not a scalable approach. With some care and caution you could make sure the cache never grows beyond a predetermined size (a smarter caching method), and at the same time find a way of dealing with multiple instances.
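
    For illustration, a minimal sketch of a size-bounded cache (untested; the names and the limit are made up, and eviction here is simply oldest-inserted-first):

    use strict;

    my %CACHE;
    my @ORDER;
    my $MAX_ENTRIES = 500;          # assumed limit

    sub cache_get {
        my ($key) = @_;
        return $CACHE{$key};
    }

    sub cache_set {
        my ( $key, $value ) = @_;
        unless ( exists $CACHE{$key} ) {
            push @ORDER, $key;
            delete $CACHE{ shift @ORDER } if @ORDER > $MAX_ENTRIES;
        }
        $CACHE{$key} = $value;
    }

    A real implementation would want smarter eviction (LRU, expiry times), which is the sort of thing the Cache::Cache family of modules already provides.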

    I would not be surprised if caching could be used to gain some performance improvements. But I have to wonder if the caching should, perhaps, be a layer built onto the database side of things rather than the CGI script. Let a layer between the database and the CGI script deal with the nitty gritty task of calculating which pages are most popular, and limiting the cache size. Doing it this way could make it easier to deal with multiple instances also, since there need only be one instance of the database caching layer.


    Dave

Re: Repetitive File I/O vs. Memory Caching
by tachyon (Chancellor) on Mar 28, 2004 at 07:33 UTC

    Any decent OS will cache the most recently accessed pages from disk in memory anyway, regardless of what you do. Any decent RDBMS will also cache data in memory remarkably effectively.

    Looking at your code, it seems you are somewhat re-implementing SSI. Why not just use SSI? On that thought, if you have a limited number of dynamically generated pages, making them static with periodic updates will be faster than serving them dynamically.
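
    Roughly what the periodic-update approach looks like (a sketch only; the paths are made up and render_page() is a stub standing in for whatever currently builds the page). Run it from cron whenever the source files change:

    use strict;

    my $node_dir   = '/path/to/nodes';      # assumed location of the node files
    my $static_dir = '/path/to/htdocs';     # assumed document root

    sub render_page {                       # stub: replace with the real page-building code
        my ($node) = @_;
        return "<html><body>$node</body></html>";
    }

    opendir my $dh, $node_dir or die "opendir failed: $!";
    for my $node ( grep { -f "$node_dir/$_" } readdir $dh ) {
        my $html = render_page($node);
        open my $out, '>', "$static_dir/$node.html" or die "write failed: $!";
        print $out $html;
        close $out;
    }
    closedir $dh;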

    cheers

    tachyon

      SSI does not cache
Re: Repetitive File I/O vs. Memory Caching
by perrin (Chancellor) on Mar 28, 2004 at 07:27 UTC
    Your understanding of data sharing in perl threads is not quite correct. You need to explicitly make that variable shared. Nothing is shared by default. Look at the threads::shared man page.
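
    For example, with plain ithreads the caching hash would have to be declared shared explicitly, something like this (a bare illustration, not mod_perl-specific):

    use strict;
    use threads;
    use threads::shared;

    my %NEEDED :shared;     # without :shared, each thread works on its own copy

    sub needed_for {
        my ($file) = @_;
        lock(%NEEDED);      # serialize updates to the shared hash
        unless ( exists $NEEDED{$file} ) {
            open my $fh, '<', $file or die "open failed: $!";
            my $contents = do { local $/; <$fh> };
            close $fh;
            ( $NEEDED{$file} ) = $contents =~ /\A<!--(.*?)-->/s;
        }
        return $NEEDED{$file};
    }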

    If you end up switching to a multi-process MPM, don't use Cache::SharedMemoryCache -- it is extremely slow. Use a fast one like BerkeleyDB (with built-in locking), Cache::FastMmap, or MySQL.
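
    For what it's worth, Cache::FastMmap usage looks roughly like this (the file name, size, and expiry are just example values, and $FILE is as in the original post):

    use strict;
    use Cache::FastMmap;

    # one mmap'ed cache file shared by all Apache children
    my $cache = Cache::FastMmap->new(
        share_file  => '/tmp/needed-cache',   # example path
        cache_size  => '1m',
        expire_time => 600,                   # seconds
    );

    my $FILE   = '/path/to/the/node/file';    # as in the original code
    my $needed = $cache->get($FILE);
    unless ( defined $needed ) {
        open my $fh, '<', $FILE or die "open failed: $!";
        my $file = do { local $/; <$fh> };
        close $fh;
        ($needed) = $file =~ /\A<!--(.*?)-->/s;
        $cache->set( $FILE, $needed );
    }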

      It's the mod_perl that is threaded, not my perl script. Under a single-process/multi-threaded Apache, a variable declared the way I did will be shared across the mod_perl threads... it works for me anyhow :) If you don't believe me, pop this script under a mod_perl environment. Indeed, if I were using a multi-process Apache (forked rather than threaded), a separate count would be kept for each Apache process:

      #!c:/perl/bin/perl -w
      $|++;
      use strict;
      use CGI::Simple;
      use vars qw($count);    # could also use our()

      my $CGI = CGI::Simple->new();
      print $CGI->header(), ++$count;
        Sorry, it's just not true. There is no sharing between perl threads unless you declare it, and perl threads are what mod_perl uses. Now, you can load data during startup, and it will be copied into each thread (or process), but it will not be shared. As soon as you write anything to it after startup, the thread you did the writing in will have a different version of it than all the others.

        The script that you've shown here will show a separate count for every thread. The only way that variable could be the same for every request with this script is if you are actually only running one thread, so everything keeps hitting the same one. If you don't believe me, ask our resident threads expert, liz, or ask on the mod_perl list.

Re: Repetitive File I/O vs. Memory Caching
by crabbdean (Pilgrim) on Mar 28, 2004 at 05:33 UTC
    I don't know a whole heap about the comparative differences between file I/O and caching, but my first thought on reading this was to test it. That is, write another small script that loops to simulate a million hits, run one version against the cache and the other against file I/O, and then use the "times" method in Perl to get an idea of the time usage for each process.

    Update: I should add that I'd suspect caching would be faster.
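
    Something along these lines with the Benchmark module would do it (an untested sketch; the path and regex are stand-ins for the real ones):

    use strict;
    use Benchmark qw(cmpthese);

    my $FILE = '/path/to/some/node/file';   # stand-in path
    my %NEEDED;

    cmpthese( 100_000, {
        file_io => sub {
            open my $fh, '<', $FILE or die "open failed: $!";
            my $file = do { local $/; <$fh> };
            close $fh;
            my ($needed) = $file =~ /\A<!--(.*?)-->/s;
        },
        cached => sub {
            unless ( exists $NEEDED{$FILE} ) {
                open my $fh, '<', $FILE or die "open failed: $!";
                my $file = do { local $/; <$fh> };
                close $fh;
                ( $NEEDED{$FILE} ) = $file =~ /\A<!--(.*?)-->/s;
            }
            my $needed = $NEEDED{$FILE};
        },
    } );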

    Dean
    The Funkster of Mirth
    Programming these days takes more than a lone avenger with a compiler. - sam
    RFC1149: A Standard for the Transmission of IP Datagrams on Avian Carriers
Re: Repetitive File I/O vs. Memory Caching
by eXile (Priest) on Mar 28, 2004 at 06:11 UTC
    I agree with the previous suggestion to test/measure. In the last few weeks I have started to use profiling (with Devel::DProf) and benchmarking (see "benchmarking your code", the node that got me started) more and more, and this helps me understand where my code spends its time and where to begin speeding it up. I've never done this for mod_perl, but Google found me a performance tuning guide for mod_perl here, which (from the looks of it) seems to explain this rather well.

    Even though I have good experience with Cache::Cache (a very nice module), that doesn't mean it will also speed up your code in its specific setting (maybe your disk is 10 times as fast as mine?).
Re: Repetitive File I/O vs. Memory Caching
by dragonchild (Archbishop) on Mar 29, 2004 at 02:01 UTC
    My experience is in using prefork, but I cache some 10M of data from the database in the startup.pl, which gave me a 99% speedup on all my pages. (The page design is weird - it requires a potential 10k hits against a database, depending on the user's usertype.)

    I can't imagine that threaded would be that different. My only contribution to the thread would be to cache everything before page hits. (But, I know nothing about threaded MP.)
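
    For the record, the prefork version of that preload looks roughly like this (a sketch; the package, table, and connection details are invented):

    # ---- pulled in from startup.pl, before Apache forks its children ----
    package My::Preload;      # hypothetical name
    use strict;
    use DBI;

    our %DATA;                # read-only after startup, so each child shares it
                              # cheaply via copy-on-write

    my $dbh = DBI->connect( 'dbi:mysql:database=site', 'user', 'pass',
                            { RaiseError => 1 } );
    my $sth = $dbh->prepare('SELECT node, needed FROM pages');
    $sth->execute;
    while ( my ( $node, $needed ) = $sth->fetchrow_array ) {
        $DATA{$node} = $needed;
    }
    $dbh->disconnect;

    1;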

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose