in reply to Speed searching HTML docs
Are the documents being updated through a web interface that you also control? I.e., can you do something that is driven by the update event rather than having to poll the last-updated time? If so, you could do a rebuild on document update (with a mechanism to avoid multiple concurrent rebuilds).
If you are forced to poll file timestamps to determine when to update, one possibility would be to use the first approach you mention above (updating on demand when a search is requested) with some modifications:
1. Whenever you rebuild the cache, note the time at which the rebuild was initiated. Only after the rebuild completes successfully, store that start time as the "cache rebuild start time" that later checks compare against.
2. When a search occurs, you can compare the file timestamps to the cache rebuild start timestamp. If any file is newer than the cache rebuild start time, a rebuild is needed.
3. To avoid having to do step 2 very often, you can also record the last time you did step 2 and only do it again after some fixed time has elapsed (time-to-live).
4. You would need some mechanism to avoid running multiple cache rebuilds concurrently, but you also need a way to prevent that mechanism from locking out all future cache rebuilds if a cache rebuild failed part way through.
5. The user whose search triggered a cache rebuild could be returned results from a search against the old keyword cache, so that they don't have to wait for the rebuild to take place (if that is acceptable).
6. You might also need a mechanism for preventing step 2 from running multiple times concurrently.
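Here is a rough sketch of steps 1 through 4 in Python, just to make the idea concrete. Everything named here is my own invention for illustration: the `state` dict (which you would really persist somewhere shared, such as a small file), the `rebuild` callback, the lock file path, and the TTL_SECONDS and LOCK_MAX_AGE values. The stale-lock handling is one simple way to keep a crashed rebuild from blocking future ones; it is not race-free, so treat it as a starting point only.

```python
import os
import time

TTL_SECONDS = 60     # step 3: minimum gap between timestamp checks (assumed value)
LOCK_MAX_AGE = 600   # step 4: a lock older than this is treated as abandoned (assumed value)


def newest_mtime(doc_dir):
    """Return the latest modification time of any file under doc_dir (0.0 if none)."""
    latest = 0.0
    for root, _dirs, files in os.walk(doc_dir):
        for name in files:
            latest = max(latest, os.path.getmtime(os.path.join(root, name)))
    return latest


def needs_rebuild(doc_dir, state, now=None):
    """Steps 2 and 3: compare file timestamps with the last successful
    rebuild start time, but at most once per TTL_SECONDS."""
    now = time.time() if now is None else now
    if now - state.get("last_check", 0.0) < TTL_SECONDS:
        return False
    state["last_check"] = now
    return newest_mtime(doc_dir) > state.get("rebuild_started", 0.0)


def try_lock(lock_path, now=None):
    """Step 4: create the lock file exclusively, first removing it if it
    is older than LOCK_MAX_AGE (a previous rebuild presumably died)."""
    now = time.time() if now is None else now
    try:
        if now - os.path.getmtime(lock_path) > LOCK_MAX_AGE:
            os.remove(lock_path)
    except FileNotFoundError:
        pass  # no lock present, or someone else just removed it
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another rebuild holds the lock
    os.close(fd)
    return True


def rebuild_if_stale(doc_dir, lock_path, state, rebuild):
    """Run rebuild() if files changed; return True only if a rebuild completed."""
    if not needs_rebuild(doc_dir, state):
        return False
    if not try_lock(lock_path):
        return False
    started = time.time()
    try:
        rebuild()
        state["rebuild_started"] = started  # step 1: stored only after success
        return True
    finally:
        try:
            os.remove(lock_path)
        except FileNotFoundError:
            pass
```

For step 5, the search handler could call something like `rebuild_if_stale` in the background and answer the current request from the old cache. Note that the `last_check` update in `needs_rebuild` is itself unprotected, which is the concern raised in step 6.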
This would be much easier on an operating system that had a reliable task scheduler like cron.