in reply to Caching files question

I want to speed up the process of retrieving information from files. In my code, for each query I loop over the files; for each file I retrieve the data into an array and process it.

That's a bit vague. How badly is optimization needed, really? (Don't bother optimizing if you don't have to.) If you really need to optimize, how much of the problem is actually I/O-bound, as opposed to CPU-bound? (You should try profiling the code to see where most of the run time is spent: disk reads or processing loops?)
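
For example, assuming Devel::NYTProf is installed (the script name below is just a placeholder), a quick profiling run looks something like this:

    # run the script under the profiler, then turn the raw data into an HTML report
    perl -d:NYTProf your_script.pl
    nytprofhtml        # report lands under ./nytprof/

The per-line timings make the disk-versus-CPU question easy to settle.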

One way is to store all these files in an array of arrays, and then I guess it will be much faster.

My question is: is there an easy way to avoid reprogramming and use a cache, so that files retrieved once from the HD will be used in subsequent queries without retrieving them over and over from the HD?

If the amount of data in question fits easily into available RAM, and if your code involves handling lots of queries on the same data in a single run, then obviously you will want to load all the data into memory at start-up, then process all the queries using the in-memory arrays (or hashes, or whatever), and that will be the easy way.
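
A minimal sketch of that approach (the glob pattern is a placeholder, and the "query" here is just a substring match standing in for whatever your real processing does):

    use strict;
    use warnings;

    # Read every data file exactly once, up front.
    my %data_for;                                  # filename => ref to array of records
    for my $file ( glob 'data/*.txt' ) {           # placeholder pattern
        open my $fh, '<', $file or die "Can't open $file: $!";
        chomp( my @records = <$fh> );
        close $fh;
        $data_for{$file} = \@records;
    }

    # Answer every query from the in-memory copy, never touching the disk again.
    while ( my $query = <STDIN> ) {
        chomp $query;
        for my $file ( keys %data_for ) {
            my $hits = grep { index( $_, $query ) >= 0 } @{ $data_for{$file} };
            print "$file: $hits matching records\n";
        }
    }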

If the amount of data is uncomfortably large (e.g. it won't fit in RAM, or even if it does, it takes too long to load it all at start-up), you should consider using some sort of relational database, or DBM (hash) files, for disk storage. Since you are talking about processing "queries", the most effective solution for a large and/or complicated data set is to do a suitable amount of indexing up front, and this is generally a simple matter of storing the data in indexed file structures (relational database tables or DBM hash files).
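
If the relational route sounds attractive, here is a rough sketch with SQLite (this assumes DBD::SQLite is installed; the table and column names are made up):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=mydata.db', '', '',
                            { RaiseError => 1 } );

    # One-time load: put the records behind an index.
    $dbh->do('CREATE TABLE IF NOT EXISTS records (rec_key TEXT PRIMARY KEY, rec_value TEXT)');

    # Each later "query" is then a single indexed lookup instead of a pass over all the files.
    my $lookup = $dbh->prepare('SELECT rec_value FROM records WHERE rec_key = ?');
    $lookup->execute('some_key');                  # placeholder key
    my ($value) = $lookup->fetchrow_array;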

By using an existing RDB query engine or the appropriate flavor of tie %my_hash, ..., "my_hash_file", ... you get quite a lot of optimization for free -- both in terms of improving access speed when reading data from disk, and in terms of reducing the amount of processing that needs to be coded and executed in your script. (Look at AnyDBM_File for more info about hash files.)
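
A bare-bones version of the tie approach (the file name and key below are placeholders):

    use strict;
    use warnings;
    use AnyDBM_File;
    use Fcntl;                                     # for the O_* flags

    # Reads and writes go through the dbm layer on disk,
    # so there is no hand-rolled file scanning to maintain.
    tie my %lookup, 'AnyDBM_File', 'my_hash_file', O_RDWR | O_CREAT, 0666
        or die "Can't tie my_hash_file: $!";

    $lookup{'some_key'} = 'some value';            # written through to the file
    print $lookup{'some_key'}, "\n";               # read back via the dbm index

    untie %lookup;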

Re^2: Caching files question
by vit (Friar) on Aug 16, 2008 at 19:07 UTC
    graff, by query I do not mean an SQL query. My files are not in a DB; I do not use a DB at all.
    Yes, retrieving data from the HD takes much more time than processing, and again, I know how to rearrange my code to retrieve it just once.
    Yes, my data fits in RAM.
    I just want to know if it's possible to do caching (besides OS caching) without rearranging the code.
      I just want to know if it's possible to do caching (besides OS caching) without rearranging the code.

      Sorry... I wouldn't know how to answer that without seeing some of the code in question. Actually, I'm not sure what that question means, if you are trying to ask about something other than loading all the file data into RAM.

      Maybe BrowserUK has provided a relevant suggestion, if your existing code has "open()" statements scattered throughout that happen to be reopening the same files over and over. But if that is the case, perhaps going ahead with a suitable refactoring of the code would be time well spent.
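
      For what it's worth, if the repeated cost really is just re-reading the same files, one fairly low-impact change is to route the reads through a small caching wrapper and call it wherever the files are currently slurped (the sub name is made up):

          use strict;
          use warnings;

          my %file_cache;                          # filename => ref to array of lines

          sub read_file_cached {
              my ($file) = @_;
              return $file_cache{$file} if exists $file_cache{$file};   # already read once

              open my $fh, '<', $file or die "Can't open $file: $!";
              chomp( my @lines = <$fh> );
              close $fh;

              return $file_cache{$file} = \@lines; # first read: keep a copy in RAM
          }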

      (Updated to adjust grammar and punctuation, and to add the following:)

      by query I do not mean an SQL query.

      I wasn't specifically referring only to SQL queries either, but was including an RDB solution as a possible alternative (along with hash look-up, which does not involve SQL but could be viewed as another type of query). You've left it unclear what "query" means in your app, but perhaps that's not relevant.

      My files are not in a DB; I do not use a DB at all.

      And my point was: if your existing app is problematic in some way, you might want to consider using some sort of DB as one way of improving it.