vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I want to speed up the process of retrieving information from files. In my code, for each query I loop over files; for each file I read the data into an array and process it.
One way is to store all these files in an array of arrays, which I guess would make it much faster.
My question is: is there an easy way to avoid reprogramming and use a cache, so that files retrieved once from the HD are reused in subsequent queries without being retrieved from the HD over and over?

Replies are listed 'Best First'.
Re: Caching files question
by graff (Chancellor) on Aug 16, 2008 at 17:01 UTC
    I want to speed up the process of retrieving information from files. In my code, for each query I loop over files; for each file I read the data into an array and process it.

    That's a bit vague. How badly is optimization needed, really? (Don't bother optimizing if you don't have to.) If you really need to optimize, how much of the problem is really i/o-bound, as opposed to cpu-bound? (You should try profiling the code to see where most of the run-time is spent: disk reads or processing loops?)
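
    As a rough way to see which side dominates, something like this with the core Benchmark module can compare the two in isolation (the file name 'data.txt' and the /pattern/ match are placeholders for your own data and processing):

        use Benchmark qw( timethese );

        # read the file once up front so the processing test is I/O-free
        open my $fh, '<', 'data.txt' or die "Can't read data.txt: $!";
        my @lines = <$fh>;
        close $fh;

        timethese( 50, {
            disk_read  => sub {                    # I/O cost per pass
                open my $in, '<', 'data.txt' or die $!;
                my @l = <$in>;
                close $in;
            },
            processing => sub {                    # CPU cost per pass
                my $hits = grep { /pattern/ } @lines;
            },
        } );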

    One way is to store all these files in an array of arrays, which I guess would make it much faster.

    My question is: is there an easy way to avoid reprogramming and use a cache, so that files retrieved once from the HD are reused in subsequent queries without being retrieved from the HD over and over?

    If the amount of data in question fits easily into available ram, and if your code involves handling lots of queries on the same data in a single run, then obviously you will want to load all the data into memory at start-up, then process all the queries using the in-memory arrays (or hashes, or whatever), and that will be the easy way.
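
    A minimal sketch of that load-once approach (@files, @queries and process_query() are placeholders for whatever your script already has):

        my %data;                                  # path => array ref of lines
        for my $path (@files) {
            open my $fh, '<', $path or die "Can't read $path: $!";
            $data{$path} = [ <$fh> ];              # hit the disk exactly once
            close $fh;
        }
        for my $query (@queries) {
            process_query( $query, \%data );       # all look-ups now in RAM
        }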

    If the amount of data is uncomfortably large (e.g. won't fit in ram, or even if it does, it takes too long to load it all at start-up), you should consider using some sort of relational database, or DBM (hash) files for disk storage. Since you are talking about processing "queries", the most effective solution for a large and/or complicated data set is to do a suitable amount of indexing up front, and this is generally a simple matter of storing the data into indexed file structures (relational database tables or dbm hash files).

    By using an existing RDB query engine or the appropriate flavor of tie %my_hash, ..., "my_hash_file", ... you get quite a lot of optimization for free -- both in terms of improving access speed when reading data from disk, and in terms of reducing the amount of processing that needs to be coded and executed in your script. (Look at AnyDBM_File for more info about hash files.)
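
    For instance, with a tied DBM hash (the key/value layout here is just an assumption about your data):

        use Fcntl;
        use AnyDBM_File;

        tie my %my_hash, 'AnyDBM_File', 'my_hash_file', O_RDWR|O_CREAT, 0644
            or die "Can't tie my_hash_file: $!";

        $my_hash{$record_key} = $record_value;     # written through to disk
        my $hit = $my_hash{$record_key};           # indexed look-up, no file scan

        untie %my_hash;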

      graff, by "query" I do not mean an SQL query. My files are not in a DB; I do not use a DB at all.
      Yes, retrieving data from the HD takes much more time than processing, and again, I know how to rearrange my code to retrieve it just once.
      Yes, my data fits in RAM.
      I just want to know if it's possible to do caching (besides OS caching) without rearranging the code.
        I just want to know if it's possible to do caching (besides OS caching) without rearranging the code.

        Sorry... I wouldn't know how to answer that without seeing some of the code in question. Actually, I'm not sure what that question means, if you are trying to ask about something other than loading all the file data into RAM.

        Maybe BrowserUK has provided a relevant suggestion, if your existing code has "open()" statements scattered throughout that happen to be reopening the same files over and over. But if that is the case, perhaps going ahead with a suitable refactoring of the code would be time well spent.

        (Updated to adjust grammar and punctuation, and to add the following:)

        by "query" I do not mean an SQL query.

        I wasn't specifically referring only to sql queries either, but was including an RDB solution as a possible alternative (along with hash look-up, which does not involve sql but could be viewed as another type of query). You've left it unclear what "query" means in your app, but perhaps that's not relevant.

        My files are not in a DB; I do not use a DB at all.

        And my point was: if your existing app is problematic in some way, you might want to consider using some sort of DB as an alternative for improving it.

Re: Caching files question
by BrowserUk (Patriarch) on Aug 17, 2008 at 00:11 UTC

    You could use something like this:

        my %cache;

        sub myOpen {
            my( $mode, $path ) = @_;
            my $fh;
            if( exists $cache{ $path } ) {
                open $fh, $mode, \$cache{ $path } or return;
            }
            else {
                open $fh, $mode, $path or return;
                {
                    local $/;
                    sysread( $fh, $cache{ $path }, -s( $path ) ) or return;
                    close $fh;
                }
                open $fh, $mode, \$cache{ $path } or return;
            }
            return $fh;
        }

    Substitute calls to myOpen() for your existing calls to open. If the file hasn't been opened before, it is opened in the usual way, but its contents are slurped into the cache hash, keyed by the pathname, and the file is closed.

    That hash value (scalar) is then opened as a ram file, and the ramfile filehandle returned to the caller. If the path already exists in the cache, just this last step is required.

    I'm assuming these files are read-only. You can write to ramfiles, but then you're into the task of ensuring that they get written back to the filesystem in a timely manner.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      It's reasonable to assume that vit would figure this out, but maybe it's worth mentioning that in order for "myOpen" to work, the existing code has to switch to using lexically scoped file handles (if it is not doing so already), and the existing open() calls have to change to exactly this syntax, moving the file handle out of the arg list and making sure that mode and pathname are separate args:
      my $fh = myOpen( "<", $pathname )
      (with an added "or die ...", as appropriate)

      Also, I was curious why you bother to localize $/, given that you are calling sysread, which doesn't use $/ at all. And when doing sysread, it would be good to check the return value more carefully -- zero means total failure, but any value other than the size of the file would mean a partial failure, which would probably be just as bad:

          my %cache;

          sub myOpen {
              my( $mode, $path ) = @_;
              my $fh;
              if ( not exists $cache{ $path } ) {
                  -s $path or return;   # don't do a 0-length or non-existent file
                  open $fh, $mode, $path or return;
                  ( sysread( $fh, $cache{ $path }, -s _ ) == -s _ ) or return;
                  close $fh;
              }
              open $fh, $mode, \$cache{ $path } or return;
              return $fh;
          }
      (updated to include the whole subroutine with a simplified conditional block, added the check for non-zero return from "-s", and removed an unnecessary "$size" variable)
        BrowserUk and graff,
        Thank you very much, I think this is exactly what I need.
        graff,
        please let me know what you mean by "using lexically scoped file handles". Do you mean outside the scope of sub myOpen()?
        What I am doing is calling Script3.pl from Script2.pl from Script1.pl, and the files are opened inside Script3.pl. Does that mean I have to open the files in the outer Script1.pl?
        Am I right that the files will no longer be cached once I stop the outer script?
        Also, I was curious why you bother to localize $/ ...

        Untested code, and I was in two minds about how I would implement it. It also requires that the OP change his existing code to assign the returned filehandle, rather than passing it as a parameter per open. Which I considered a good thing.

        Like you, I'm not really sure in what circumstances this would be useful, so I raised the possibility without putting too much effort into trying to make it bulletproof. I wanted the OP to either be sufficiently aware to fix it up himself, or to ask.

        I like your re-write++. One additional change I would make is to use a hard-coded '<:raw' mode on the real open and the user-supplied mode on the ramfile open. As you have it, if he passes a non-read mode, things will go wrong. Though that might be a good thing also...ponder...undecided.

        It might also be worth doing some rudimentary "Is this a huge file?" check. I don't like arbitrary limits, but issuing a warning if the file is bigger than, say, 100MB might be the clue stick that avoids mysterious failures.

        I also thought some about using one of those modules I never use to canonal...canonica...make the paths absolute and unique (to save loading the same file twice), but with all the convolutions possible on *nix, it would take some serious thought.

        Oh. And I'd definitely use unless( exists $cache{ $path } ) { :)
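
        Folding those ideas together, an untested sketch (the 100MB threshold is the arbitrary number from above, and Cwd::abs_path is my guess at one of those modules):

            use Cwd qw( abs_path );

            my %cache;

            sub myOpen {
                my( $mode, $path ) = @_;
                $path = abs_path( $path ) || $path;   # one cache entry per real file
                unless( exists $cache{ $path } ) {
                    my $size = -s $path or return;    # skip 0-length/non-existent files
                    warn "myOpen: caching big file $path ($size bytes)\n"
                        if $size > 100 * 1024 * 1024; # rudimentary huge-file clue stick
                    open my $disk, '<:raw', $path or return;  # real open is always raw read
                    ( sysread( $disk, $cache{ $path }, $size ) == $size ) or return;
                    close $disk;
                }
                open my $fh, $mode, \$cache{ $path } or return;  # user mode on ramfile only
                return $fh;
            }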

        All of that said, from the OP's latest description of the application (nested CGI calls), none of this is likely to help, as the cache will get re-built every time the scripts are run.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Caching files question
by kyle (Abbot) on Aug 16, 2008 at 19:21 UTC

    Is Memoize what you're looking for? Generally, I agree with JavaFan. Caching file contents is a job for the OS. I agree also with graff that you should really confirm that disk reading is your bottleneck before you try to optimize it.
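
    A sketch of the Memoize route (read_file() and 'data.txt' are stand-ins for whatever routine and files you already have):

        use Memoize;

        sub read_file {
            my( $path ) = @_;
            open my $fh, '<', $path or die "Can't read $path: $!";
            return [ <$fh> ];                  # array ref of lines
        }
        memoize( 'read_file' );                # same args => cached result

        my $lines = read_file( 'data.txt' );   # first call hits the disk
        $lines    = read_file( 'data.txt' );   # second call comes from the cache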

Re: Caching files question
by JavaFan (Canon) on Aug 16, 2008 at 15:59 UTC
    Yes, it's called 'file buffer' and is done by your OS.

    Of course, if your files are large, or if you otherwise use up all of your memory, the files may expire from your cache before you use them again.

    Perhaps the better way is to read in the files one by one, then loop over all the queries per file.
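
    Sketched out, with @files, @queries and match_query() standing in for your own loop and per-file processing:

        for my $path (@files) {
            open my $fh, '<', $path or die "Can't read $path: $!";
            my @lines = <$fh>;                 # each file leaves the disk once
            close $fh;
            for my $query (@queries) {
                match_query( $query, \@lines );
            }
        }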

      The files are not big, and OS buffering does not help. There should be some Perl modules or functions that can help.