
Browser,

The fields range from about 3 to 100. There are many types of pins; these are no more than a dozen characters. Many users can run the script, which works in this fashion:
  1. Determine the section and id being processed.
  2. Hash all of the needed customer files for this section.
  3. Process.
If user A and user B are running the same section but different ids, two instances of the script each load all of the same information into hashes in memory. The processing differs only slightly because of the ids.

Re^3: Moving from hashing to tie-ing.
by BrowserUk (Patriarch) on Jul 31, 2006 at 16:50 UTC

    Is the content of the file static or does the processing involve updates?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Static. Throughout all of the processing nothing is changed, only a new file is created.

        Since the data is static, no locking or updating is required. Move the existing "build a hash but don't split the fields" code into a separate script that builds the hash, then opens a port and listens. On my system, building the hash takes around 3 minutes for a 2.5 GB file containing ~8 million records.
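
        A minimal sketch of the "build a hash but don't split the fields" step. The file layout is an assumption (one record per line, '|' as delimiter, key in the first field), since none of that is given in this thread, and load_records is a made-up name:

```perl
use strict;
use warnings;

# Hypothetical layout: one record per line, key is everything before the
# first '|'. Neither the delimiter nor the key position comes from the thread.
sub load_records {
    my ($fh) = @_;
    my %records;
    while ( my $line = <$fh> ) {
        chomp $line;
        # Capture only the key; the rest of the line stays one unsplit string.
        # Lines without a delimiter are skipped.
        my ($key) = $line =~ /^([^|]*)\|/ or next;
        $records{$key} = $line;
    }
    return \%records;
}
```

        Keeping each record as a single string avoids creating an array per record for ~8 million records; the split is deferred until a record is actually requested.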

        This server script needn't be complicated as all requests will be of the form:

        1. Listen
        2. Read key
        3. Reply with record from memory.
        4. Loop.
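
        The four steps above could be sketched roughly as follows. The wire format is an assumption (one newline-terminated key per request, one newline-terminated record per reply, an empty line for an unknown key), and serve and answer_requests are made-up names. IO::Select is used so that several client scripts can stay connected at once:

```perl
use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;

# Answer every complete "key\n" line in $$buf_ref and return the replies as
# one string. An unknown key gets an empty line (an assumed convention).
# Any incomplete trailing request is left in the buffer.
sub answer_requests {
    my ( $records, $buf_ref ) = @_;
    my $out = '';
    while ( $$buf_ref =~ s/^([^\n]*)\n// ) {
        my $key = $1;
        $key =~ s/\r\z//;
        $out .= ( exists $records->{$key} ? $records->{$key} : '' ) . "\n";
    }
    return $out;
}

sub serve {
    my ( $records, $port ) = @_;
    local $SIG{PIPE} = 'IGNORE';    # a vanished client must not kill the server

    my $listener = IO::Socket::INET->new(
        LocalAddr => '127.0.0.1',
        LocalPort => $port,
        Listen    => 10,
        ReuseAddr => 1,
    ) or die "cannot listen on port $port: $@";

    my $select = IO::Select->new($listener);
    my %buffer;    # partial input per client, keyed by the stringified socket

    while (1) {                                  # 4. Loop
        for my $sock ( $select->can_read ) {     # 1. Listen
            if ( $sock == $listener ) {
                my $client = $listener->accept or next;
                $buffer{$client} = '';
                $select->add($client);
                next;
            }
            # 2. Read key(s); append to whatever partial input is pending.
            my $n = sysread( $sock, $buffer{$sock}, 4096, length $buffer{$sock} );
            if ( !$n ) {    # EOF or error: forget this client
                $select->remove($sock);
                delete $buffer{$sock};
                close $sock;
                next;
            }
            # 3. Reply with the record(s) from memory.
            print {$sock} answer_requests( $records, \$buffer{$sock} );
        }
    }
}
```

        Calling serve( \%records, 9000 ) with the already built hash would start the loop; 9000 is an arbitrary port chosen for illustration.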

        In each script from which you removed the hash-building code, replace that code with a call to tie the hash.

        Create a Tie::Hash module that only implements the TIEHASH and FETCH methods.

        The TIEHASH method connects to the listening port (or starts the new script in the background if the port is unavailable and then connects).

        The FETCH method checks its local cache for the requested key. If the key is not found, it posts the key to the background script, reads back the record, splits it into fields and caches the result locally in a hash as an array (ref).
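
        A minimal sketch of such a tied hash. The package name RecordClient, the constructor arguments, the '|' delimiter and the line-based wire format (key out, record back, empty line meaning "no such key") are all assumptions. Starting the server when the port is unavailable is left out:

```perl
use strict;
use warnings;

{
    package RecordClient;    # hypothetical name
    use parent 'Tie::Hash';  # unimplemented methods croak with a clear message
    use IO::Socket::INET;

    # tie my %h, 'RecordClient', port => 9000;
    # An already connected handle may be passed as socket => $fh instead.
    sub TIEHASH {
        my ( $class, %args ) = @_;
        my $socket = $args{socket} // IO::Socket::INET->new(
            PeerAddr => $args{host} // '127.0.0.1',
            PeerPort => $args{port},
            Proto    => 'tcp',
        ) or die "cannot reach record server: $@";
        $socket->autoflush(1);
        return bless { socket => $socket, cache => {} }, $class;
    }

    sub FETCH {
        my ( $self, $key ) = @_;

        # Already requested once: hand back the cached array ref.
        return $self->{cache}{$key} if exists $self->{cache}{$key};

        my $sock = $self->{socket};
        print {$sock} "$key\n";
        my $record = <$sock>;
        die "record server closed the connection\n" unless defined $record;
        chomp $record;
        return undef if $record eq '';    # assumed "no such key" reply

        # Split once, keep trailing empty fields, cache the result.
        return $self->{cache}{$key} = [ split /\|/, $record, -1 ];
    }
}
```

        With this in place, a script would do something like tie my %customer, 'RecordClient', port => 9000; and keep using $customer{$id} as before, provided the existing code already expects an array ref of fields per key.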

        Now,

        1. The huge file is loaded only once.
        2. The records only get split once upon request, and are thereafter supplied, already split, from local cache.
        3. Your modifications to the existing scripts are confined to the removal of the hash loading code and replacing it with a very simple tied hash. The rest of the code remains unchanged and runs much faster.
        4. If you ever get around to loading the data into a real DB, the tied hash interface can be modified under the covers to retrieve the information from there, and again the rest of the existing code requires no further modification.
