in reply to Re: Moving from hashing to tie-ing.
in thread Moving from hashing to tie-ing.

eff_i_g,
Presumably your problem is one of performance, though you never come out and say it. You haven't told us enough about how this data structure works within the program to offer ideas beyond simply changing the data structure itself.

Here are some things to think about. Instead of turning the pipe-delimited values into hashes (HoH), you could use arrays (HoA), as they take up less space. You also don't indicate whether having the entire data structure in memory at once is even necessary. One possibility would be to load only the portion of the structure needed for any one unit of work at a time. While this adds I/O, it lets you trade time for memory, since your memory requirements will then be manageable.
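For illustration only (the ids, field names, and values below are made up), the two shapes look like this:

    # HoH: every record repeats the field-name keys
    my %name_hoh;
    $name_hoh{'123'} = { first => 'John', last => 'Doe' };
    print $name_hoh{'123'}{first};

    # HoA: values stored positionally, with one shared map of field positions
    my %pos = ( first => 0, last => 1 );
    my %name_hoa;
    $name_hoa{'123'} = [ 'John', 'Doe' ];
    print $name_hoa{'123'}[ $pos{first} ];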

I have lots of other ideas but they are all what-ifs until you share more about how your program works and how the datastructure works within that program.

Cheers - L~R

Re^3: Moving from hashing to tie-ing.
by eff_i_g (Curate) on Jul 31, 2006 at 16:15 UTC
    Limbic,

    The files we receive are fixed length and many of the fields are not even used. A file may contain a dozen fields, but we may only need 2 or 3. Since these files are extracted from their database, there is a lot of id matching to be done, which is mainly what the hashes do, such as $name_hash{'123'} = { first => 'John', last => 'Doe' };. Once all of the supporting files are hashed in this manner, a main script uses them as lookup tables. Therefore, every time it sees a record that has '123' in a certain field, it knows to use "John Doe" during the processing.
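    Roughly, the hashing step looks like this (the field widths and file name below are only for illustration; each real file has its own layout):

        open my $fh, '<', 'names.dat' or die "names.dat: $!";
        my %name_hash;
        while ( my $line = <$fh> ) {
            # pull out only the 2 or 3 fields we actually use
            my ( $id, $first, $last ) = unpack 'A10 A20 A20', $line;
            $name_hash{$id} = { first => $first, last => $last };
        }
        close $fh;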

    Having all of the data in memory is not necessary; however, I do not know how this could be done without using a database. Each record (there can be anywhere from a few dozen to a few thousand) needs to use these supporting hashes.
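    The tie-ing from the subject line is what I have been considering for this; a rough, untested sketch using DB_File and Storable (the file name is made up):

        use Fcntl;
        use DB_File;
        use Storable qw(freeze thaw);

        # keep the lookup on disk instead of in RAM
        tie my %name_db, 'DB_File', 'names.db', O_RDWR | O_CREAT, 0644, $DB_HASH
            or die "Cannot tie names.db: $!";

        # store each record as a frozen hashref
        $name_db{'123'} = freeze { first => 'John', last => 'Doe' };

        # later, one lookup at a time
        my $rec = thaw $name_db{'123'};
        print "$rec->{first} $rec->{last}\n";

        untie %name_db;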

    Let me know if you need more information, I appreciate your help.
      eff_i_g,
      You really haven't said anything at all about how the program works or how it decides what data it needs and when.

      Since many of your fields are not needed, they need not be included in your data structure, provided you know in advance that they won't be needed. If only 1 id is ever worked with at a time, then there is no need to load more than one record into memory at a time. Alternatively, it may be possible to employ an MRU cache: the splits are cached in arrays, but only a fixed number are kept, where the most recently used stay in the cache and the others expire.
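      A rough sketch of that cache idea (fetch_record() stands in for whatever reads a single record from disk, and the limit is arbitrary):

          my $MAX = 1_000;                # how many records to keep in memory
          my ( %cache, @order );

          sub lookup {
              my ($id) = @_;
              if ( exists $cache{$id} ) {
                  # promote to most recently used
                  @order = ( ( grep { $_ ne $id } @order ), $id );
                  return $cache{$id};
              }
              my $rec = fetch_record($id);    # hypothetical: read one record from disk
              $cache{$id} = $rec;
              push @order, $id;
              delete $cache{ shift @order } if @order > $MAX;   # expire the oldest
              return $rec;
          }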

      Try to put yourself in my shoes. Read what you have written about your program, your data structure, and your problem, and see whether you feel you have provided enough information for someone to help. Again, we are just guessing.

      Cheers - L~R

        Limbic,

        I apologize; I'm trying :) This is a little challenging since I am also learning.

        The basic programming process is explained in my reply to BrowerUk. It's that simple, but it deals with a lot of information. The problem is with step 2: it hashes all of the data provided, when the script may need only a fraction of it.

        To reiterate: correct, the whole lookup is not needed for processing. The pins that are needed could be determined by reading all of the pins in the source file; the largest file is around 25 MB, 40,000 lines. If the source file only used pins 123, 456, and 789, I could look for just those in the other file when building the hash.
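        Something like this two-pass approach is what I have in mind (the file names and field offsets are placeholders):

            # pass 1: note every pin used in the source file
            my %wanted;
            open my $src, '<', 'source.dat' or die $!;
            while (<$src>) {
                my $pin = substr $_, 0, 10;     # wherever the pin sits in the record
                $wanted{$pin}++;
            }
            close $src;

            # pass 2: hash only the supporting records whose pin was seen
            my %lookup;
            open my $sup, '<', 'support.dat' or die $!;
            while (<$sup>) {
                my ( $pin, $first, $last ) = unpack 'A10 A20 A20', $_;
                next unless $wanted{$pin};
                $lookup{$pin} = { first => $first, last => $last };
            }
            close $sup;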