in reply to Re: Moving from hashing to tie-ing.
in thread Moving from hashing to tie-ing.

eff_i_g,
Ok, this still doesn't answer my questions, but I do have what I believe to be a half-decent suggestion for you. You don't indicate how often the customer provides these dumps or how many "runs" are done on the data between new dump files. Assuming the dumps arrive no more than once a day and that the number of runs between new dumps is more than a few, the following methodology should improve the efficiency of the existing code with only minor modifications:

First, create a pre-process script that parses the huge source file and the supporting data file one time. Its job is to index the byte position of each ID in each file. Store that index in a database (DBD::SQLite or some such) or in a serialized data structure (Storable or some such). What this buys you is the ability, given an ID, to open the two files and quickly read in just the record associated with that ID. No searching required and no parsing of unrelated IDs necessary.
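
Here is a minimal sketch of what that pre-process script might look like for one of the files, assuming each record is a single line with the ID as the first pipe-delimited field (adjust the split to match your real format; the file names are made up too):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Storable qw(nstore);

    # Hypothetical paths - substitute your actual dump and index files
    my $dump  = 'customer_dump.txt';
    my $index = 'customer_dump.idx';

    my %offset;
    open my $fh, '<', $dump or die "Can't open $dump: $!";
    while (1) {
        my $pos  = tell $fh;           # byte offset of the record about to be read
        my $line = <$fh>;
        last unless defined $line;
        # Assumes the ID is the first pipe-delimited field
        my ($id) = split /\|/, $line, 2;
        push @{ $offset{$id} }, $pos;  # an ID may own more than one record
    }
    close $fh;

    nstore \%offset, $index;           # serialize the index for later runs

You would run the same loop over the supporting data file with its own index. The index only needs to be rebuilt when a new dump arrives, so its cost is amortized across all the runs in between.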

Second, make a minor modification to the current script so that it uses the pre-processed index to pull in just the record(s) associated with a given ID. Now you can build as complex a data structure as makes sense and need not constantly re-split the entire file.
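
A sketch of that lookup side, again under the same assumed one-line-per-record format and made-up file names (seek jumps straight to the stored byte offset, so each lookup costs one read regardless of file size):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Storable qw(retrieve);
    use Fcntl qw(SEEK_SET);

    my $dump   = 'customer_dump.txt';
    my $offset = retrieve('customer_dump.idx');  # index built by the pre-process script

    # Hypothetical helper: return only the record(s) belonging to one ID
    sub records_for {
        my ($id) = @_;
        return unless exists $offset->{$id};
        open my $fh, '<', $dump or die "Can't open $dump: $!";
        my @records;
        for my $pos (@{ $offset->{$id} }) {
            seek $fh, $pos, SEEK_SET;    # jump directly to the record
            push @records, scalar <$fh>; # read just that one line
        }
        close $fh;
        return @records;
    }

    my @recs = records_for('ABC123');    # only these lines get split and parsed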

Ultimately this is not what I would like to suggest, but given the lack of details it is the best I can offer.

Cheers - L~R