in reply to Moving from hashing to tie-ing.

Everyone,

Apparently I'm thick-skinned this week. I'll give it a try starting from the top.


A very small example:
a_supporting_data_file:
123456789John Doe LotsOfOtherFields
987654321Bob SmithLotsOfOtherFields
...hundreds of thousands of more records...
source_file:
LotsOfFields123456789LotsOfFields
LotsOfFields987654321LotsOfFields
...tens of thousands of more records...
When I'm while-ing through the source file and I substr out the ID 123456789 (I know this is the ID because the documentation says the field at position X with length Y is such-and-such), I need to have the data "John Doe" available to me. When I substr out the ID 987654321, I need to have the data "Bob Smith" available to me.
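To make the setup concrete, here is a minimal sketch of the parse-once, hash-lookup approach being described. The field positions (ID at offset 0 for 9 characters, name at offset 9 for 9 characters in the supporting file; ID at offset 12 in the source file) are assumptions taken from the sample records above, and the in-memory strings stand in for the real files:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins for the real files; real code would open them from disk.
my $support = "123456789John Doe LotsOfOtherFields\n"
            . "987654321Bob SmithLotsOfOtherFields\n";
my $source  = "LotsOfFields123456789LotsOfFields\n"
            . "LotsOfFields987654321LotsOfFields\n";

# One pass over the supporting file builds the lookup hash.
my %name_for;
open my $sup_fh, '<', \$support or die $!;
while ( my $line = <$sup_fh> ) {
    my $id   = substr $line, 0, 9;    # assumed position/length
    my $name = substr $line, 9, 9;    # assumed position/length
    $name =~ s/\s+$//;                # trim trailing pad spaces
    $name_for{$id} = $name;
}
close $sup_fh;

# While-ing through the source file, the name is one hash lookup away.
open my $src_fh, '<', \$source or die $!;
while ( my $line = <$src_fh> ) {
    my $id = substr $line, 12, 9;     # assumed position/length
    print "$id => $name_for{$id}\n";
}
close $src_fh;
```

The cost is that every run of the program re-parses the entire supporting file to rebuild %name_for, which is exactly the inefficiency at issue.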

What if user A and user B are running separate instances of the program and both need a_supporting_data_file? It gets re-parsed and re-hashed by each instance.

I hope things are clearer. Thanks for your help thus far.

A lot of your answers have been helpful. I just have to see how they would fit into the existing code and the existing time frame. As I said before, I think it needs to be rewritten, but here's the situation: this project is no longer the responsibility of the old programmer, so I'm in the seat of "You need to understand as much as possible as soon as possible." I've already spent so much time creating structures, routines, modules, and documentation (and adding "use warnings;" and "use strict;") that my timeline for improvements is shrinking. I guess you could say I'm looking for a halfway solution. I could always leave the code as is because it does work; it's just hideously inefficient, and I'd like to learn new methods of handling this.

Replies are listed 'Best First'.
Re^2: Moving from hashing to tie-ing.
by kwaping (Priest) on Aug 02, 2006 at 14:36 UTC
    If it were me in this situation, I would leave the ugly-but-working code intact to handle current processing. I would then study it thoroughly to determine the original specs of the project. I would then reverse-engineer a ground-up rewrite of the code using those specs.

    That way, you're not constrained by any legacy code conventions and can do it your way right from the start. You have the added advantage of a more relaxed time-frame, since the current code is still working in the production environment.

    ---
    It's all fine and dandy until someone has to look at the code.
      I'm in agreement with your approach, kwaping. Thank you. Well, at least I learned a little bit about tie through this whole debacle (don't use it on huge files), and I started reading the chapter.

      When and if the rewrite takes place, I think putting the data back into a database is the way to go. Some advanced SQL statements would eliminate a great deal of the Perl coding. I'll see.

      Thanks for your time.
Re^2: Moving from hashing to tie-ing.
by Limbic~Region (Chancellor) on Aug 02, 2006 at 19:54 UTC
    eff_i_g,
    Ok, this still doesn't answer my questions, but I do have what I believe to be a half-decent suggestion for you. You do not indicate how often you are provided these dumps from the customer or how many "runs" are done on the data in between new dump files. Assuming the dumps arrive no more than once a day and that the number of "runs" in between new dumps is more than a few, the following methodology should improve the efficiency of the existing code with only minor modifications:

    First, create a pre-process script that parses the huge source file and supporting data file one time. Its job is to index the position of each ID in the file. This information should be stored in a database (DBD::SQLite or some such) or in a serialized data structure (Storable or some such). What this buys you is the ability, given an ID, to open the 2 files and quickly read in just the record associated with that ID. No searching required, and no parsing of unrelated IDs necessary.
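A sketch of that pre-process step, using core modules only (Storable and File::Temp). The tiny demo file written at the top stands in for the real supporting file, and the ID-is-the-first-9-characters layout is an assumption carried over from the earlier sample:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Storable  qw(nstore);
use File::Temp qw(tempdir);

# Demo stand-in for the real supporting data file.
my $dir       = tempdir( CLEANUP => 1 );
my $data_file = "$dir/a_supporting_data_file";
open my $out, '>', $data_file or die $!;
print $out "123456789John Doe LotsOfOtherFields\n",
           "987654321Bob SmithLotsOfOtherFields\n";
close $out;

# The pre-process pass: remember the byte offset where each
# ID's record starts, instead of keeping the records themselves.
my %offset_for;
open my $in, '<', $data_file or die $!;
until ( eof $in ) {
    my $pos  = tell $in;              # offset before reading the record
    my $line = <$in>;
    $offset_for{ substr $line, 0, 9 } = $pos;   # assumed ID layout
}
close $in;

# Serialize the index; later runs load this instead of re-parsing the file.
my $index_file = "$data_file.idx";
nstore \%offset_for, $index_file;
print "indexed ", scalar keys %offset_for, " records\n";
```

The same loop would be run once per file (source and supporting), producing one small index per dump rather than a full in-memory hash per run.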

    Second, make a minor modification to the current script so that it uses the pre-processed index to pull in just the record(s) associated with that ID. Now you can create as complex a data structure as makes sense and need not constantly re-split.
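The run-time side of that idea might look like the sketch below: load the serialized index once, then seek straight to a record on demand. The demo setup at the top just repeats the pre-process step inline so the example is self-contained; record_for() and all file names are illustrative, not anyone's actual API:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Storable  qw(nstore retrieve);
use File::Temp qw(tempdir);

# --- Demo setup: a tiny data file plus its offset index (the
# --- pre-process step from above, inlined for self-containment).
my $dir        = tempdir( CLEANUP => 1 );
my $data_file  = "$dir/a_supporting_data_file";
my $index_file = "$data_file.idx";
open my $out, '>', $data_file or die $!;
print $out "123456789John Doe LotsOfOtherFields\n",
           "987654321Bob SmithLotsOfOtherFields\n";
close $out;
my %idx;
open my $in, '<', $data_file or die $!;
until ( eof $in ) {
    my $pos  = tell $in;
    my $line = <$in>;
    $idx{ substr $line, 0, 9 } = $pos;
}
close $in;
nstore \%idx, $index_file;

# --- The "minor modification" to the run script starts here:
# load the index once, then fetch single records on demand.
my $offset_for = retrieve $index_file;
open my $fh, '<', $data_file or die $!;

sub record_for {
    my ($id) = @_;
    return unless exists $offset_for->{$id};
    seek $fh, $offset_for->{$id}, 0 or die "seek failed: $!";
    return scalar <$fh>;              # reads just this one record
}

print record_for('987654321');
```

Because seek jumps directly to the stored byte offset, each lookup reads one record instead of scanning hundreds of thousands.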

    This ultimately is not what I would like to suggest, but given the lack of details it is the best I can offer.

    Cheers - L~R

Re^2: Moving from hashing to tie-ing.
by rvosa (Curate) on Aug 02, 2006 at 11:49 UTC
    What is the objection to using a database? You told us the files are database table dumps.

    Maybe you can get the SQL commands to recreate the database locally, then just load up the files and use the recommended CPAN database stack (e.g. DBI, DBD::mysql, SQL::Translator, DBIx::Class).
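For a sense of what that buys you, here is a minimal sketch of loading the fixed-width dump into a local database and querying it through DBI. It assumes DBD::SQLite is installed and uses an in-memory database; the table and column names, and the substr positions, are made up for the demo (carried over from the earlier sample layout):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;   # assumes DBD::SQLite is available

my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
                        { RaiseError => 1, AutoCommit => 1 } );

$dbh->do('CREATE TABLE people (id TEXT PRIMARY KEY, name TEXT)');

# Load the fixed-width dump; positions are the assumed layout.
my $ins  = $dbh->prepare('INSERT INTO people (id, name) VALUES (?, ?)');
my $dump = "123456789John Doe LotsOfOtherFields\n"
         . "987654321Bob SmithLotsOfOtherFields\n";
open my $fh, '<', \$dump or die $!;
while ( my $line = <$fh> ) {
    my $id = substr $line, 0, 9;
    ( my $name = substr $line, 9, 9 ) =~ s/\s+$//;
    $ins->execute( $id, $name );
}
close $fh;

# A lookup is now one SQL statement instead of a hand-rolled hash,
# and joins against the source-file table replace the Perl plumbing.
my ($name) = $dbh->selectrow_array(
    'SELECT name FROM people WHERE id = ?', undef, '123456789' );
print "$name\n";   # John Doe
```

With both files loaded as tables, the per-run work becomes SELECTs and JOINs that the database indexes for you, which lines up with the "advanced SQL statements would eliminate a great deal of the Perl coding" idea from earlier in the thread.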

    What sort of time frame are we looking at?