Everyone,
Apparently I'm thick-skinned this week. I'll give it a try starting from the top.
- The entire project is already written (poorly) in Perl and consists of upwards of 30,000 lines. I have inherited it and am reworking it as best I can.
- We are provided a number of files (database dumps) from the customer. These are alike in that they all have fixed-width fields; however, the number of fields and the sizes of these files differ. The files arrive in ASCII and the records are separated by newlines. We have documentation on the layout of each file, so we know how to get certain fields; this is currently being done with substr. For example, substr($record, 0, 10) would give you the ID.
- Multiple (hundreds of) types of output can be generated from these files based on certain field criteria; e.g., pulling ID=123 and TYPE=this is going to give you a different data set than pulling ID=456 and TYPE=that.
- The challenge is that the source doesn't have everything I need. I need to print a name, but I only have the ID in the source. The name is in another file, keyed by that ID. If this were SQL I would be using a JOIN, but it's not: it's a group of oftentimes huge files, like the aforementioned 2.5 gigger.
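To make the selection step concrete, here's a rough sketch of pulling records by field criteria. Only the ID at substr($record, 0, 10) comes from the description above; the TYPE position, the %layout table, the field() and matches() helpers, and the sample records are all made up for illustration and are not the real code.

```perl
use strict;
use warnings;

# Hypothetical layout table: field name => [offset, length].
# Only the ID position (0, 10) is taken from the post; TYPE is assumed.
my %layout = (
    id   => [0, 10],
    type => [10, 4],
);

# Cut one named field out of a fixed-width record and drop its padding.
sub field {
    my ($record, $name) = @_;
    my ($offset, $length) = @{ $layout{$name} };
    my $value = substr($record, $offset, $length);
    $value =~ s/\s+$//;
    return $value;
}

# True if every requested field has the requested value.
sub matches {
    my ($record, %want) = @_;
    for my $name (keys %want) {
        return 0 unless field($record, $name) eq $want{$name};
    }
    return 1;
}

# Made-up sample records, built with sprintf so the widths are exact.
my @records = (
    sprintf("%-10s%-4s%s", '123', 'this', 'rest-of-record'),
    sprintf("%-10s%-4s%s", '456', 'that', 'rest-of-record'),
);

my @picked = grep { matches($_, id => '123', type => 'this') } @records;
print "$_\n" for @picked;
```

Keeping the offsets in one table per file type is just one way to avoid scattering substr calls; the existing code does not necessarily do this.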
A very small example:
a_supporting_data_file:
123456789John Doe LotsOfOtherFields
987654321Bob SmithLotsOfOtherFields
...hundreds of thousands more records...
source_file:
LotsOfFields123456789LotsOfFields
LotsOfFields987654321LotsOfFields
...tens of thousands more records...
When I'm while-ing through the source file and I substr out the ID of 123456789 (I know this is the ID due to the documentation that says field at position X with a length of Y is the such-and-such), I need to have the data "John Doe" available to me. When I substr out the ID of 987654321, I need to have the data "Bob Smith" available to me.
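Roughly, the parse-and-hash approach looks like the sketch below. The offsets and lengths are read off the toy records above (the real ones come from the layout documentation). The sub names and the in-memory filehandles are stand-ins I've made up so the example runs on its own.

```perl
use strict;
use warnings;

# Assumed positions, taken from the small example only.
my ($SUP_ID_POS, $SUP_ID_LEN, $SUP_NAME_POS, $SUP_NAME_LEN) = (0, 9, 9, 9);
my ($SRC_ID_POS, $SRC_ID_LEN) = (12, 9);

# Parse the whole supporting file into an ID => name hash.
sub load_names {
    my ($fh) = @_;
    my %name_for;
    while (my $record = <$fh>) {
        chomp $record;
        my $id   = substr($record, $SUP_ID_POS,   $SUP_ID_LEN);
        my $name = substr($record, $SUP_NAME_POS, $SUP_NAME_LEN);
        $name =~ s/\s+$//;    # drop fixed-width padding
        $name_for{$id} = $name;
    }
    return \%name_for;
}

# Walk the source file and look each ID up in the hash.
sub names_for_source {
    my ($fh, $name_for) = @_;
    my @names;
    while (my $record = <$fh>) {
        chomp $record;
        my $id = substr($record, $SRC_ID_POS, $SRC_ID_LEN);
        push @names, $name_for->{$id} // 'UNKNOWN';
    }
    return @names;
}

# In-memory stand-ins for the two files.
my $support_data = "123456789John Doe LotsOfOtherFields\n"
                 . "987654321Bob SmithLotsOfOtherFields\n";
my $source_data  = "LotsOfFields123456789LotsOfFields\n"
                 . "LotsOfFields987654321LotsOfFields\n";

open my $support_fh, '<', \$support_data or die "open: $!";
open my $source_fh,  '<', \$source_data  or die "open: $!";

my $name_for = load_names($support_fh);
my @names    = names_for_source($source_fh, $name_for);
print "$_\n" for @names;
```

The lookup itself is cheap; the cost is in load_names, which has to read every record of the supporting file before the first source record can be processed.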
What if user A and user B are running separate instances of the program and they both need to use a_supporting_data_file? Each instance re-parses and re-hashes the file from scratch.
I hope things are clearer. Thanks for your help thus far.
A lot of your answers have been helpful. I just have to see how they would fit into the existing code and the existing time frame. As I said before, I think it needs to be rewritten, but here's the situation: this project is no longer the responsibility of the old programmer, so I'm in the seat of "You need to understand as much as possible as soon as possible." I've already spent so much time building structures, routines, modules, and documentation (and adding "use warnings;" and "use strict;") that the time I have left for improvements is steadily shrinking. I guess you could say that I'm looking for a half-way solution. I can always leave the code as is because it does work; it's just hideously inefficient, and I'd like to learn new methods of handling this.