I recently undertook a task for a client that involved generating a large amount of data. We're mostly a Java shop these days, but Perl seemed like the best choice for the data munging required.
It was quite simple to implement the algorithm I was given. Watching how that implementation changed over the course of the project has been a real learning experience for me.
Quick, non-NDA-breaking overview of the project:
Create a set of data, with one entry of output corresponding to each line of input in a supplied file. Verify that there are no duplicate lines of data, and that the data conforms to its rules. The generated data tends to be quite large - 100,000 lines of input produce a 43MB output file.
First pass:
The first implementation I came up with read the entire input file into memory, then looped over it, creating each line of data. Once all of the data was created, it was written out to the file.
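A rough sketch of what that first pass looked like (the file names and make_record() below are hypothetical placeholders, since the real generation rules are under NDA):

    use strict;
    use warnings;

    # First pass (sketch): slurp the entire input file, build every
    # record in memory, then write the whole output at the end.
    my $infile  = 'input.txt';      # placeholder names
    my $outfile = 'output.txt';

    open my $in, '<', $infile or die "Can't read $infile: $!";
    my @lines = <$in>;              # whole input file held in memory
    close $in;

    my @records;
    for my $line (@lines) {
        chomp $line;
        push @records, make_record($line);   # every record held in memory too
    }

    open my $out, '>', $outfile or die "Can't write $outfile: $!";
    print {$out} "$_\n" for @records;
    close $out;

    # Stand-in for the real (NDA'd) record generation.
    sub make_record {
        my ($line) = @_;
        return "record for: $line";
    }

With both @lines and @records growing with the input, memory usage scales linearly with the size of the input file.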
The problems with this approach should be obvious. Testing with data sets of 10, 100, 1,000, 10,000, and 100,000 entries, I noted that both the run time and the memory used grew by a straight factor of ten at each step. Creating 100,000 lines of data required ~65MB of memory. Not great, but not a showstopper either.
Then, we got word that we would be creating 1,000,000 (yes, one meeelion) records at a time, far above the upper bound we had tested. At a straight factor of ten increase in memory, that run would need something like 650MB, so I could see that something would have to change.
Second pass:
For the second pass, I used a simple caching mechanism. The output file was created and held open for the duration of the script. Once the number of generated records in memory reached 10,000, they were written to disk and the in-memory storage was undef'ed.
This little trick had the benefit of reducing the maximum memory footprint to around 8MB, while not increasing the run time noticeably.
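In sketch form, the change amounts to keeping the output handle open and flushing the buffer every 10,000 records (again with placeholder names and a dummy make_record()):

    use strict;
    use warnings;

    # Second pass (sketch): the output file stays open for the whole
    # run, and the record buffer is flushed to disk and undef'ed every
    # 10,000 records. The input is still slurped at this stage.
    my $infile   = 'input.txt';
    my $outfile  = 'output.txt';
    my $flush_at = 10_000;

    open my $in, '<', $infile or die "Can't read $infile: $!";
    my @lines = <$in>;
    close $in;

    open my $out, '>', $outfile or die "Can't write $outfile: $!";

    my @buffer;
    for my $line (@lines) {
        chomp $line;
        push @buffer, make_record($line);
        if (@buffer == $flush_at) {
            print {$out} "$_\n" for @buffer;
            undef @buffer;          # release the buffered records
        }
    }
    print {$out} "$_\n" for @buffer;    # flush whatever is left over
    close $out;

    sub make_record {
        my ($line) = @_;
        return "record for: $line";
    }

The buffer size of 10,000 is the only tuning knob here; a smaller value trades a little more I/O for a smaller memory ceiling.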
Third pass:
For the final iteration, I tackled the input file problem. Instead of reading the whole file into memory and then looping over it, I changed the code to read the data one line at a time. As each line was read, a record was created. In this manner, only one line of input and at most 10,000 records are ever in memory at once. This final tweak brought the memory consumption down to 3MB max.
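In sketch form, the only structural difference from the second pass is swapping the slurp for a while loop over the input handle (same placeholders as before):

    use strict;
    use warnings;

    # Third pass (sketch): read the input one line at a time instead of
    # slurping it, so only the current line plus at most 10,000 buffered
    # records are ever in memory.
    my $infile   = 'input.txt';
    my $outfile  = 'output.txt';
    my $flush_at = 10_000;

    open my $in,  '<', $infile  or die "Can't read $infile: $!";
    open my $out, '>', $outfile or die "Can't write $outfile: $!";

    my @buffer;
    while (my $line = <$in>) {      # one input line in memory at a time
        chomp $line;
        push @buffer, make_record($line);
        if (@buffer == $flush_at) {
            print {$out} "$_\n" for @buffer;
            undef @buffer;
        }
    }
    print {$out} "$_\n" for @buffer;    # final partial flush
    close $in;
    close $out;

    sub make_record {
        my ($line) = @_;
        return "record for: $line";
    }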
I'm sure this story isn't exactly news to most of you, but it was a fun experience for me. Not a lot of emphasis is placed on performance for general user apps. People have become so used to the fact that most boxes have large amounts of memory and storage that tweaking an app to run just a bit faster, or to be a bit less of a memory hog, is not even on the development radar. In fact, I tried telling some of my coworkers why I was so happy with the app, but none of them seemed to think it was a big deal. I think they would have been perfectly happy shipping a product that would have used around 650MB of memory - that's what swap space is for, right?