I recently undertook a task for a client that involved generating a large amount of data. We're mostly a Java shop these days, but Perl seemed like the best choice for the data munging required.
It was quite simple to implement the algorithm I was given. Watching how that implementation changed over the course of the project has been a real learning experience for me.
Quick, non-NDA-breaking overview of the project:
Create a set of data, with one entry of output corresponding to each line of input in a supplied file. Verify that there are no duplicate lines of data, and that the data conforms to its rules. The generated data tends to be quite large - 100,000 lines of input produce a 43MB output file.
First pass:
The first implementation I came up with read the entire input file into memory, then looped over it, creating each line of data. Once all of the data was created, it was written out to the file.
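A rough sketch of what that first pass looked like (the file names and make_record() below are hypothetical placeholders, since the real generation rules are under NDA):

    use strict;
    use warnings;

    # First pass (sketch): slurp the entire input file, build every
    # record in memory, then write the whole output at the end.
    my $infile  = 'input.txt';      # placeholder names
    my $outfile = 'output.txt';

    open my $in, '<', $infile or die "Can't read $infile: $!";
    my @lines = <$in>;              # whole input file held in memory
    close $in;

    my @records;
    for my $line (@lines) {
        chomp $line;
        push @records, make_record($line);   # every record held in memory too
    }

    open my $out, '>', $outfile or die "Can't write $outfile: $!";
    print {$out} "$_\n" for @records;
    close $out;

    # Stand-in for the real (NDA'd) record generation.
    sub make_record {
        my ($line) = @_;
        return "record for: $line";
    }

With both @lines and @records growing with the input, memory usage scales linearly with the size of the input file.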
The problems with this approach should be obvious. Testing with data sets of 10, 100, 1,000, 10,000, and 100,000 entries, I noted that both the run time and the memory used grew by a straight factor of ten at each step. Creating 100,000 lines of data required ~65MB of memory. Not great, but not a showstopper either.
Then, we got word that we would be creating 1,000,000 (yes, one meeelion) records at a time, far above the upper bound we had tested. At a straight factor of ten increase in memory, that run would need something like 650MB, so I could see that something would have to change.
Second pass:
For the second pass, I used a simple caching mechanism. The output file was created and held open for the duration of the script. Once the number of generated records in memory reached 10,000, they were written to disk and the in-memory storage was undef'ed.
This little trick had the benefit of reducing the maximum memory footprint to around 8MB, while not increasing the run time noticeably.
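In sketch form, the change amounts to keeping the output handle open and flushing the buffer every 10,000 records (again with placeholder names and a dummy make_record()):

    use strict;
    use warnings;

    # Second pass (sketch): the output file stays open for the whole
    # run, and the record buffer is flushed to disk and undef'ed every
    # 10,000 records. The input is still slurped at this stage.
    my $infile   = 'input.txt';
    my $outfile  = 'output.txt';
    my $flush_at = 10_000;

    open my $in, '<', $infile or die "Can't read $infile: $!";
    my @lines = <$in>;
    close $in;

    open my $out, '>', $outfile or die "Can't write $outfile: $!";

    my @buffer;
    for my $line (@lines) {
        chomp $line;
        push @buffer, make_record($line);
        if (@buffer == $flush_at) {
            print {$out} "$_\n" for @buffer;
            undef @buffer;          # release the buffered records
        }
    }
    print {$out} "$_\n" for @buffer;    # flush whatever is left over
    close $out;

    sub make_record {
        my ($line) = @_;
        return "record for: $line";
    }

The buffer size of 10,000 is the only tuning knob here; a smaller value trades a little more I/O for a smaller memory ceiling.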
Third pass:
For the final iteration, I tackled the input file problem. Instead of reading the whole file into memory and then looping over it, I changed the code to read the data one line at a time. As each line was read, a record was created. In this manner, only one line of input and at most 10,000 records are ever in memory at once. This final tweak brought the memory consumption down to 3MB max.
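In sketch form, the only structural difference from the second pass is swapping the slurp for a while loop over the input handle (same placeholders as before):

    use strict;
    use warnings;

    # Third pass (sketch): read the input one line at a time instead of
    # slurping it, so only the current line plus at most 10,000 buffered
    # records are ever in memory.
    my $infile   = 'input.txt';
    my $outfile  = 'output.txt';
    my $flush_at = 10_000;

    open my $in,  '<', $infile  or die "Can't read $infile: $!";
    open my $out, '>', $outfile or die "Can't write $outfile: $!";

    my @buffer;
    while (my $line = <$in>) {      # one input line in memory at a time
        chomp $line;
        push @buffer, make_record($line);
        if (@buffer == $flush_at) {
            print {$out} "$_\n" for @buffer;
            undef @buffer;
        }
    }
    print {$out} "$_\n" for @buffer;    # final partial flush
    close $in;
    close $out;

    sub make_record {
        my ($line) = @_;
        return "record for: $line";
    }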
I'm sure this story isn't exactly news to most of you, but it was a fun experience for me. Not a lot of emphasis is placed on performance for general user apps. People have become so used to the fact that most boxes have large amounts of memory and storage that tweaking an app to run just a bit faster, or to be a bit less of a memory hog, is not even on the development radar. In fact, I tried telling some of my coworkers why I was so happy with the app, but none of them seemed to think it was a big deal. I think they would have been perfectly happy shipping a product that would have used around 650MB of memory - that's what swap space is for, right?