“Holy COBOL, Batman! I do believe that there is Yet Another Way To Do It!”
Consider sorting the input file. (And please don’t imagine that “five million lines” is really “a particularly large file” for a modern digital computer. The job will be done in a small handful of seconds.)
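The sorting step itself is trivial. Here is a minimal sketch in Python (the key being the first tab-separated field is an assumption, as are the file names); at five million short lines an in-memory sort is perfectly comfortable, and for truly enormous files the system sort(1) utility will do an external merge sort for you:

```python
def sort_records(in_path, out_path):
    """Sort a record file on its key ... here assumed to be the first
    tab-separated field ... so that equal keys land on adjacent lines."""
    with open(in_path) as f:
        # sorted() is stable: records with equal keys keep their input order.
        lines = sorted(f, key=lambda line: line.split("\t", 1)[0])
    with open(out_path, "w") as out:
        out.writelines(lines)
```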
When you do that, all occurrences of all records which have the same key will be adjacent, so you can design the entire algorithm to work by comparing “the record that we have just read from the input file” to “a saved copy of the previous record, if any, that we have read.” Suddenly, the algorithm does not require obnoxious amounts of memory ... in fact, it requires hardly any memory at all. You only need enough memory to hold two records: this one, and the preceding one.
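That “compare this record to the previous one” loop can be sketched as follows (again a Python sketch, not the poster’s actual code; it assumes the input is already sorted and that the key is the first tab-separated field, and it counts the occurrences of each key as one example of what you might do with the adjacent records):

```python
def count_keys(sorted_lines):
    """Given lines already sorted by key, yield (key, count) pairs while
    holding only the current and the previous record in memory."""
    prev_key, count = None, 0
    for line in sorted_lines:
        key = line.split("\t", 1)[0]   # assumed: key = first tab-separated field
        if key == prev_key:
            count += 1                 # same group as the previous record
        else:
            if prev_key is not None:
                yield prev_key, count  # the previous group is now complete
            prev_key, count = key, 1   # start a new group
    if prev_key is not None:
        yield prev_key, count          # flush the final group
```

Because it is a generator reading one line at a time, you can pass it an open file handle directly and stream through all five million lines without ever holding more than two records’ worth of state.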
For more information about this technique ... which, by the way, is considerably older than digital computers ... google the term “unit record processing.” This is what IBM was doing with all those punched cards, and it is also what was being done in all those sci-fi movies with the rapidly spinning tapes. Of course, the original impetus for doing things this way was that you didn’t have enough memory (or computing horsepower) available to do it any other way. Nevertheless, “old” though the idea might be, it is still one of the most powerful concepts in data processing, and it remains extremely relevant (although usually overlooked) today. Dr. Knuth titled one of his volumes Sorting and Searching for a very solid reason.