in reply to Help for finding duplicates in huge files

With memory sizes as they are today, 1.5 million rows is “just not that interesting.” If you had data volumes several times larger than that, you could do it the way it was done before computers came along: sort both trays of punched cards on the same key, then run both trays through the merge machine at the same time.
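
To make that concrete, here is a minimal sketch of the same two-tray merge in Python. The file names and the assumption that each file is already sorted on the whole line are mine, not part of the original question:

    # Two files, each already sorted on the same key (here, the whole line).
    # Walk both at once, always advancing the stream holding the smaller
    # record; equal records are the matches.
    def merge_match(path_a, path_b):
        with open(path_a) as fa, open(path_b) as fb:
            a, b = fa.readline(), fb.readline()
            while a and b:
                if a < b:
                    a = fa.readline()      # advance the stream with the smaller record
                elif a > b:
                    b = fb.readline()
                else:
                    yield a.rstrip("\n")   # same key in both streams: a duplicate
                    a, b = fa.readline(), fb.readline()

    if __name__ == "__main__":
        for line in merge_match("old.txt", "new.txt"):
            print(line)

Because each file is read front to back exactly once, memory use stays constant no matter how large the inputs grow.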

If you are using an SQL database, simply ORDER BY the same key on each side. When you process n data streams that you know to be identically sorted, you repeatedly select the smallest of the n streams’ “current records,” advance that stream, and continue until every stream is at end-of-file; records that turn up with the same key in more than one stream are your duplicates. A sketch of the n-stream version follows.
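
Here is that n-stream merge sketched in Python. The file names and the “first comma-separated field is the key” layout are assumptions for the example; heapq.merge always hands back the smallest current record among the open streams, so equal keys come out adjacent and groupby can flag them:

    import heapq
    from itertools import groupby

    def key_of(line):
        # Assumed record layout: key is the first comma-separated field.
        return line.split(",", 1)[0]

    def duplicate_keys(paths):
        files = [open(p) for p in paths]
        try:
            # heapq.merge yields the smallest "current record" across all streams.
            merged = heapq.merge(*files, key=key_of)
            for key, group in groupby(merged, key=key_of):
                records = list(group)
                if len(records) > 1:      # same key seen in more than one place
                    yield key, records
        finally:
            for f in files:
                f.close()

    if __name__ == "__main__":
        for key, records in duplicate_keys(["part1.csv", "part2.csv", "part3.csv"]):
            print(key, len(records))

The same idea works whether the sorted streams come from pre-sorted files or from n database cursors each running an ORDER BY on the shared key.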

External sorting algorithms are among the most heavily studied in computer science. Dr. Knuth titled one of his tomes Sorting and Searching for a very good reason.