in reply to Reg: Performance
Yup... do it the COBOL way. Do it just like they did it before digital computers existed.
Sort the two files. Now you can use the diff command, or logic equivalent to it, to locate the records that are common and the records (in each file) that are different.
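The "sort, then diff" idea can be sketched with Python's standard library; the two record lists below are made-up stand-ins for the real files:

```python
import difflib

# Hypothetical record sets standing in for the two unsorted files.
file_a = ["cherry", "apple", "banana"]
file_b = ["date", "banana", "cherry"]

# Step 1: sort both. Step 2: a diff of the sorted lists lines up the
# common records and isolates the differences on each side.
a, b = sorted(file_a), sorted(file_b)
for line in difflib.unified_diff(a, b, lineterm=""):
    print(line)
```

Lines prefixed with a space are common to both files; `-` marks records only in the first, `+` records only in the second.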
As a test, while writing this, I generated a disk-file of five million random strings and disk-sorted it. Nine seconds. Hmm... kinda slow...
Let’s say that each of these is ... not a file of millions of records, but one of two packs of playing-cards that you accidentally dropped on the floor. (Or, as in my case, a great big box of 80-column punched cards that you just dropped onto the floor.) You possess a “magic sorting-box” that can sort a deck of cards in the blink of an eye. Put both decks, in turn, into the box and push the button. Now take the two sorted decks and turn over the top card of each. Because you know that both decks are sorted identically, you can answer a great many questions just by looking at the next card in each deck (if there is one) and remembering what the preceding card was. You never have to “search,” nor do you have to go backward. In just one sequential pass through the two files in parallel, you can merge them, compare them, detect whether records are missing, identify gaps and their size, and so on. (Almost) no memory required.
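The card-deck walk above can be sketched as a single sequential pass over two sorted lists; the function name and sample decks are made up for illustration:

```python
def compare_sorted(a, b):
    """One sequential pass over two sorted record lists.

    Returns (common, only_in_a, only_in_b) -- no searching,
    no backing up, (almost) no memory beyond the output.
    """
    common, only_a, only_b = [], [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:           # same "top card" on both decks
            common.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:          # this record is missing from deck B
            only_a.append(a[i]); i += 1
        else:                      # this record is missing from deck A
            only_b.append(b[j]); j += 1
    only_a.extend(a[i:])           # whatever is left over on either side
    only_b.extend(b[j:])
    return common, only_a, only_b

# Two hypothetical sorted "decks":
print(compare_sorted(["apple", "banana", "cherry"],
                     ["banana", "cherry", "date"]))
# → (['banana', 'cherry'], ['apple'], ['date'])
```

Each record is touched exactly once, which is why the whole comparison is one pass and needs essentially no working memory.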
In the days of reel-to-reel magnetic tapes and punched cards, yup, that’s what they were doing. And it worked. Still does. Long before IBM got into the business of renting computers, they rented punched-card sorters and collating machines. (And sold punchcards by the truckload.)
Replies are listed 'Best First'.
- Re^2: Reg: Performance by sivaraman (Initiate) on Oct 29, 2010 at 04:18 UTC
- by choroba (Cardinal) on Oct 29, 2010 at 07:45 UTC