in reply to Reg: Performance

Yup... do it the COBOL way.   Do it just like they did it before digital computers existed.

Sort the two files.   Now you can use the diff command, or logic equivalent to it, to locate the records that are common to both files and the records (in each file) that differ.

As a test, while writing this, I generated a disk-file of five million random strings and disk-sorted it.   Nine seconds.   Hmm... kinda slow...

Let’s say that these are ... not two files of millions of records, but two packs of playing-cards that you accidentally dropped on the floor.   (Or, as in my case, a great big box of 80-column punched cards that you just dropped onto the floor.)   You possess a “magic sorting-box” that can sort a deck of cards in the blink of an eye.   Put both decks, in turn, into the box and push the button.   Take the two now-sorted decks and turn over the top card in each.   Because you know that the two decks are sorted identically, you can answer a great many questions just by looking at the next card in each of the two decks (if there is one, and remembering what the preceding card was).   You do not have to “search,” nor do you have to go backward.   In just one sequential pass through the two files in parallel, you can merge them, compare them, detect whether records are missing, identify gaps and their size, and so on.   (Almost) no memory required.

In the days of reel-to-reel magnetic tapes and punched cards, yup, that’s what they were doing.   And it worked.   Still does.   Long before IBM got into the business of renting computers, they rented punched-card sorters and collating machines.   (And sold punchcards by the truckload.)

Replies are listed 'Best First'.
Re^2: Reg: Performance
by sivaraman (Initiate) on Oct 29, 2010 at 04:18 UTC

    Dear Friend, I am a little bit confused here. In DUMP_A I have a unique id, and in DUMP_B I have the unique id along with an ACCT #. Consider these to be my inputs:

    DUMP_A:
    ffe47aadf1add54e3a8e925b40530c29
    ffe47aadf1add54e3a8e925b40530c29
    ffe47aadf1add54e3a8e925b40530c20
    ffe47aadf1add54e3a8e925b40530c29
    ffe47aadf1add54e3a8e925b40530c29
    ffe47aadf1add54e3a8e925b40530c29
    ffe47aadf1add54e3a8e925b40530c29
    ffe47aadf1add54e3a8e925b40530c20
    ffe47aadf1add54e3a8e925b40530c29
    ffe47aadf1add54e3a8e925b40530c29
    ffe47aadf1add54e3a8e925b40530c29
    ffe47aadf1add54e3a8e925b40530c29

    DUMP_B:
    ffe47aadf1add54e3a8e925b40530c20|323568945210360
    ffe47aadf1add54e3a8e925b40530c20|323568945210361
    ffe47aadf1add54e3a8e925b40530c20|323568945210362
    ffe47aadf1add54e3a8e925b40530c20|323568945210363
    ffe47aadf1add54e3a8e925b40530c20|323568945210364
    It should take each unique id from DUMP_A and check it against DUMP_B; if it matches, then both values have to be written to a new file. For the above inputs, the output should be:
    ffe47aadf1add54e3a8e925b40530c20|323568945210360
    ffe47aadf1add54e3a8e925b40530c20|323568945210361
    ffe47aadf1add54e3a8e925b40530c20|323568945210362
    ffe47aadf1add54e3a8e925b40530c20|323568945210363
    ffe47aadf1add54e3a8e925b40530c20|323568945210364

    - Thank you.
      BTW, the id in DUMP_A is not unique.
      Nevertheless, you can still use the algorithm: Sort both input files. Take the first id from DUMP_A. Process DUMP_B while it contains the same id. Move to the next distinct id in DUMP_A, and repeat until the end of DUMP_A.