> … So data gets read once and written once.
No. You forgot that the sort needs to merge the temporary files back into the sorted output file.
So each record is read once, sorted, and written once to a temporary file. It is then read again from that temporary file during the merge and written to the sorted output file. So that's two reads and two writes.
And the same has to be applied to the second file.
And then both sorted files have to be read, merged and written to the final merged file.
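For concreteness, here is a minimal Perl sketch of the first ("split") phase of that sort&merge: the big file is read in memory-sized chunks, each chunk is sorted in memory and written out as a temporary run. The filenames and chunk size are made up for illustration, not anyone's production code.

```perl
use strict;
use warnings;

# Split phase of an external sort: read the big file in chunks
# that fit in memory, sort each chunk, write it as a sorted run.
my $CHUNK = 1_000_000;            # records per in-memory run (hypothetical)
my @runs;                         # names of the temporary run files
my $runNo = 0;
my @buf;

open my $in, '<', 'bigfile1.dat' or die $!;   # hypothetical filename
while( <$in> ) {
    push @buf, $_;
    if( @buf == $CHUNK or eof $in ) {
        my $run = sprintf 'run%03d.tmp', $runNo++;
        open my $out, '>', $run or die $!;
        print $out sort @buf;     # each record: one read + one write so far
        close $out;
        push @runs, $run;
        @buf = ();
    }
}
close $in;
# Every record has now been read once and written once; the merge
# phase reads and writes every one of them a second time.
```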
> I think the big difference is that in the hash table scenario, each disk operation is going to be getting a random record, maybe 2 different records from different parts of the file. Whereas, the merge is going to be getting thousands of sequential records and does this when one of its n queues runs dry.
Again, no. You seem to be assuming that the merge consists of: reading temp file 1 & writing it to merged file; reading temp file 2 & writing it to merged file; and so on. It doesn't.
It requires reading the first record (or block) from each of the temporary files; comparing the records from all the files against each other, selecting the lowest, and writing it; then reading the next record from that file (or its cache); and repeating until the end of all the files.
There are no "thousands of sequential records". There may be occasional runs of sequential records from a given file, but each of those records needs to be compared against the next records from each of the other files before it is known whether it is sequentially the next record to be written or not. And sometimes there will be runs where no two sequential records in the output file come from the same input file.
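To make that access pattern concrete, here is a rough Perl sketch of the merge loop just described: keep one current record per temporary run, repeatedly pick the lowest across all runs, write it, and refill only from whichever run it came from. The run filenames are hypothetical, and a real implementation would use a heap rather than a linear scan, but the I/O pattern is the same.

```perl
use strict;
use warnings;

# K-way merge of the sorted runs produced by the split phase.
my @runs = glob 'run*.tmp';                       # hypothetical run files
my @fh   = map { open my $f, '<', $_ or die $!; $f } @runs;
my @head = map { scalar <$_> } @fh;               # current record from each run

open my $out, '>', 'sorted1.dat' or die $!;       # hypothetical output name

while( grep defined, @head ) {
    # Compare the current record from every run and pick the lowest.
    my $low;
    for my $i ( 0 .. $#head ) {
        next unless defined $head[$i];
        $low = $i if !defined $low or $head[$i] lt $head[$low];
    }
    print $out $head[$low];
    $head[$low] = readline $fh[$low];             # refill from that run only
}
close $out;
```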
It is exactly this entirely unpredictable, entirely random access pattern, which so badly defeats caching heuristics, combined with nearly three times as many IOPs, that makes sort&merge at least an order of magnitude slower than hash&memory, all else being equal.
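By contrast, the hash&memory approach, assuming the task is matching two files on a key and that the first file's keys and data fit in RAM, reads each file exactly once, sequentially, and writes the output once, with no temporary files at all. A sketch, with a hypothetical tab-delimited layout and made-up filenames:

```perl
use strict;
use warnings;

# Load file 1 into a hash keyed on the record key: read once, held in RAM.
my %rec1;
open my $f1, '<', 'bigfile1.dat' or die $!;
while( <$f1> ) {
    chomp;
    my( $key, $rest ) = split /\t/, $_, 2;
    $rec1{ $key } = $rest;
}
close $f1;

# Stream file 2 once, looking up each key in memory and writing matches.
open my $f2,  '<', 'bigfile2.dat' or die $!;
open my $out, '>', 'merged.dat'   or die $!;
while( <$f2> ) {
    chomp;
    my( $key, $rest ) = split /\t/, $_, 2;
    next unless exists $rec1{ $key };
    print $out join( "\t", $key, $rec1{ $key }, $rest ), "\n";
}
close $f2;
close $out;
```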
Please note: that isn't based on theoretical guesswork, but on actual tests. And not just once. And not just recently. But many times, going back many years.
Almost exactly 30 years ago, we sped up the merging of 5 sets of 6 million records by 11 times using exactly this technique, on twin PDP-11/60s with (for the time, huge) "ganged" RK07 28MB disk packs. Because the data was bigger than the combined disks we could have online, we merged the key fields and record offsets in memory, loading each of the input disk packs once in turn and then the output pack once, rather than constantly swapping input packs with temporary packs, then the input packs again, and then the output pack, as the previous sort-all-5-then-merge process required.
And I went through the same comparative process, using real files, disks and data, just a couple of months ago in response to a post right here on this site. And a couple of years before that on my previous machine.
And it is neither theory nor speculation when I tell you that the differential has never been below an order of magnitude; and because I/O speeds have stagnated whilst CPU speeds have risen exponentially (until recently), and RAM sizes have grown and continue to grow exponentially, the performance gain of hash&memory over sort&merge is increasing.