"The data has to be read twice and written twice. The merging into the final form happens during the 2nd write."
I'm sorry, but yet again, no.
The only reason to sort is if the dataset is too large to fit in memory. Otherwise there is absolutely no good reason to use an O(N log N) sort algorithm when an O(N) hash algorithm does the job.
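To make that concrete, here is a rough sketch of the hash approach, assuming the task is to join two tab-delimited files on their first field and that one of the files fits in memory as a hash. The file names, delimiter and key column are just placeholders for illustration, not details taken from the thread:

    use strict;
    use warnings;

    # Sketch only: join fileA.txt and fileB.txt on their first tab-delimited
    # field, assuming fileA fits in memory as a hash. File names, delimiter
    # and key column are illustrative assumptions.
    my %recA;
    open my $fhA, '<', 'fileA.txt' or die "fileA.txt: $!";
    while ( my $line = <$fhA> ) {            # one pass over the first file
        chomp $line;
        my ( $key, $rest ) = split /\t/, $line, 2;
        $recA{ $key } = $rest;               # one O(1) hash insert per record
    }
    close $fhA;

    open my $fhB, '<', 'fileB.txt'  or die "fileB.txt: $!";
    open my $out, '>', 'merged.txt' or die "merged.txt: $!";
    while ( my $line = <$fhB> ) {            # one pass over the second file
        chomp $line;
        my ( $key, $rest ) = split /\t/, $line, 2;
        print $out join( "\t", $key, $recA{ $key } // '', $rest // '' ), "\n";
    }
    close $out;

Each record of each file is read exactly once and each output record written exactly once; no sorting anywhere.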
Only once the dataset has grown so large that it is impossible to hold all the records in memory at one time does the sort&merge algorithm have any merit whatsoever.
And if you cannot hold the whole dataset in memory then you cannot sort it in memory in a single operation. So, you use a disk sort that reads a subset of the data and writes the sorted subset to a temporary file. Then you read another subset into memory, sort it and write the results to a temporary file. And so on until you have S sorted subsets in temporary files. Now you need to merge those subsets together.
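A rough sketch of that chunk-and-sort phase, assuming line-oriented records; the input name and the one-million-line chunk size are arbitrary choices for illustration:

    use strict;
    use warnings;

    # Sketch of the chunk-and-sort phase: read the big file one subset at a
    # time, sort each subset in memory, and write it to its own temp file.
    # The input name and the chunk size are illustrative assumptions.
    my $CHUNK = 1_000_000;
    my @temps;

    open my $in, '<', 'big.txt' or die "big.txt: $!";
    until ( eof $in ) {
        my @subset;
        while ( @subset < $CHUNK and defined( my $line = <$in> ) ) {
            push @subset, $line;
        }
        my $temp = sprintf 'subset_%03d.tmp', scalar @temps;
        open my $out, '>', $temp or die "$temp: $!";
        print {$out} sort @subset;           # in-memory sort of this subset only
        close $out;
        push @temps, $temp;                  # S temp files once the loop ends
    }
    close $in;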
Read and sort, and write to temp; read from temp to merge and write to the sorted output. That's 2N reads; 2N writes; one sorted file.
Now you need to repeat that for the second file. 4N reads; 4N writes; two sorted files.
Now you need to read both sorted files, merge them and write the final merged output.
That's 8N reads; 8N writes; one resultant merged file.
Total: 16N IOPs, compared to 6N IOPs for the hash&memory algorithm.
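The merge step itself, whether it is merging the S temp subsets into one sorted file or merging the two sorted files at the end, looks roughly like this two-way version; file names are again just placeholders:

    use strict;
    use warnings;

    # Sketch of a two-way merge of two already-sorted files into one.
    # Each input record is read once and each output record written once,
    # on top of the reads and writes already spent producing the sorted
    # inputs. File names are illustrative assumptions.
    open my $fhA, '<', 'sorted_A.txt' or die "sorted_A.txt: $!";
    open my $fhB, '<', 'sorted_B.txt' or die "sorted_B.txt: $!";
    open my $out, '>', 'merged.txt'   or die "merged.txt: $!";

    my $lineA = <$fhA>;
    my $lineB = <$fhB>;
    while ( defined $lineA and defined $lineB ) {
        if ( $lineA le $lineB ) {            # default string order, as sort() used
            print $out $lineA;
            $lineA = <$fhA>;
        }
        else {
            print $out $lineB;
            $lineB = <$fhB>;
        }
    }
    # Drain whichever input still has records left.
    print $out $lineA, <$fhA> if defined $lineA;
    print $out $lineB, <$fhB> if defined $lineB;
    close $_ for $fhA, $fhB, $out;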
And if, after three attempts at explaining this to you, you still cannot see it, please keep it to yourself, because you are simply wrong. Sorry.