"Disk-based sorts get around the memory problem by reading a chunk of the source, sorting it and writing it to a temp file; read another chunk, sort and write to temp."

Yes, agree.
"Whilst the reading and writing of the chunks are sequential, they are interleaved, and that alone has a substantial performance impact over reading the whole sequentially and then writing the whole sequentially."

I am not so sure about this. I guess that what you mean by "interleaved" is that the program switches back and forth from reading to writing. Typically the huge input file would be read sequentially, but in big hunks (maybe thousands of records at a time): read a hunk, sort it, then write the sorted hunk to a new file. So the data gets read once and written once.
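For concreteness, here is a minimal sketch of that split phase in Perl. This is not anybody's actual code from the thread; the hunk size of 100_000 records and the file names are made-up values for illustration, and the point is the I/O pattern: each hunk is one long sequential read followed by one long sequential write.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Split phase of an external sort: read a big hunk of records,
    # sort the hunk in memory, write it to its own temp file, repeat.
    my $hunk_size = 100_000;           # records per hunk (made-up number)
    my $infile    = 'huge_input.txt';  # hypothetical input file name

    open my $in, '<', $infile or die "Cannot open $infile: $!";

    my (@hunk, @temp_files);
    my $n = 0;

    while (my $line = <$in>) {
        push @hunk, $line;
        if (@hunk >= $hunk_size) {
            push @temp_files, write_sorted_hunk(\@hunk, $n++);
            @hunk = ();
        }
    }
    push @temp_files, write_sorted_hunk(\@hunk, $n++) if @hunk;
    close $in;

    print "Wrote ", scalar(@temp_files), " sorted temp files\n";

    sub write_sorted_hunk {
        my ($records, $idx) = @_;
        my $name = "sorted_hunk_$idx.tmp";
        open my $out, '>', $name or die "Cannot open $name: $!";
        print {$out} sort @$records;   # one long sequential write per hunk
        close $out;
        return $name;
    }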
In the create-hash-table scenario, the data only has to be read once - there is no write. Presumably the data is also read in big hunks, say the same size hunk as above.
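Something along these lines is what I have in mind for the hash pass. Again just a sketch: the key being the first whitespace-separated field, and the job being "count records per key", are assumptions on my part since I don't know the exact collation wanted. Note that even with a line-at-a-time loop, perl's buffered I/O is already pulling the data off the disk in blocks.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # One sequential pass over the file, building the hash as we go.
    # The key field and the "count per key" job are assumptions.
    my $infile = 'huge_input.txt';     # hypothetical input file name
    open my $in, '<', $infile or die "Cannot open $infile: $!";

    my %count;
    while (my $line = <$in>) {
        my ($key) = split ' ', $line;
        next unless defined $key;
        $count{$key}++;                # collate in memory; no write pass
    }
    close $in;

    printf "%d distinct keys seen\n", scalar keys %count;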
The system will certainly allow other I/O processing right after the "big block" request has been satisfied. The system is probably also interleaving other disk I/O requests while the "big block" operation is proceeding. That is true in either scenario above, so when the second read needs to happen, the disk head is most likely somewhere else and not sitting right at the next block to be read. I see a 2x difference in total I/O (read once and write once, versus read once with no write), which is "substantial". I don't know if you meant that it was a lot more than that or not.
There are of course all kinds of things that can be happening cache-wise. It is not clear how much, if any, of the write data will be kept in the cache once it is flushed to disk; the file presumably being opened as write-only is a big clue that it won't be needed again. In both scenarios there could also be read-ahead happening.
I think the big difference is that in the hash-table scenario, each disk operation is going to be fetching a random record, maybe two different records from different parts of the file. The merge, on the other hand, is going to be getting thousands of sequential records at a time, and only does that when one of its n queues runs dry. So it looks to me like the hash approach has the potential for generating many more disk operations than the sort/merge - enough to swamp that initial 2x I/O difference.
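Here is a rough sketch of that merge phase, again only to illustrate the I/O pattern rather than any particular implementation: one "front" record per temp file, and a queue is only refilled from its file when its front record has been written out. The temp file names follow on from the split sketch above; a real merge over many files would use a priority queue instead of this linear scan for the minimum.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Merge phase: n sorted temp files, one buffered "front" record each.
    # A file is only read again when its front record has been written,
    # so each temp file is consumed sequentially.
    my @temp_files = glob 'sorted_hunk_*.tmp';   # names from the split sketch
    my (@fh, @front);

    for my $i (0 .. $#temp_files) {
        open $fh[$i], '<', $temp_files[$i]
            or die "Cannot open $temp_files[$i]: $!";
        $front[$i] = readline($fh[$i]);          # prime each queue
    }

    open my $out, '>', 'merged_output.txt' or die "Cannot open output: $!";

    while (1) {
        # Pick the smallest front record among the queues that are not dry.
        my $min;
        for my $i (0 .. $#front) {
            next unless defined $front[$i];
            $min = $i if !defined($min) || $front[$i] lt $front[$min];
        }
        last unless defined $min;                # every queue has run dry

        print {$out} $front[$min];
        $front[$min] = readline($fh[$min]);      # refill only that queue
    }
    close $out;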
How all of this is going to work out depends very much on what system it is running on, how the caching is set up, what other processes are doing at the same time, and how much of the total system resources this collation process is able to use... I think this is going to come down to a YMMV situation. And although the sort is an N log N operation versus an N operation for scanning and building the hash, the whole job is going to be dominated by disk I/O time. The CPU operations won't matter even if there are a heck of a lot more of them, because they are so fast compared with the disk. I suspect the hash approach uses less actual CPU time, but takes more "wall clock" time.
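If anyone wants to check the wall-clock versus CPU split on their own system, something like the following will show it using Perl's built-in times. run_job() here is only a placeholder busy loop; drop in whichever approach you are actually testing, and the gap between wall-clock and CPU time is roughly the time spent waiting on the disk.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Compare wall-clock time with CPU time for a workload.
    # run_job() is a stand-in: replace it with the sort/merge or hash pass.
    sub run_job {
        my $x = 0;
        $x += $_ for 1 .. 5_000_000;   # placeholder CPU-only busy work
        return $x;
    }

    my $wall_start     = time;
    my ($user0, $sys0) = times;

    run_job();

    my ($user1, $sys1) = times;
    my $wall = time - $wall_start;
    my $cpu  = ($user1 - $user0) + ($sys1 - $sys0);

    printf "wall clock: %d s, CPU: %.2f s\n", $wall, $cpu;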
In reply to Re^4: Working on huge (GB sized) files by Marshall
in thread Working on huge (GB sized) files by vasavi