in reply to Re^2: Working on huge (GB sized) files
in thread Working on huge (GB sized) files
First, I mostly agree with everything you've said, hence the first line of my OP above: "More of a guess given your lack of details,".
But there are a couple of points worth picking up on:
There can be trouble with step 3, because it involves seeking around within the files to gather up the records - reading randomly selected records usually takes a lot longer than reading sequential records.
Agreed. But this is often more than balanced out by the merge steps of disk-based sorts.
Disk-based sorts get around the memory problem by reading a chunk of the source, sorting it, and writing it to a temp file; then reading another chunk, sorting it, and writing it to another temp file; and so on.
Whilst the reading and writing of each chunk is sequential, the reads and writes are interleaved, and that alone has a substantial performance impact compared with reading the whole file sequentially and then writing the whole file sequentially.
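As a rough illustration of that chunking phase, here is a minimal sketch only (assuming line-oriented records; the input name, chunk size and temp-file naming are made up for the example, not taken from the thread):

#!/usr/bin/perl
use strict;
use warnings;

# Phase 1 of an external sort: read a chunk, sort it in memory, write it
# to a temp file; repeat until the source is exhausted.
# 'big.dat', the chunk size and the temp-file names are illustrative only.
my $chunk_size = 1_000_000;          # records per chunk; tune to available memory
my @temp_files;
my @chunk;
my $n = 0;

open my $in, '<', 'big.dat' or die "open big.dat: $!";
while ( my $rec = <$in> ) {
    push @chunk, $rec;
    if ( @chunk == $chunk_size or eof $in ) {
        my $temp = sprintf 'chunk%04d.tmp', $n++;
        open my $out, '>', $temp or die "open $temp: $!";
        print {$out} sort @chunk;    # in-memory sort of this chunk only
        close $out;
        push @temp_files, $temp;
        @chunk = ();
    }
}
close $in;

Each chunk is read and written sequentially, but on disk the reads from the source and the writes to the temp files interleave with one another.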
But of far bigger significance is the merge phase across the temporary files. Whilst the reading of records from any given temp file is sequential, collectively the reads are random. This has at least as much impact (disk seeks) as reading a single file randomly. But it also has a secondary impact: it forces the file-cache to juggle multiple active files, and the effects of that are far harder to predict and can be substantial.
The problem is the nature of the patterns of usage in the merge phase. Sometimes many records will be read from one temporary file, then many from another, then another, and then (say) back to the first. At other points, the access pattern will skip from file to file for every record.
This randomly changing pattern of usage confuses most intelligent caching algorithms by continually switching between mostly sequential and mostly random access, and all points in between. That confusion can actively work against performance: newly-read, part-accessed blocks may be flushed from the cache prematurely because previous access patterns suggested the access was random and so the rest of the block would not be used soon; or conversely, part-accessed blocks may be retained far longer than is good for the rest of the system because access patterns suggested sequential access and so they would be needed soon.
Most (if not all) caching heuristics only consider a single file at a time, even when multiple files are being accessed by the same process, so in effect, the multiple open files of the merge phase of an external sort act as if competing processes were at work.
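To make that access pattern concrete, here is a minimal sketch of the merge loop over those temp files (lexical ordering and line-oriented records assumed; this is not anyone's actual code from the thread, and it reuses the illustrative file names from the sketch above). Which file supplies the next record, and therefore which file is read next, is decided by the data one record at a time:

#!/usr/bin/perl
use strict;
use warnings;

# Phase 2 of an external sort: merge the sorted temp chunks.
# 'chunk*.tmp' and 'sorted.dat' are illustrative names only.
my @temp_files = glob 'chunk*.tmp';

my @fhs;
for my $file (@temp_files) {
    open my $fh, '<', $file or die "open $file: $!";
    push @fhs, $fh;
}

my @heads = map { scalar readline $_ } @fhs;    # one pending record per file

open my $out, '>', 'sorted.dat' or die "open sorted.dat: $!";
while ( grep { defined } @heads ) {
    # Find the file whose pending record sorts first...
    my ($min) = sort { $heads[$a] cmp $heads[$b] }
                grep { defined $heads[$_] } 0 .. $#heads;
    print {$out} $heads[$min];
    # ...then refill from that file. Which file gets hit next is
    # data-driven, so the reads hop between the temp files.
    $heads[$min] = readline $fhs[$min];
}
close $out;

Each individual filehandle is read strictly in order, but the "which file next" decision in the loop is what turns individually sequential streams into a collectively random pattern on disk.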
So the sequential reading of records randomly distributed amongst a number of temporary files is at least as costly as the random access of records from just two files, and usually much more so. And remember that in the sort&merge approach there are three merge phases, not one as with the hash&memory method.
The "wall clock" time of a seek operation and rotational delays is in msec. One of the sort comparisons is in nsecs. Thousands of comparisons can be performed in the time it takes to do one seek.
Yes, but ... :)
With the sort&merge, every record of both files has to be read from and written back to disk twice during the two sorts: once when sorting the chunks, and once when merging them. Then each record has to be read and written a third time when merging the two sorted files.
Assuming (for simplicity) equal sized files:
For the sort&merge, that is 8N reads and 8N writes, with 2N of the reads being effectively random.
For the hash&memory method, there are 4N reads and 2N writes, of which 2N reads are random.
So, even discounting the { 2 * P * O( N/P log(N/P) ) } sorting .v. { O( 2N ) } hashing differences, that's { 6N-SR + 2N-RR + 8N-SW } versus { 2N-SR + 2N-RR + 2N-SW }, where SR, RR and SW are sequential reads, random reads and sequential writes respectively. That's a lot of wall time to be made up.
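Purely as a back-of-the-envelope illustration of that gap (the record count, record size and throughput below are assumed numbers for the example, not figures from this thread), counting only the sequential volume, since the 2N random reads appear on both sides:

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative assumptions only -- substitute your own data and hardware figures.
my $N         = 50_000_000;    # records per file
my $rec_bytes = 100;           # average record size in bytes
my $seq_mbs   = 100;           # sustained sequential throughput, MB/s

my $secs_per_rec = $rec_bytes / ( $seq_mbs * 1024 * 1024 );

# The 2N-RR term is common to both methods (and, per the argument above,
# is at least as expensive on the sort&merge side), so the sequential
# volumes alone give a lower bound on the difference:
#   sort&merge : 6N-SR + 8N-SW = 14N records moved sequentially
#   hash&memory: 2N-SR + 2N-SW =  4N records moved sequentially
my $sort_merge = 14 * $N * $secs_per_rec;
my $hash_mem   =  4 * $N * $secs_per_rec;

printf "sort&merge sequential I/O : ~%.0f s\n", $sort_merge;
printf "hash&memory sequential I/O: ~%.0f s\n", $hash_mem;
printf "difference                : ~%.0f s\n", $sort_merge - $hash_mem;

With those made-up numbers that is roughly 670 s versus 190 s of sequential transfer alone, before any of the seek and cache effects discussed above are counted.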
Replies are listed 'Best First'.
  Re^4: Working on huge (GB sized) files
    by salva (Canon) on May 13, 2011 at 17:10 UTC
      by BrowserUk (Patriarch) on May 13, 2011 at 19:06 UTC
  Re^4: Working on huge (GB sized) files
    by Marshall (Canon) on May 13, 2011 at 19:22 UTC
      by BrowserUk (Patriarch) on May 13, 2011 at 20:14 UTC
      by Marshall (Canon) on May 13, 2011 at 22:47 UTC
      by BrowserUk (Patriarch) on May 13, 2011 at 23:18 UTC