in reply to Re^2: Working on huge (GB sized) files
in thread Working on huge (GB sized) files
First, I mostly agree with everything you've said, hence the first line of my OP above: "More of a guess given your lack of details,".
But there are a couple of points worth picking up on:
There can be trouble with step 3, because it involves seeking around within the files to gather up the records - reading randomly selected records usually takes a lot longer than reading sequential records.
Agreed. But this is often more than balanced out by the merge steps of disk-based sorts.
Disk-based sorts get around the memory problem by reading a chunk of the source, sorting it, and writing it to a temp file; then reading another chunk, sorting it, and writing it to another temp file; and so on.
Whilst the reading and writing of each chunk is sequential, the reads and writes are interleaved, and that alone has a substantial performance impact compared with reading the whole file sequentially and then writing the whole file sequentially.
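As a rough illustration of that chunking phase, here is a minimal sketch only (assuming line-oriented records; the input name, chunk size and temp-file naming are made up for the example, not taken from the thread):

#!/usr/bin/perl
use strict;
use warnings;

# Phase 1 of an external sort: read a chunk, sort it in memory, write it
# to a temp file; repeat until the source is exhausted.
# 'big.dat', the chunk size and the temp-file names are illustrative only.
my $chunk_size = 1_000_000;          # records per chunk; tune to available memory
my @temp_files;
my @chunk;
my $n = 0;

open my $in, '<', 'big.dat' or die "open big.dat: $!";
while ( my $rec = <$in> ) {
    push @chunk, $rec;
    if ( @chunk == $chunk_size or eof $in ) {
        my $temp = sprintf 'chunk%04d.tmp', $n++;
        open my $out, '>', $temp or die "open $temp: $!";
        print {$out} sort @chunk;    # in-memory sort of this chunk only
        close $out;
        push @temp_files, $temp;
        @chunk = ();
    }
}
close $in;

Each chunk is read and written sequentially, but on disk the reads from the source and the writes to the temp files interleave with one another.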
But of far bigger significance is the merge phase across the temporary files. Whilst the reading of records from any given temp file is sequential, collectively the reads are random. This has at least as much impact (disk seeks) as reading a single file randomly. But it also has a secondary impact: it forces the file-cache to juggle multiple active files, and the effects of that are far harder to predict and can be substantial.
The problem is the nature of the patterns of usage in the merge phase. Sometimes many records will be read from one temporary file, then many from another, then another, and then (say) back to the first. At other points, the access pattern will skip from file to file for every record.
This randomly changing pattern of usage confuses most intelligent caching algorithms by continually switching between mostly sequential and mostly random access, and all points in between. That confusion can actively work against performance: newly-read, part-accessed blocks may be flushed from the cache prematurely because previous access patterns suggested the access was random and so the rest of the block would not be used soon; or conversely, part-accessed blocks may be retained far longer than is good for the rest of the system because access patterns suggested sequential access and so they would be needed soon.
Most (if not all) caching heuristics only consider a single file at a time, even when multiple files are being accessed by the same process, so in effect, the multiple open files of the merge phase of an external sort act as if competing processes were at work.
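To make that access pattern concrete, here is a minimal sketch of the merge loop over those temp files (lexical ordering and line-oriented records assumed; this is not anyone's actual code from the thread, and it reuses the illustrative file names from the sketch above). Which file supplies the next record, and therefore which file is read next, is decided by the data one record at a time:

#!/usr/bin/perl
use strict;
use warnings;

# Phase 2 of an external sort: merge the sorted temp chunks.
# 'chunk*.tmp' and 'sorted.dat' are illustrative names only.
my @temp_files = glob 'chunk*.tmp';

my @fhs;
for my $file (@temp_files) {
    open my $fh, '<', $file or die "open $file: $!";
    push @fhs, $fh;
}

my @heads = map { scalar readline $_ } @fhs;    # one pending record per file

open my $out, '>', 'sorted.dat' or die "open sorted.dat: $!";
while ( grep { defined } @heads ) {
    # Find the file whose pending record sorts first...
    my ($min) = sort { $heads[$a] cmp $heads[$b] }
                grep { defined $heads[$_] } 0 .. $#heads;
    print {$out} $heads[$min];
    # ...then refill from that file. Which file gets hit next is
    # data-driven, so the reads hop between the temp files.
    $heads[$min] = readline $fhs[$min];
}
close $out;

Each individual filehandle is read strictly in order, but the "which file next" decision in the loop is what turns individually sequential streams into a collectively random pattern on disk.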
So the sequential reading of records randomly distributed amongst a number of temporary files is at least as costly as the random access of records from just two files, and usually much more so. And remember that in the sort&merge approach there are three merge phases, not one as with the hash&memory method.
The "wall clock" time of a seek operation and rotational delays is in msec. One of the sort comparisons is in nsecs. Thousands of comparisons can be performed in the time it takes to do one seek.
Yes, but ... :)
With the sort&merge, every record of both files has to be read from and written back to disk twice during the two sorts: once when sorting the chunks, and once when merging them. Then each record has to be read and written a third time when merging the two sorted files.
Assuming (for simplicity) equal sized files:
For the sort&merge, that is 8N reads and 8N writes, with 2N of the reads being effectively random.
For the hash&memory method, there are 4N reads and 2N writes, of which 2N reads are random.
So, even discounting the { 2 * P * O( N/P log(N/P) ) } sorting .v. { O( 2N ) } hashing differences, that's { 6N-SR + 2N-RR + 8N-SW } versus { 2N-SR + 2N-RR + 2N-SW }, where SR, RR and SW are sequential reads, random reads and sequential writes respectively. That's a lot of wall time to be made up.
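Purely as a back-of-the-envelope illustration of that gap (the record count, record size and throughput below are assumed numbers for the example, not figures from this thread), counting only the sequential volume, since the 2N random reads appear on both sides:

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative assumptions only -- substitute your own data and hardware figures.
my $N         = 50_000_000;    # records per file
my $rec_bytes = 100;           # average record size in bytes
my $seq_mbs   = 100;           # sustained sequential throughput, MB/s

my $secs_per_rec = $rec_bytes / ( $seq_mbs * 1024 * 1024 );

# The 2N-RR term is common to both methods (and, per the argument above,
# is at least as expensive on the sort&merge side), so the sequential
# volumes alone give a lower bound on the difference:
#   sort&merge : 6N-SR + 8N-SW = 14N records moved sequentially
#   hash&memory: 2N-SR + 2N-SW =  4N records moved sequentially
my $sort_merge = 14 * $N * $secs_per_rec;
my $hash_mem   =  4 * $N * $secs_per_rec;

printf "sort&merge sequential I/O : ~%.0f s\n", $sort_merge;
printf "hash&memory sequential I/O: ~%.0f s\n", $hash_mem;
printf "difference                : ~%.0f s\n", $sort_merge - $hash_mem;

With those made-up numbers that is roughly 670 s versus 190 s of sequential transfer alone, before any of the seek and cache effects discussed above are counted.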
Replies are listed 'Best First'.
  Re^4: Working on huge (GB sized) files
    by salva (Canon) on May 13, 2011 at 17:10 UTC
      by BrowserUk (Patriarch) on May 13, 2011 at 19:06 UTC
  Re^4: Working on huge (GB sized) files
    by Marshall (Canon) on May 13, 2011 at 19:22 UTC
      by BrowserUk (Patriarch) on May 13, 2011 at 20:14 UTC
      by Marshall (Canon) on May 13, 2011 at 22:47 UTC
      by BrowserUk (Patriarch) on May 13, 2011 at 23:18 UTC