This ugly pattern can be easily eliminated at the application layer by reading the records in blocks instead of one by one.
It will alleviate the problem to some degree, but not eliminate it.
And not by as much as you might think. The fact is that even when the application layer reads individual records, the CRT (unless you take extraordinary steps to prevent it) reads blocks and buffers them in the process.
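A minimal sketch of that point in standard C stdio (the file name, the 128-byte record length and the 64KB buffer size are assumptions for illustration, not taken from the thread):

```c
/* Even though the application asks for one record per call, the CRT
 * fills a block-sized buffer with a single underlying read and then
 * satisfies subsequent per-record requests from that buffer. */
#include <stdio.h>
#include <stdlib.h>

#define RECLEN 128              /* assumed fixed record length */

int main(void)
{
    FILE *fh = fopen("records.dat", "rb");   /* assumed file name */
    if (!fh) { perror("records.dat"); return EXIT_FAILURE; }

    /* Make the CRT's buffering explicit: one 64KB buffer, so roughly
     * 512 per-record reads are serviced per actual OS read call. */
    setvbuf(fh, NULL, _IOFBF, 64 * 1024);

    char rec[RECLEN];
    while (fread(rec, RECLEN, 1, fh) == 1) {
        /* ... process one record ... */
    }

    fclose(fh);
    return EXIT_SUCCESS;
}
```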
But this has almost no effect upon the file system caching, which still caches the blocks it supplies to the CRT and will still try to apply heuristics to that caching. For example, if the block(s) numerically following the requested block happen to be contiguous on disk, the filesystem might choose to read ahead and cache those following blocks, assuming sequential access for the file.
But if those read-ahead blocks only contain records much further into the file than the current merge position, they can occupy cache slots that would have been better used caching blocks from other files, thereby causing more cache thrash and forcing more re-reading than would otherwise be the case.
For NTFS (and HPFS, and probably the more advanced *nix filesystems, provided they are being used), the application can specify creation/open flags like FILE_FLAG_NO_BUFFERING, FILE_FLAG_RANDOM_ACCESS, FILE_FLAG_SEQUENTIAL_SCAN, FILE_FLAG_WRITE_THROUGH, etc. as hints to the OS/filesystem drivers about how the application will use the file. But again, these only apply to a single file, not to a set of co-active files. So specifying (say) FILE_FLAG_SEQUENTIAL_SCAN for each of the temporary files in the merge stage of a sort may backfire, because it doesn't take into account that access will be random amongst those sequentially-read files. As such, the hints may override heuristics that would otherwise benefit the process.
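A minimal Win32 sketch of passing such a hint when opening one merge run (the file name "run.tmp" and the 64KB read size are assumptions for illustration):

```c
/* Open one temporary merge run with FILE_FLAG_SEQUENTIAL_SCAN as a
 * cache-manager hint. The hint is per-file: it says nothing about how
 * this file's accesses interleave with the other runs being merged. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE h = CreateFileW(L"run.tmp",
                           GENERIC_READ,
                           FILE_SHARE_READ,
                           NULL,                      /* default security  */
                           OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL |
                           FILE_FLAG_SEQUENTIAL_SCAN, /* read-ahead hint   */
                           NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFileW failed: %lu\n", GetLastError());
        return 1;
    }

    char  buf[64 * 1024];
    DWORD got = 0;
    /* The cache manager reads ahead aggressively because of the hint,
     * which helps this file in isolation but may evict cached blocks
     * belonging to the other runs in a multi-way merge. */
    while (ReadFile(h, buf, sizeof buf, &got, NULL) && got > 0) {
        /* ... feed records from buf into the merge ... */
    }

    CloseHandle(h);
    return 0;
}
```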
Whilst application and/or CRT-level block caching may benefit files that are read sequentially but accessed in a random interleaving slightly more than purely randomly-read files, that only affects 1/8th of the I/O operations involved in the sort&merge approach. And given that that approach requires nearly 3 times as many I/O operations as hash&memory, the overall impact of that slight benefit is minimal.
Whilst the block-reading of records will reduce the number of actual disk accesses for sort&merge from 16N IOPs to 2N IOPs (say, assuming an arbitrary 8 records per block), it will also reduce the actual reads for hash&memory from 6N IOPs to 0.75N IOPs. Better for both algorithms, but still a ratio of roughly 2.67:1 in favour of the latter.
(Still ignoring the 2 * P * O(N/P * log(N/P)) vs. O(N) difference.)
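For concreteness, a quick sketch of that blocked-IOP arithmetic (the 16N and 6N per-record counts and the 8-records-per-block figure are the ones used above):

```c
/* Effect of 8-records-per-block reading on the per-record IOP counts. */
#include <stdio.h>

int main(void)
{
    const double recs_per_block    = 8.0;   /* arbitrary blocking factor */
    const double sort_merge_per_rec  = 16.0; /* IOPs per record, unblocked */
    const double hash_memory_per_rec = 6.0;  /* IOPs per record, unblocked */

    double sm_blocked = sort_merge_per_rec  / recs_per_block;  /* 2N     */
    double hm_blocked = hash_memory_per_rec / recs_per_block;  /* 0.75N  */

    printf("sort&merge : %.2fN IOPs\n", sm_blocked);
    printf("hash&memory: %.2fN IOPs\n", hm_blocked);
    printf("ratio      : %.2f : 1\n", sm_blocked / hm_blocked); /* ~2.67 */

    return 0;
}
```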