Re^2: Configurable IO buffersize?
by BrowserUk (Patriarch) on Jul 31, 2011 at 11:20 UTC
Since 5.14, it's 8k and configurable when Perl is built.
For my current project I need to read from up to 100 files concurrently.
I've demonstrated that on Windows, when reading a single file, using 64k reads works out to be most efficient. I've also proved to myself that when processing input from multiple files concurrently (interleaved), using even bigger read sizes reduces the number of seeks between file positions and can give substantial gains.
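(The kind of measurement I mean is roughly the following -- an untested sketch rather than my actual test harness; the file name is a placeholder, and a fair test has to allow for the OS file cache between runs.)

use strict;
use warnings;
use Time::HiRes qw( time );

# Time raw sysread()s of the same large file at several block sizes.
my $file = 'big_test_file.dat';                # placeholder name
for my $size ( 4*1024, 64*1024, 1024*1024 ) {
    open my $fh, '<:raw', $file or die "open: $!";
    my $chunk;
    my $start = time;
    1 while sysread $fh, $chunk, $size;
    printf "%8d-byte reads: %.3f secs\n", $size, time() - $start;
}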
Compile-time configuration doesn't really cut it. Would you use a module that required you to re-build Perl?
You could surely use tie to make a read use sysread.
Indeed, I've been hand-coding sliding buffers, with adaptations for specific usages, for years, but I thought I saw mention of a module that would allow all the usual line-oriented usage of filehandles, whilst sysread/syswrite-ing chunks of configurable size from/to disk.
I can write one, but writing a fully-fledged, all-singing/dancing generic module takes a lot of time and thought. I'm surprised it doesn't already exist.
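To give a flavour of what I mean, here is a bare-bones sketch of such a tie class. It is untested, the name Tie::BigBuf is invented rather than an existing CPAN module, and the write side and most error handling are omitted:

package Tie::BigBuf;
use strict;
use warnings;

sub TIEHANDLE {
    my ( $class, $path, $bufsize ) = @_;
    open my $fh, '<:raw', $path or die "open '$path': $!";
    return bless {
        fh      => $fh,
        buf     => '',
        bufsize => $bufsize || 1024 * 1024,
    }, $class;
}

# Line-oriented interface on top of big sysread()s into a sliding buffer.
sub READLINE {
    my $self = shift;
    while ( index( $self->{buf}, "\n" ) < 0 ) {
        my $got = sysread $self->{fh}, $self->{buf}, $self->{bufsize},
                          length $self->{buf};
        defined $got or die "sysread: $!";
        last if $got == 0;                       # EOF: flush whatever is left
    }
    return undef unless length $self->{buf};
    my $pos = index $self->{buf}, "\n";
    my $end = $pos >= 0 ? $pos + 1 : length $self->{buf};
    return substr $self->{buf}, 0, $end, '';     # 4-arg substr: cut the line off the front
}

sub CLOSE { close $_[0]{fh} }

package main;

tie *BIG, 'Tie::BigBuf', 'some_huge_file.txt', 4 * 1024 * 1024;   # 4MB reads
while ( defined( my $line = <BIG> ) ) {
    # ... usual line-by-line processing ...
}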
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Have you ever thought of using memory mapped files ...
For other things yes, but not for this. I'm not sure how that would work for this application, especially as the files are text files and are read/written in terms of lines.
The scenario is that (at different stages in the application) I'm interleaving either reads from, or writes to, many files. This has the effect of causing the disk head to dance all over the disk whenever a 4k buffer is used up. If I can buffer (say) 1MB at a time, that will cut the number of disk seeks to 1/256th (1MB / 4k = 256), which should have a significant effect on throughput.
If I use memory mapping I have to bypass the OS/CRT/PERLIO line handling, and by moving that into Perl I think I would lose more than I might gain.
Configurable CRT buffer sizes avoid that problem. Maybe the right solution is to write a module that gives back direct access to the CRT stdio on all platforms.
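For the stdio route, IO::Handle does document a setvbuf() wrapper around the C call, but it is only usable when Perl's IO really sits on stdio rather than the default perlio layers -- which is rather the point of the complaint. A guarded sketch (the file name is a placeholder):

use strict;
use warnings;
use IO::Handle '_IOFBF';

open my $fh, '<', 'some_big_file.txt' or die "open: $!";

# setvbuf() is only compiled in when the underlying stdio supports it,
# and must be called before any IO on the handle, so guard the attempt.
eval {
    my $buffer = "\0" x ( 1024 * 1024 );         # 1MB stdio buffer
    $fh->setvbuf( $buffer, _IOFBF, length $buffer );
};
warn "setvbuf not usable on this build: $@" if $@;

while ( defined( my $line = <$fh> ) ) {
    # normal line-oriented processing
}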
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Would you use a module that required you to re-build Perl?
You don't need a module that requires you to re-build Perl; you want to tweak Perl so it works better for your special needs, and it's actually very common to tweak production systems by rebuilding components instead of using out-of-the-box settings.
Note: I'm not defending the lack of ability to set this more conveniently.
you want to tweak Perl so it works better for your special needs,
Reading and writing multiple files concurrently is hardly "special needs", but conversely, it is hardly likely to constitute the only usage pattern at any given installation.
You'd have to pessimise one mode of operation in order to optimise the other, and that's just not viable.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
I don't know if this approach is applicable to your situation or not, but it sounds like performance is important enough that a lot of hassle might be OK. If so, I would try adjusting things so that the file system always reads a minimum of 64KB, no matter what.
The way to do this is by adjusting what Microsoft calls the cluster size (what other vocabularies call the extent size). This is the smallest unit of storage that NTFS will read/write on the disk, and it will be contiguous. Doing this requires that you make a special logical drive and format it using the /A: option of the FORMAT command:
FORMAT <drive>: /FS:NTFS /A:<clustersize>
where <clustersize> = 65536, which is the maximum size.
So this drive is used like any other, except that every file on it will take a minimum of 64K of space on the disk (even a 1 byte file).
I have not benchmarked this on Windows NTFS, but I have on other OS/ file systems. I predict significant performance gains.
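If you go that route, you can verify the result from Perl by parsing fsutil's NTFS report -- untested here, the drive letter is a placeholder, and fsutil generally wants an elevated prompt:

use strict;
use warnings;

# Pull the "Bytes Per Cluster" line out of fsutil's NTFS info report.
my $drive = 'X:';                                   # placeholder volume
my $info  = qx{fsutil fsinfo ntfsinfo $drive};
if ( my ($cluster) = $info =~ /Bytes\s+Per\s+Cluster\s*:\s*(\d+)/i ) {
    print "Cluster size on $drive is $cluster bytes\n";
}
else {
    warn "Couldn't read cluster size for $drive (run elevated?)\n";
}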
The way to do this is by adjusting ... the cluster size,
Whilst this approach might actually benefit my application to some extent, it would -- even more so than re-compiling perl to use bigger buffers -- be an extremely heavy-handed way of achieving those gains.
Reconfiguring the file system to benefit one application, without considering the effects of that change on everything else, would be a very drastic step. For example, the OS uses 4k pages of virtual memory and backs those virtual pages with clusters of memory-mapped physical disk (all executable files are loaded using memory-mapped IO). What would be the effect of having 4k virtual pages backed by 64k disk clusters?
But in any case, the scenario I describe is not so unusual, nor something unique to my machine. Think of every time you use an external sort program on a huge file. These work by first reading the source file sequentially and writing partially sorted chunks to temporary files, then merging those temporary files. In the second stage they are interleaving reads from multiple source files and writing to the output file. The exact scenario I described above.
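Stripped to its bones, that merge phase looks something like the sketch below -- illustrative only: the run file names are made up, and a real implementation would use a heap instead of re-sorting the head lines on every pass:

use strict;
use warnings;

# N pre-sorted temporary runs, read line-by-line and interleaved into
# one output file -- exactly the many-handles-at-once pattern above.
my @runs = glob 'run_*.tmp';                        # hypothetical sorted chunks
my @fhs  = map { open my $fh, '<', $_ or die "$_: $!"; $fh } @runs;
my @head = map { scalar readline $_ } @fhs;         # current front line of each run

open my $out, '>', 'merged.txt' or die "merged.txt: $!";
while ( grep { defined } @head ) {
    # Pick the run whose current line sorts first.
    my ($min) = sort { $head[$a] cmp $head[$b] }
                grep { defined $head[$_] } 0 .. $#head;
    print {$out} $head[$min];
    $head[$min] = readline $fhs[$min];              # refill from that run
}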
Perl's inability to set the buffer size used for buffered IO on a file-by-file basis is a real and distinct limitation.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.