johnmyleswhite has asked for the wisdom of the Perl Monks concerning the following question:

Lately I've been using PDF::API2 within a CGI script that handles the uploading and processing of PDF files. Because my program ran so slowly on the antiquated box I have at work, I ran the script under Devel::Profiler. After a single run, I found that the vast majority of the program's running time was spent making call after call to IO::Handle's read.

After looking at the source for PDF::API2, I found that each read call was pulling in only 512 characters at a time. I made an alternative version that pulled in 65536 characters instead and found that my program ran more than four times as fast.
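
For anyone who wants to reproduce the comparison, a rough benchmark along these lines (the file name and iteration count are placeholders) shows the effect of the read length:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(timethese);

    # Illustrative only: time slurping the same file with a small
    # and a large read LENGTH.
    my $file = 'sample.pdf';    # any reasonably large file

    sub slurp_with {
        my ($len) = @_;
        open my $fh, '<', $file or die "open $file: $!";
        binmode $fh;
        my $buf;
        1 while read($fh, $buf, $len);
        close $fh;
    }

    timethese(100, {
        'len_512'   => sub { slurp_with(512) },
        'len_65536' => sub { slurp_with(65536) },
    });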

With that observation in mind, I wrote to the module's author, suggesting that the LENGTH used by read calls be an option that could be set by the user of the module. He responded by noting that 512 bytes is the POSIX limit on file blocks and that, therefore, a larger size cannot be guaranteed to work on all systems.

I'm interested in knowing which kinds of systems this could actually be a problem on, because almost every Perl program I've seen lately contains read calls that fetch more than 512 characters at a time. My understanding is that Perl's read calls are translated into as many stdio fread(3) calls as necessary. Is that not true? And will my program mysteriously fail on a modern platform if I take the risk of using the larger length?
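
Whatever the answer turns out to be, a defensive loop costs little: read returns the number of bytes it actually delivered, so code that honours the return value stays correct even if some platform were to satisfy a large request only partially. A minimal sketch (the file name is a placeholder):

    use strict;
    use warnings;

    open my $fh, '<', 'input.pdf' or die "open: $!";
    binmode $fh;

    my $data = '';
    while (1) {
        my $got = read($fh, my $chunk, 65536);
        die "read failed: $!" unless defined $got;
        last if $got == 0;    # EOF
        $data .= $chunk;
    }
    close $fh;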

Thanks for whatever details anyone has to offer about read lengths.

-- John

Replies are listed 'Best First'.
Re: Portability of Large read Lengths
by BrowserUk (Patriarch) on May 26, 2008 at 14:59 UTC

    Tell him he's talking out of his ... misunderstanding. 512 is a minimum, not a maximum. In any case, the possibility that an option might not be usable with anything other than the default on some obscure, ancient system somewhere is no reason not to provide the option for use on the majority of systems where it would be beneficial.
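
    To make the suggestion concrete, one way a module could expose such an option (a hypothetical interface for illustration, not PDF::API2's actual API) is a constructor argument with a conservative default:

        package My::PDF::Reader;    # hypothetical, for illustration only
        use strict;
        use warnings;

        # Keep the conservative 512 as the default, but let callers
        # raise it where their platform benefits.
        my $DEFAULT_READ_LENGTH = 512;

        sub new {
            my ($class, %opts) = @_;
            my $self = {
                read_length => $opts{read_length} || $DEFAULT_READ_LENGTH,
            };
            return bless $self, $class;
        }

        sub read_chunk {
            my ($self, $fh) = @_;
            read($fh, my $buf, $self->{read_length});
            return $buf;
        }

        1;

    A caller who has measured the win could then say My::PDF::Reader->new(read_length => 65536), and everyone else stays on the safe default.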


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Portability of Large read Lengths
by moritz (Cardinal) on May 26, 2008 at 14:53 UTC

      Thanks to everyone for their comments. The third document you mentioned, moritz, (the one from pwet.fr) actually has some points about the limitations of read when a large number of bytes is requested:

      If the value of nbyte is greater than {SSIZE_MAX}, the result is implementation-defined.

      The use of I/O with large byte counts has always presented problems. Ideas such as lread() and lwrite() (using and returning longs) were considered at one time. The current solution is to use abstract types on the ISO C standard function to read() and write(). The abstract types can be declared so that existing functions work, but can also be declared so that larger types can be represented in future implementations. It is presumed that whatever constraints limit the maximum range of size_t also limit portable I/O requests to the same range. This volume of IEEE Std 1003.1-2001 also limits the range further by requiring that the byte count be limited so that a signed return value remains meaningful. Since the return type is also a (signed) abstract type, the byte count can be defined by the implementation to be larger than an int can hold.

      My hope is that this limit is actually quite large on most platforms.

        SSIZE_MAX is defined as the maximum value for the data type ssize_t. So 32k-1 is the smallest it will ever be (since ssize_t is guaranteed to be at least 16 bits).

        My experience is that using a read buffer of less than 4kB is rarely a good idea. 16kB would make a good default and present no portability problems. Allowing users to specify a much larger read length would also be wise.

        A read length of 512 on most systems means that you are reading in less than one "block" of data, so special buffering needs to be done, which slows down I/O.

        - tye        
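
        To see the actual ceiling on a given platform, the POSIX module exposes the limits.h constants on most perls, so a quick check is possible (a sketch; SSIZE_MAX may be unimplemented on some builds):

            use strict;
            use warnings;
            use POSIX ();

            # Print the largest byte count a single read(2) is
            # guaranteed to accept on this system.
            print "SSIZE_MAX here: ", POSIX::SSIZE_MAX(), "\n";

        On a typical 32-bit platform this prints 2147483647, comfortably beyond any sensible buffer size.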

Re: Portability of Large read Lengths
by Corion (Patriarch) on May 26, 2008 at 14:59 UTC

    Personally, I imagine that Perl's implementation of read shields us from the vagaries of the underlying system, whatever buffer length you ask for. Maybe POSIX's definition of read has such length limitations, but if so, I would imagine they apply only to sysread. So making the value of 512 at least configurable seems prudent.
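
    The distinction is easy to demonstrate (a sketch; the file name is a placeholder). Buffered read goes through PerlIO, which issues as many OS-level reads as it needs, while sysread maps more or less directly onto the read(2) system call, so that is where any per-call limit would bite:

        use strict;
        use warnings;

        my $file = 'input.pdf';    # illustrative

        # Buffered read(): PerlIO keeps reading until the request is
        # satisfied or EOF is reached.
        open my $fh, '<', $file or die "open: $!";
        binmode $fh;
        my $got = read($fh, my $buf, 65536);
        close $fh;

        # sysread(): a thin wrapper over read(2); it may legitimately
        # return fewer bytes than requested, so callers must loop.
        open my $sysfh, '<', $file or die "open: $!";
        binmode $sysfh;
        my $sys_got = sysread($sysfh, my $sysbuf, 65536);
        close $sysfh;

        printf "read() delivered %d bytes, sysread() delivered %d\n",
            $got || 0, $sys_got || 0;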

Re: Portability of Large read Lengths
by ysth (Canon) on May 27, 2008 at 03:22 UTC