isync has asked for the wisdom of the Perl Monks concerning the following question:

Hey Monks,

I need your advice on how to do a read-ahead buffer over the network right.
My situation is this: a local script allows local apps to read a file which appears to the script (and the app) to be local, but in fact the file resides on a remote server and is potentially quite big. It's effectively a virtual file system setup.

Now reading begins, and depending on the application doing the read(), it asks for chunks of 1, 46, 1024, or however many bytes of data - efficient when done on local disk, inefficient over the network.

Each of these requests in an uncached world would mean one hit for a probably tiny chunk of data on the remote server side, resulting in quick and useless concurrent hits. Thus, I've put a simple caching mechanism into the loop: whenever a local read is done, it caches ahead 14 more chunks of the asked-for size. For example, an app which does tiny reads asks for a first chunk of data, 16 bytes. My cache then multiplies this by 15 and sends just one request over the network for 16 bytes * 15. Upon arrival it delivers the asked-for chunk and caches the remaining 14 chunks, as the app will quite likely ask for them after it has consumed the previous chunk. (Of course, limiting the number of read-ahead slots at EOF etc.)
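For illustration, a minimal sketch of that read-ahead loop (remote_fetch() is a hypothetical stand-in for the actual network request; all names are made up):

    use strict;
    use warnings;

    my $READ_AHEAD = 15;    # the requested chunk plus 14 cached ahead
    my %cache;              # offset => pre-fetched data

    sub cached_read {
        my ($offset, $length) = @_;

        # Serve the chunk from the cache if a previous read already fetched it.
        return substr $cache{$offset}, 0, $length
            if exists $cache{$offset} && length($cache{$offset}) >= $length;

        # One network round trip for 15 chunks instead of 15 round trips.
        my $data = remote_fetch($offset, $length * $READ_AHEAD);

        # Hand back the first chunk, cache the remaining 14 keyed by offset.
        for my $i (1 .. $READ_AHEAD - 1) {
            last if $i * $length >= length $data;               # stop at EOF
            $cache{ $offset + $i * $length } = substr $data, $i * $length, $length;
        }
        return substr $data, 0, $length;
    }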

This is as good as it gets without additional work. The problem is that this approach is only effective if an app asks for reasonably sized chunks; it doesn't help much if it asks for 1-byte chunks!

Imagined version 2
The next iteration of the solution would base the read-ahead buffer size on the file size and a cap limit, so the scheme would read chunks of up to, let's say, 64,000 bytes on larger files, or read the whole file in one go if it is smaller, and then cache this locally.
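A minimal sketch of that sizing rule (the 64,000-byte cap is just the example figure, and the file size is assumed to be known from the remote side):

    my $CAP = 64_000;

    sub read_ahead_size {
        my ($file_size, $offset) = @_;
        return $file_size if $file_size <= $CAP;        # small file: fetch it whole
        my $remaining = $file_size - $offset;
        return $remaining < $CAP ? $remaining : $CAP;   # big file: capped chunks
    }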

The problem in this solution is that the local script needs to be able to read() and seek() in a data structure that is possibly still in the process of being filled (for example, if a video player skips to a later section in the video). A seek() might land in a portion of the file which is already there (probably the first few bytes), yet it might also move to another section of data which isn't there yet and should then be delivered next, etc.
A bit like a canister being filled while someone is tapping it on the bottom.
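One conceivable way to cope with that, sketched here only to make the problem concrete: keep track of which byte ranges have already arrived and only go to the network for ranges that are still missing (fetch_range() is a hypothetical network call):

    use strict;
    use warnings;

    my @have;               # [start, end) ranges already present locally
    my $local_copy = '';    # sparse local copy of the remote file

    sub range_is_cached {
        my ($start, $end) = @_;
        for my $r (@have) {
            return 1 if $start >= $r->[0] && $end <= $r->[1];
        }
        return 0;
    }

    sub read_at {
        my ($offset, $length) = @_;
        unless (range_is_cached($offset, $offset + $length)) {
            my $data = fetch_range($offset, $length);   # hypothetical network call
            # Pad with NULs so the write offset exists, then splice the data in.
            $local_copy .= "\0" x ($offset - length $local_copy)
                if $offset > length $local_copy;
            substr($local_copy, $offset, length $data) = $data;
            push @have, [ $offset, $offset + length $data ];
        }
        return substr $local_copy, $offset, $length;
    }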

Quiz 1:
Would it be a good idea to start the file-fill reader in a different thread, so the local cached copy gets filled asynchronously? (I get a headache...)

Quiz 2:
Would IO::Mark or IO::Stream or IO::File::Cached be of any help here? (I still can't get my head around them..)

Any help, input, advice, code bits welcome!

Replies are listed 'Best First'.
Re: Advice needed on an interesting read-ahead over network file IO problem
by ikegami (Patriarch) on Mar 16, 2010 at 19:45 UTC

    Each of these requests in an uncached world would mean one hit for a probably tiny chunk of data on the remote server side, resulting in quick and useless concurrent hits. Thus, I've put a simple caching mechanism into the loop

    But you're not in an uncached world. Perl already does buffering. It's a small buffer (4k), so you might still want to do your own buffering (although you should probably use sysread if you do).

    Now reading begins, and depending on the application doing the read() it asks for chunks of 1, 46, 1024, whatever bytes of data - efficient when done on local disk, inefficient over network.

    No, read returns data from the handle's buffer. If the buffer doesn't hold enough data, read requests 4k chunks until it has the requested amount of data or an error occurs.

    This differs from sysread. sysread ignores what's in the buffer, requests the number of bytes you specified from the OS, and returns whatever the OS returned (which may be less than the requested amount even if no error occurs).
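    A tiny illustration of the difference, reading the running script itself just so the example is self-contained:

        use strict;
        use warnings;

        # Buffered read: goes through PerlIO's ~4k buffer.
        open my $buffered_fh, '<:raw', $0 or die $!;
        read $buffered_fh, my $buf, 10;

        # sysread bypasses the buffer and asks the OS directly; it may
        # return fewer bytes than requested even without an error.
        # (Don't mix read and sysread on the same handle.)
        open my $raw_fh, '<:raw', $0 or die $!;
        my $got = sysread $raw_fh, my $raw, 10;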

    The problem in this solution is that the local script needs to be able to read() and seek() in a data structure

    Perl does read-ahead buffering. It won't help if you seek around. (Maybe if you seek a little forward, but that's it.) You still need a solution for that.

    Update: Added last para.

      Mmh. Don't know what to make of this. Actually my script does not use a read(); it does something more like a sysread, as the read() function is abstracted, roughly like this pseudo code: $data = $lwp->get->content( offset => xyz, length => abc );

      The server side does a sysread().
        I can't comment on an abstraction I haven't seen, but LWP uses sysread for HTTP URLs.
        If your communication runs on top of HTTP, ensure that you are not creating a new TCP connection for every request but using a persistent one.
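        For illustration, a minimal sketch of a persistent connection combined with a ranged GET in LWP (the URL is made up, and the server has to honour Range requests):

            use strict;
            use warnings;
            use LWP::UserAgent;

            # keep_alive => 1 reuses one TCP connection instead of
            # opening a new one per request.
            my $ua = LWP::UserAgent->new( keep_alive => 1 );

            my ($offset, $length) = (0, 64_000);
            my $res = $ua->get(
                'http://example.com/big.file',
                Range => sprintf('bytes=%d-%d', $offset, $offset + $length - 1),
            );
            die $res->status_line unless $res->is_success;   # expect 206 Partial Content
            my $chunk = $res->content;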

        Regarding read-ahead, the usual practice is to set a minimum chunk size, say 64k. Any read below this size reads 64k, returns the required length, and caches the remaining data for future requests.

        The optimal minimum chunk size to use depends on the network characteristics (bandwidth and latency). Just experiment to find it out!
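        As a sketch of that rule, every network request is simply rounded up to the minimum, whatever the caller asked for (the surplus then goes into the local cache as described above):

            my $MIN = 64 * 1024;    # tune this by experiment

            sub network_request_size {
                my ($requested) = @_;
                return $requested > $MIN ? $requested : $MIN;
            }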

Re: Advice needed on an interesting read-ahead over network file IO problem
by NetWallah (Canon) on Mar 16, 2010 at 19:36 UTC
    Something as simple as Memoize could get you amazing performance benefits using minimal programming cycles.
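    A minimal sketch of that idea, memoizing a hypothetical chunk fetcher so repeat reads of the same (offset, length) pair never hit the network twice (note this alone does not read ahead):

        use strict;
        use warnings;
        use Memoize;

        sub fetch_chunk {
            my ($offset, $length) = @_;
            # ... real network request would go here; placeholder for the sketch ...
            return "\0" x $length;
        }

        memoize('fetch_chunk');             # cache results keyed on the arguments

        my $first = fetch_chunk(0, 1024);   # hits the network
        my $again = fetch_chunk(0, 1024);   # served from Memoize's cache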

        Theory is when you know something, but it doesn't work.
        Practice is when something works, but you don't know why it works.
        Programmers combine Theory and Practice: Nothing works and they don't know why.         -Anonymous