in reply to Advice needed on an interesting read-ahead over network file IO problem

Each of these requests in an uncached world would mean one hit for a probably tiny chunk of data on the remote server side, resulting in quick and useless concurrent hits. Thus, I've put a simple caching mechanism into the loop

But you're not in an uncached world. Perl already does buffering. It's a small buffer (4k), so you might still want to do your own buffering (although you should probably use sysread if you do).

Now reading begins, and depending on the application doing the read() it asks for chunks of 1, 46, 1024, whatever bytes of data - efficient when done on local disk, inefficient over network.

No, read returns data from the handle's buffer. If the buffer doesn't hold enough data, read requests 4k chunks until it has the requested amount of data, reaches end of file, or an error occurs.

This differs from sysread. sysread ignores what's in the buffer, requests the amount of bytes you specified from the OS, and returns whatever the OS returned (which may be less than the requested amount even if no error occurs).
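
To make the difference concrete, here is a minimal sketch in plain Perl (the filename is only an example):

    use strict;
    use warnings;

    # read() goes through PerlIO's buffer: it keeps pulling chunks from the
    # OS until it can hand back the 100 bytes asked for (or hits EOF or an
    # error).
    open my $fh, '<:raw', 'data.bin' or die "open: $!";
    my $got = read $fh, my $buffered, 100;
    defined $got or die "read: $!";

    # sysread() bypasses the buffer: it issues a single read to the OS and
    # returns whatever that call produced, possibly fewer than 100 bytes
    # even when no error occurred.
    open my $raw, '<:raw', 'data.bin' or die "open: $!";
    my $n = sysread $raw, my $unbuffered, 100;
    defined $n or die "sysread: $!";

    print "read() gave $got bytes, sysread() gave $n bytes\n";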

The problem in this solution is that the local script needs to be able to read() and seek() in a data structure

Perl does read-ahead buffering. It won't help if you seek around. (Maybe if you seek a little forward, but that's it.) You still need a solution for that.

Update: Added last para.

Re^2: Advice needed on an interesting read-ahead over network file IO problem
by isync (Hermit) on Mar 16, 2010 at 20:01 UTC
    Mmh. Don't know what to make of this. Actually my script does not use a read(); it does something more like a sysread, since the read() is abstracted away, roughly like this pseudo code: $data = $lwp->get->content( offset=> xyz, length => abc);

    The server side does a sysread().
      I can't comment on an abstraction I haven't seen, but LWP uses sysread for HTTP URLs.
      If your communication runs on top of HTTP, ensure that you are not creating a new TCP connection for every request but using a persistent one.
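
      Something along these lines, assuming the remote side is plain HTTP and honours Range requests (URL and byte ranges are placeholders):

        use strict;
        use warnings;
        use LWP::UserAgent;

        # keep_alive makes the user agent cache the connection (via
        # LWP::ConnCache), so consecutive requests to the same host reuse
        # one TCP connection instead of opening a new one each time.
        my $ua  = LWP::UserAgent->new(keep_alive => 1);
        my $url = 'http://example.com/bigfile.bin';

        # Two byte-range requests travelling over the same connection.
        for my $range ('bytes=0-65535', 'bytes=65536-131071') {
            my $res = $ua->get($url, Range => $range);
            die $res->status_line unless $res->is_success;
            printf "%s -> %d bytes\n", $range, length $res->content;
        }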

      Regarding read-ahead, the usual practice is to set a minimum chunk size, say 64k. Any read below this size fetches a full 64k chunk, returns the requested length, and caches the remainder for future requests.
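
      A rough sketch of that idea, where fetch_from_server($offset, $length) is a hypothetical stand-in for whatever call actually hits the remote side:

        use strict;
        use warnings;

        my $CHUNK = 64 * 1024;   # minimum chunk size; tune for your network
        my %cache;               # chunk start offset => chunk data

        sub cached_read {
            my ($offset, $length) = @_;
            my $data = '';
            while ($length > 0) {
                # Round the offset down to the start of its 64k chunk.
                my $chunk_start = $offset - ($offset % $CHUNK);
                # fetch_from_server() is hypothetical; plug in your remote call.
                $cache{$chunk_start} //= fetch_from_server($chunk_start, $CHUNK);
                my $within = $offset - $chunk_start;
                last if $within >= length $cache{$chunk_start};   # past end of file
                my $piece = substr $cache{$chunk_start}, $within, $length;
                $data   .= $piece;
                $offset += length $piece;
                $length -= length $piece;
            }
            return $data;
        }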

      The optimal minimum chunk size to use depends on the network characteristics (bandwidth and latency). Just experiment to find it out!