in reply to Re^5: Algorithm advice sought for seaching through GB's of text (email) files
in thread Algorithm advice sought for seaching through GB's of text (email) files

Due to various sources of latency, one process generally comes nowhere close to maxing out I/O

Hmm. See threads and multiple filehandles for the most recent example where this has been shown not to be the case.
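
For anyone who wants to experiment, here's a minimal sketch of that approach--N threads, each with its own filehandle, reading interleaved chunks of one large file. (This is not the code from that thread; the file name, chunk size and thread count are invented.) Time it against a single sequential reader on your own hardware:

    #!/usr/bin/perl
    # Minimal sketch: N threads, each with its own filehandle, reading
    # interleaved chunks of one large file. Requires a threaded perl.
    use strict;
    use warnings;
    use threads;
    use Time::HiRes qw( time );

    my $file     = 'bigfile.dat';    ## hypothetical test file
    my $chunk    = 1 << 20;          ## 1MB per read
    my $nthreads = 4;

    my $start = time;
    my @workers = map {
        my $id = $_;
        threads->create( sub {
            open my $fh, '<:raw', $file or die "open '$file': $!";
            my $buf;
            ## thread $id reads chunks $id, $id+N, $id+2N, ...
            for( my $pos = $id * $chunk; ; $pos += $nthreads * $chunk ) {
                seek $fh, $pos, 0;
                read( $fh, $buf, $chunk ) or last;    ## 0 bytes == EOF
            }
            close $fh;
        } );
    } 0 .. $nthreads - 1;
    $_->join for @workers;
    printf "%d threads: %.3f seconds\n", $nthreads, time() - $start;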


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^7: Algorithm advice sought for seaching through GB's of text (email) files
by tilly (Archbishop) on Sep 25, 2006 at 05:04 UTC
    Huh. I have never personally seen anything like that happen, and a few years ago I was doing a lot of I/O-bound processing. The things that might have been different: I was working with lots of small files and generally had to be both reading and writing; also, this was a few years ago, so the hardware and operating systems were different.

    I don't know what made the difference, but apparently it is possible for one process to max out I/O. However, I still haven't personally seen it happen.

      Sorry to labour the point, but having cast around for references, I can't see how parallelizing an IO-bound process could ever reap performance benefits on "commodity hardware"--defined for the purposes of discussion as a single-CPU system with a single hard drive.

      If one process is IO-bound then, by definition, it spends most of its time waiting for the OS kernel to complete its IO requests. I.e. the route through the kernel IO routines, device driver and disk drive hardware is the limiting factor.

      If you start a second copy of the process, it will spend its time waiting for its own IO requests to complete, but it will also have to wait for the IO chain to complete the first process's IO requests before it gets around to servicing those from the second process.

      The bottleneck, wherever in the IO chain it falls, will not suddenly widen because a second process starts making requests. The only way that could happen is if the kernel held some percentage of its potential throughput 'in reserve' for second and subsequent processes.

      That's not a completely ridiculous idea. Certainly some network protocols do something akin to this. SNA, for example, would resist allocating all the bandwidth of any given point-to-point link to a single end-to-end connection, and would attempt to always hold some bandwidth in reserve for low-volume, high-priority traffic. Years ago, I remember breaking 1.4MB diskette images into smaller chunks for transmission across SNA networks, as smaller transfers were always given priority over larger ones. But I've never heard of, nor found reference to, any device driver or bus management system that does anything similar.
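
      For anyone inclined to test the assertion, here's a rough sketch (the file name is invented; in a real test you'd also flush the filesystem cache between runs, eg. via drop_caches on Linux) that times one process reading a file end-to-end against two forked processes each reading half:

          #!/usr/bin/perl
          # Rough sketch: does forking a second reader speed up an
          # IO-bound job on a single disk? Flush the cache between
          # runs for a fair comparison.
          use strict;
          use warnings;
          use Time::HiRes qw( time );

          my $file = 'bigfile.dat';            ## hypothetical test file
          my $size = -s $file or die "No $file";

          sub slurpRange {
              my( $from, $len ) = @_;
              open my $fh, '<:raw', $file or die "open: $!";
              seek $fh, $from, 0;
              my( $buf, $done ) = ( '', 0 );
              while( $done < $len ) {
                  my $got = read( $fh, $buf, 1 << 20 ) or last;
                  $done += $got;
              }
              close $fh;
          }

          my $t0 = time;
          slurpRange( 0, $size );
          printf "1 process  : %.3f seconds\n", time() - $t0;

          $t0 = time;
          for my $half ( 0, 1 ) {
              defined( my $pid = fork ) or die "fork: $!";
              next if $pid;                    ## parent continues the loop
              slurpRange( $half * int( $size / 2 ), int( $size / 2 ) );
              exit 0;
          }
          wait; wait;                          ## reap both children
          printf "2 processes: %.3f seconds\n", time() - $t0;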


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        I've already told you that I have personally seen it happen.

        Are you implying that I am lying? That I could not have seen what I already said I did?

        FYI, just before I went to bed last night I decided to google this exact subject, and I ran across http://kerneltrap.org/node/452. The point made there about kernels of the same vintage as the ones I was working with was that "the current readahead model is per-inode", which is very little help with lots of small files, especially if they are fragmented or out of order.

        This is extremely relevant because I was working with small files of exactly that kind. If readahead works for you, then latencies do not matter, because the next piece of data you want has always been fetched before you asked for it. But if readahead does not work for you, then you wind up waiting out the round-trip time between sending a request to the disk and the disk responding.
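
        To make that concrete, here's a small sketch (the directory name is invented) of the readahead-unfriendly case: slurping a pile of small files in random order, so per-inode readahead can't help and each file costs at least one disk round trip:

            #!/usr/bin/perl
            # Sketch of the readahead-unfriendly case: many small files,
            # read in random order, each paying its own seek latency.
            use strict;
            use warnings;
            use Time::HiRes qw( time );
            use List::Util  qw( shuffle );

            my @files = glob 'maildir/*';   ## hypothetical pile of small files
            my $t0    = time;
            my $bytes = 0;
            for my $f ( shuffle @files ) {  ## random order defeats readahead
                open my $fh, '<:raw', $f or next;
                local $/;                   ## slurp mode
                my $data = <$fh>;
                $bytes += length $data if defined $data;
            }
            printf "%d files, %d bytes: %.3f seconds\n",
                scalar @files, $bytes, time() - $t0;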

        So your comments about throughput are based on an assumption of a situation where the OS is able to anticipate your future requests before you make them, while my comments about latencies are based on an assumption that it is not. Your assumption is more likely to be true today. Mine certainly was true for me in the situation where I last cared to benchmark this.