in reply to Re^4: Algorithm advice sought for seaching through GB's of text (email) files
in thread Algorithm advice sought for seaching through GB's of text (email) files

At some point you are right, contention becomes an issue. However I have found as a practical matter on commodity hardware that that point isn't hit until you have 5-10 parallel processes reading.

Of course I last looked into this a while ago. The figure is undoubtably different today. Benchmark for your purposes.

But my point remains. I/O bound jobs are excellent candidates for parallelization. Due to various sources of latency, one process generally comes nowhere close to maxing out I/O. If you don't have to worry about getting in other people's way, it almost always makes sense to have multiple parallel processes.

  • Comment on Re^5: Algorithm advice sought for seaching through GB's of text (email) files

Replies are listed 'Best First'.
Re^6: Algorithm advice sought for seaching through GB's of text (email) files
by BrowserUk (Patriarch) on Sep 24, 2006 at 21:18 UTC
    Due to various sources of latency, one process generally comes nowhere close to maxing out I/O

    Hmm. See threads and multiple filehandles for the most recent example were this has been shown to not be the case.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Huh. I have never personally seen anything like that happen, and a few years ago I was doing a lot of I/O bound processing. The things that I can think of that might have been different are that I was working with lots of small files and generally had to be both reading and writing, plus this was a few years ago so there was different hardware and different operating systems.

      I don't know what made the difference, but apparently it is possible for one process to max out I/O. However I still haven't personally seen it happen.

        Sorry to labour the point, but having cast around for references I can't see how parallelizing an IO-bound process could ever reap performance benefits on "commodity hardware"--defined for the purposes of discussion as a single cpu system with a single harddrive.

        If one process is IO-bound, then by definition, it spends most of it's time waiting for the OS kernel to complete it's IO requests. Ie. The route through the kernel IO routines, device driver and disk drive hardware is the limiting factor.

        If you start a second copy of the process, then it will spend it's time waiting for it's IO requests to complete, but it will also have to wait for the IO chain to complete the first processes IO requests before it gets around to attempting to service those from the second process.

        The bottleneck, wherever in the IO chain it falls, will not suddenly widen because a second process starts making requests. The only way that could happen is if the kernel held some percentage of it's potential throughput 'in reserve' for second and subsequent processes.

        That's not a completely rediculous idea. Certainly some network protocols do something akin to this. SNA for example would resist allocating all the bandwidth of any given point to point link to a single end-to-end connection and would attempt to always hold some bandwidth in reserve for low-volume high priority traffic. Years ago, I remember breaking 1.4MB diskette images into smaller chunks for transmission across SNA networks as smaller transfers were always given priority over larger ones. But I've never heard, nor found reference to any dd or bus management system that does anything similar.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.