in reply to Re^2: OT: Locking and syncing mechanisms available on *nix.
in thread OT: Locking and syncing mechanisms available on *nix.

Well, I can't speak for Windows (which, I gather from your posts over time, you are an expert in), but for Linux I believe you will find that manipulating the scheduler costs more overhead than whatever signalling you choose. If you really only want a thread to get to a certain point and then wait until it is told to continue, you can use suspend/continue. The first thread can call pthread_suspend on itself, which will take it off the run queue (so it no longer takes any CPU). Then the second thread can use pthread_continue to 'signal' it to continue. Your only overhead will be the manipulation of the scheduler's queues, which has to happen regardless of your mechanism.
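
A minimal sketch of that park/unpark idea. Note that glibc does not actually provide pthread_suspend/pthread_continue, so on Linux the usual way to get the same effect is a mutex plus condition variable; all names below are illustrative:

    /* Park/unpark sketch using standard pthreads primitives. */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t gate_mtx  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  gate_cond = PTHREAD_COND_INITIALIZER;
    static bool            may_run   = false;

    /* First thread: take itself off the run queue until told to continue. */
    void park_self(void)
    {
        pthread_mutex_lock(&gate_mtx);
        while (!may_run)                      /* guard against spurious wakeups */
            pthread_cond_wait(&gate_cond, &gate_mtx);
        may_run = false;                      /* consume the permit */
        pthread_mutex_unlock(&gate_mtx);
    }

    /* Second thread: 'signal' the parked thread to continue. */
    void unpark_other(void)
    {
        pthread_mutex_lock(&gate_mtx);
        may_run = true;
        pthread_cond_signal(&gate_cond);
        pthread_mutex_unlock(&gate_mtx);
    }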

fnord


Replies are listed 'Best First'.
Re^4: OT: Locking and syncing mechanisms available on *nix.
by BrowserUk (Patriarch) on Mar 28, 2011 at 08:20 UTC

    Usage scenario. Most of the time, the producer will be running on one core and the consumer on another, and they will be producing and consuming from their respective ends of the shared memory structure as fast as they can go. No locking; no synching; no (elective) context switching.

    Occasionally, one end or the other will get preempted by some higher-priority thread. At this point, the shared data structure will become either full or empty, depending upon which end is still running. That end then needs to enter a wait state until the other end gets another timeslice, does its thing to relieve the empty or full state, and wakes the waiting end to continue.

    Most of the time, given a correctly sized and well-written buffering data structure, the above scenario is lock-free, wait-free, and requires no system calls (ring3/ring0/ring3 transitions). Both consumer and producer threads are free to run as fast as their processing requirements allow and to utilise their full timeslices. The latter point is the key to maximum utilisation.

    If I use suspend/resume, buffer empty/full conditions are guaranteed not only to require multiple calls into the kernel, but also (at least) one very expensive context switch. If I use cond_vars and (unneeded) kernel mutexes, that also means an expensive call into the kernel for every read and write.

    The whole point of lock-free and wait-free algorithms is that they avoid both expensive calls into the kernel and expensive elective context switches (i.e. non-pre-emptive ceding of the CPU), in order to make full use of each time-slice allotted.

    The point of fast user-space mutexes (futexes) is that the uncontended path runs in user space, and is therefore faster.

    The (lock-free/wait-free) algorithms are getting better and better defined. The hardware support (CAS, XCHG and similar SMP atomic instructions) is getting better and better with every new generation of processors.
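
    For illustration, this is the general shape of a CAS retry loop using the GCC __sync builtins that expose those instructions; the counter and function names are made up for the example:

        #include <stdint.h>

        static volatile uint32_t push_count = 0;

        /* Atomically add 'n' without taking any lock: retry until the CAS wins. */
        uint32_t add_with_cas(uint32_t n)
        {
            uint32_t old, new_val;
            do {
                old     = push_count;
                new_val = old + n;
            } while (!__sync_bool_compare_and_swap(&push_count, old, new_val));
            return new_val;
        }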

    The current limitation is that the available locking, syncing and signalling mechanisms were designed for single-processor/core IPC purposes. Given that much of the HPC research is done on *nix boxes of one flavour or another, I know there are better mechanisms out there. This thread was meant to be about enlisting help to find them, not about arguing over whether they are possible, or even required.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      One last note and I'm done. I'm not sure which half of the internet you are looking at regarding mutexes, but a lot of the stuff that comes up on searches is fairly old. The Linux thread implementation was completely rewritten for 2.6 (about 6 years ago). I did not have a chance to look at the condition variable code, but I looked at the mutex code for more recent libc releases (2.12), and pthread_mutex_lock runs completely in user-space unless contention occurs. Only then does it go to the kernel to block the thread. You can also use pthread_mutex_trylock to see if contention would occur, and this also runs completely in user-space. This is mentioned in the NPTL white paper on pp. 8-9.
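
      A rough sketch of that pattern, just to make the fast path concrete; the mutex and function names here are illustrative:

          #include <pthread.h>
          #include <errno.h>

          static pthread_mutex_t buf_mtx = PTHREAD_MUTEX_INITIALIZER;

          int try_fast_path(void)
          {
              int rc = pthread_mutex_trylock(&buf_mtx);
              if (rc == 0) {
                  /* uncontended: the lock was taken entirely in user space */
                  /* ... touch the shared structure ... */
                  pthread_mutex_unlock(&buf_mtx);
                  return 1;          /* did the work without blocking */
              }
              if (rc == EBUSY)
                  return 0;          /* contended: caller may spin, retry, or block */
              return -1;             /* some other error */
          }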

      The futex API is exposed for these implementations, but it is surrounded by caveats of the 'you had better know what you are doing' kind. This approach would also limit your compatibility to Linux.
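
      For reference, the raw futex interface looks roughly like this; it is Linux-only and has no glibc wrapper, so it is invoked via syscall(2) (the two helper names are mine):

          #include <linux/futex.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          /* Sleep in the kernel only if *addr still holds 'expected_val';
           * otherwise return immediately, because the state already changed. */
          static long futex_wait(int *addr, int expected_val)
          {
              return syscall(SYS_futex, addr, FUTEX_WAIT, expected_val, NULL, NULL, 0);
          }

          /* Wake up to 'nwaiters' threads sleeping on 'addr'. */
          static long futex_wake(int *addr, int nwaiters)
          {
              return syscall(SYS_futex, addr, FUTEX_WAKE, nwaiters, NULL, NULL, 0);
          }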

      Lastly, I found this list of links to implementations of the kind of algorithms you are looking for. I took a look at the atomic_io in qprof, and it looks fairly complete, if primitive.

      It is likely that all of these will give you portability issues if you really need *nix flexibility.

      fnord

        First off, many thanks for all the links. I will read them avidly.

        With respect to out-of-date material: I am conscious that, without looking it up, I wouldn't even know the current version of anything *nix. So dates and versions go right over my head. In part, that's the reason for seeking expert advice.

        but I looked at the mutex code for more recent libc releases (2.12), and pthread_mutex_lock runs completely in user-space unless contention occurs.

        That's cool, and a great advance. But it still means acquiring a lock that I don't need or want, for every iteration at both ends. It may be that with modern mutexes and low contention, that is insignificant, but it still dictates a certain algorithmic logic that is effectively the reverse of what I actually need.

        It means acquiring the mutex every time, and signalling every time in case someone else was waiting on it. And doing that at both ends every time.

        The theory of my design is that you block or signal if you hit one end of the buffer or the other. The rest of the time, no blocking or signalling is required. Ultimately, the notion is to have the code monitor itself for the frequency of under/over-runs, and increase the size of the buffer (semi-automatically within bounds) if they are too frequent.

        If the buffer were unlimited, then everything could be added and everything removed without ever encountering a block, except at start-up and close-down. That's impractical for real-world systems, but so long as the producer produces just slightly faster than the consumers consume, that nirvana is theoretically achievable within practical memory bounds. Of course, there will always be some other threads that intervene and upset the apple cart, but the hope is to minimise them, not eliminate them completely.
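
        As a sketch of the sort of thing described above, a single-producer/single-consumer ring buffer can keep its common path to a couple of index updates with no locking or kernel calls, falling back to whatever wait mechanism is chosen only on full/empty. The names, size, and barrier choice here are illustrative, not taken from my actual design:

            #include <stdint.h>

            #define RING_SLOTS 4096                 /* must be a power of two */

            typedef struct {
                void              *slot[RING_SLOTS];
                volatile uint32_t  head;            /* written only by the producer */
                volatile uint32_t  tail;            /* written only by the consumer */
            } spsc_ring;

            /* Producer side: returns 0 if full (caller then waits, however it likes). */
            static int ring_push(spsc_ring *r, void *item)
            {
                uint32_t head = r->head;
                if (head - r->tail == RING_SLOTS)
                    return 0;                       /* full: slow path */
                r->slot[head & (RING_SLOTS - 1)] = item;
                __sync_synchronize();               /* publish the slot before the index */
                r->head = head + 1;
                return 1;
            }

            /* Consumer side: returns 0 if empty (caller then waits, however it likes). */
            static int ring_pop(spsc_ring *r, void **item)
            {
                uint32_t tail = r->tail;
                if (r->head == tail)
                    return 0;                       /* empty: slow path */
                *item = r->slot[tail & (RING_SLOTS - 1)];
                __sync_synchronize();               /* finish the read before releasing the slot */
                r->tail = tail + 1;
                return 1;
            }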

        I'm also looking at starting with a brief spin loop, falling back to a (context-switch-inducing) micro-sleep to save CPU. In theory this can reduce elective context switches by an order of magnitude or more.
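
        Something along these lines; the iteration count, sleep length, and predicate are all made-up tuning values for the sketch:

            #include <time.h>

            #define SPIN_TRIES 200                   /* illustrative tuning value */

            /* 'ready' is any predicate on the shared buffer, e.g. "not empty". */
            static void wait_until(int (*ready)(void))
            {
                for (int i = 0; i < SPIN_TRIES; ++i) {
                    if (ready())
                        return;                      /* caught up while still spinning in user space */
                    /* (on x86, a 'pause' instruction here eases a hyper-threaded sibling) */
                }
                while (!ready()) {                   /* still stalled: micro-sleep instead */
                    struct timespec ts = { 0, 100 * 1000 };   /* 100 microseconds */
                    nanosleep(&ts, NULL);            /* this is the elective context switch */
                }
            }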

        The other part of the reason for asking here is to find actual implementations for reference. Whilst using MS Event objects works quite well, I'm still seeking to do better. This isn't an urgent (for me) problem to be solved, but part of ongoing research into better mechanisms.

        If there is a widely ported, reliable library available, then it makes sense for me to try to use/crib from that for my Windows implementation: both because it is likely quicker for me to benefit from others' labours, and because, if this ever makes it to release, it would potentially make porting it to (some) *nixes easier should anyone care to take on that task.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.