
Semaphore without destruction/unmapping race condition


First of all, let me bring up three alternative approaches you may wish to consider.

  • Approach #1 (X86-specific, fast): CMPXCHG8B/CMPXCHG16B.

    x86 platforms have a double-pointer-width atomic compare-and-exchange operation. On 32-bit this is the 8-byte CMPXCHG8B; on 64-bit, CMPXCHG16B atomically compares and exchanges a full 16 bytes of data. Using it, you can swap both the wait count and the semaphore count in a single atomic operation. A futex can only wait on a single 32-bit field, but that shouldn't be a severe problem in this case.

  • Approach #2 (portable, limited): Packed counts.

    If a limit of 2^16 for waiter and semaphore counts is acceptable, simply pack both counts in a single 32-bit field.

  • Approach #3 (portable, has some overhead): Use a semaphore-in-a-semaphore to protect the post race.

    Reserve 8 bits of the semaphore count for a lock over post operations. The post operation will increment this counter (blocking if it would overflow) at the same time as it increments the true semaphore count. It will then do its work with the waiters field, and then decrement the lock counter atomically. After decrementing, if the previous value was 255, wake up all waiters (although this causes a thundering herd, it should be extremely rare).

    Upon deletion, acquire the lock 255 times (you can increment by more than one in a single step), blocking as necessary. Once the lock has been acquired 255 times, you know that all posts have completed, and it is safe to delete the semaphore.

    Downside: Posts now require two atomic compare-exchanges, and the maximum semaphore count is 2^24-1. Additionally, recursively entering an asynchronous signal handler 255 times will deadlock.

These approaches are simpler, easier to prove correct, and likely faster. However, their limitations may mean they are unacceptable for your case (though the CMPXCHG8B approach should work quite well on x86). And here's one more:

  • Approach #4 (kinda-arch independent; complex; fast): Modify the kernel

    One option here would be to modify the kernel to allow a low-overhead, safe way of reading the waiter field without segfaulting if the memory has been freed. For example, you could add a syscall that registers a thread-specific data structure; in that thread-specific data page you could have a 'fault handler jump address'. If the program segfaults and the jump address is nonzero, the kernel simply jumps there instead of raising SIGSEGV. Alternatively, you could have a similar mechanism that simply suppresses the offending instruction.

    Now all you have to do is:

    • At libc init and thread startup, register these thread-specific structures, and save a pointer to them in TLS data
    • In post, arrange for faults to be suppressed around the waiter count. If a fault does occur, don't do the wakeup (the wakeup is harmless if the memory in question is reused for a different purpose)

    And there you go - you get the fast algorithm, but you also get protection against the deletion race. But you have to hack the kernel's segv handlers to do it. It might be worth looking at SEH on Windows; a similar mechanism would work very well here.

In any case, I don't see anything wrong with your approach offhand, but I may be missing something. It might be good to raise it on the appropriate mailing lists and to consult the futex maintainers; they would probably be interested in adding kernel support to make this easier for you.