When are lock free data structures less performant than mutual exclusion (mutexes)? When are lock free data structures less performant than mutual exclusion (mutexes)? multithreading multithreading

When are lock free data structures less performant than mutual exclusion (mutexes)?


A common approach to implementing a lock-free data structure is to have a mutable reference to an immutable object, and have anything that wants to change the structure grab the reference, produce a new version of the object with suitable changes applied, and then CompareExchange the reference to point to the new object. If the CompareExchange works, great. If not, ditch the new object, re-grab the reference, and start over.

This can work well if producing the new object is cheap and the level of contention is low enough that the CompareExchange will usually work. If there is considerable contention, and if producing the new object is slow, simultaneous attempted updates by N threads may take N^2 time to complete. As an extreme example, suppose 100 threads are running on a CPU, an update takes 100ms of CPU time (just over a time-slice), and all 100 threads attempt to update an object at once. During the first ten seconds, each thread will produce a new object based on the original one. One of the threads will successfully do the CompareExchange, while the others will all fail. Then during the next 9.9 seconds, 99 threads will generate new versions of the object, after which one will successfully post its update and 98 will fail. The net effect will be that the lock-free method will take 505 seconds' worth of CPU time to perform 100 updates, when a locking system could have done them in about 10 seconds.


lockless data structures will, one way or another, use atomic semantics from your architecture to perform its core operations. When you do this, you can be using the machines entire internal exclusion mechanisms to ensure correct ordering or fencing of data. A mutex or critical section also does this, but it only does it once for a single flag. Where the mutex or critical section is slow, is when the the lock acquisition fails (there is contention). In this case, the OS also invokes the scheduler to suspend the thread until the exclusion object has been released.

So it seems logical that whenever your lock-less data structure uses multiple atomic operations per core method when a single lock shielding a critical section could provide the same semantics AND there tends to be very little contention, in practice, for the data structure in question, then in fact, it does make more sense to use an OS-provided locking mechanism, than trying to build your own.


I don't know about making it slower, but it certainly makes it harder to get right. In the many cases where the two approaches are virtually identical in performance (or when it simply doesn't matter if it takes 500 pico-seconds rather than 100 pico-seconds), then pick the simplest approach - generally lock.

There are very few cases when that extra bit of performance is key; and if it is, I suspect you'd do well to use the pre-rolled pattern implementations from established libraries. Getting lock-free code working properly (and proving that it works properly in all conditions) is often very hard.

Note also that some environments offer a level of locking above the OS-provided mutex; mutex behaviour, but without some of the overheads (for example, Monitor in .NET).