
Cause of involuntary context switches


You mentioned there are 32 cores, but what is the exact layout of the hardware? E.g. how many packages the machine has, how many cores per package, how the caches are shared, etc. For sharing this kind of information I personally like the output of likwid-topology -g.

Anyway, there is one source of non-determinism in your run: thread affinity. The operating system assigns SW threads to specific HW threads without taking into account how the threads communicate (simply because it doesn't have that knowledge). That can cause all kinds of effects, so for reproducible runs it's a good idea to pin your SW threads to HW threads in some way (there may also be an optimal mapping, but so far I am only talking about determinism).

For pinning (a.k.a. affinity) you can either use explicit Pthread calls or you might try another tool from the Likwid suite called likwid-pin - see here.
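
For the explicit Pthread route, a minimal sketch might look like the following (Linux-specific, needs _GNU_SOURCE; the core id you pass in is a placeholder you would choose based on your topology):

    /* Pin the calling thread to one hardware thread.
       Linux-specific: CPU_SET and pthread_setaffinity_np need _GNU_SOURCE. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    int pin_self_to_core(int core_id)   /* core_id: placeholder, derive from your topology */
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);

        /* returns 0 on success, an errno value otherwise */
        int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (err != 0)
            fprintf(stderr, "pinning to core %d failed: %d\n", core_id, err);
        return err;
    }

likwid-pin does essentially the same thing from outside the program, without touching the code.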

If that doesn't get you consistent results, run a good profiler (e.g. Intel VTune) on your workload, making sure you capture both a faster run and a slower run, and then compare the results. In VTune you can use the compare feature, which shows two profiles side by side.


I believe your problem is actually a scheduling issue.

One cannot prevent one's process from being preempted from the CPU. The problem is that if a thread is preempted and then, on its next quantum, ends up on a different CPU, or more specifically on a CPU with a different L2 cache, then its memory accesses will all miss the cache and the data will have to be fetched from main memory. On the other hand, if the thread is scheduled back onto the same CPU, its data will likely still be in the cache, yielding much faster memory accesses.

Note that this behavior becomes more likely as the number of cores grows. And since it is essentially random where your thread will end up on its next quantum, this would explain the randomness of the performance.
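
If you want to see how often your threads are actually being preempted, Linux exposes per-thread counters via getrusage with RUSAGE_THREAD (a Linux extension that needs _GNU_SOURCE); a minimal sketch:

    /* Report how often the calling thread was switched out.
       ru_nivcsw = involuntary switches (preempted by the scheduler),
       ru_nvcsw  = voluntary switches (blocked on I/O, locks, ...). */
    #define _GNU_SOURCE
    #include <sys/resource.h>
    #include <stdio.h>

    void report_context_switches(const char *label)   /* label: arbitrary tag for the output */
    {
        struct rusage ru;
        if (getrusage(RUSAGE_THREAD, &ru) == 0)
            printf("%s: involuntary=%ld voluntary=%ld\n",
                   label, ru.ru_nivcsw, ru.ru_nvcsw);
    }

A slow run that also shows a high involuntary count would support the preemption/migration explanation.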

There are profiling tools out there that let you record where your threads are being scheduled, such as perf for Linux. These tools are usually specific to the topology on which you're executing your program; unfortunately no others come to mind right now. There are also ways of telling the OS to schedule threads on the same (or adjacent) CPUs, so they'll benefit from fewer cache misses. For that you can check this SO question.
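
If none of these tools are available to you, you can at least log from inside the program which CPU a thread is currently running on using sched_getcpu() (provided by glibc on Linux, needs _GNU_SOURCE); comparing the values over time shows whether a thread migrates between cores. A minimal sketch:

    /* Observe from inside the program whether a thread migrates between CPUs. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    void log_current_cpu(const char *label)   /* label: arbitrary tag for the output */
    {
        int cpu = sched_getcpu();   /* returns -1 on failure */
        if (cpu >= 0)
            printf("%s running on CPU %d\n", label, cpu);
    }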

I would suggest asking your admin which tools of this kind are available to you, so you can do proper profiling and control of your thread scheduling.


You mention a bi-modal performance profile which you see on one machine and not on the other. It is horrible, but it is normal, even for single-threaded applications.

The problem is that there are far too many factors in a Linux system (any kernel, regardless of the scheduler used) that affect application performance. It starts with address randomisation and ends with microscopic timing differences escalating into huge context-switch delays between processes.

Linux is not a real-time system. It just tries to be as efficient as possible on the average case.

There are a number of things you can do to minimize the performance variance:

Reduce the number of threads to the bare minimum necessary. Do not split every aspect of your problem into its own thread; only split into threads when it is really necessary, for example to feed CPUs with independent (!) number-crunching jobs. Try to do as much causally connected work as possible in one thread. Your threads should require as little communication with each other as possible. In particular, you should have no request/response patterns between threads, where the latencies add up.

Assume your OS is only able to do about 1000 context switches between threads/processes per second. That means a couple of hundred request/response transactions per second. If a benchmark on Linux shows you can do far more, ignore it.

Try to reduce the memory footprint of vital data. Scattered data tends to thrash the cache, with very subtle and hard-to-explain effects on performance.
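
As an illustration of that last point, keeping the fields the hot loop actually touches contiguous in memory (instead of interleaved with cold data) makes every fetched cache line fully useful. The field names below are made up:

    /* Sketch: splitting hot data from cold data (invented field names). */
    #include <stddef.h>

    /* Mixed layout: every access to `value` also drags 56 cold bytes into the cache. */
    struct record_mixed {
        double value;       /* read in the hot loop          */
        char   name[56];    /* cold: only used for reporting */
    };

    /* Split layout: the hot array stays dense, cold data lives elsewhere. */
    struct records_split {
        double *values;       /* contiguous, cache-friendly        */
        char  (*names)[56];   /* touched only outside the hot loop */
        size_t  count;
    };

    double sum_values(const struct records_split *r)
    {
        double sum = 0.0;
        for (size_t i = 0; i < r->count; ++i)
            sum += r->values[i];   /* streams through memory linearly */
        return sum;
    }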