VC++: Performance drop x20 when more threads than cpus but not under g++


With 8 threads your code is spinning, but each thread gets the lock before it loses its time slice, so the OS never has to suspend it.

As you add more threads, contention increases, and with it the chance that a thread cannot acquire the lock within its time slice. When that happens, the thread is suspended and a context switch to another thread occurs; the scheduler then has to check whether that thread can be woken up.

All this switching, suspending and waking up requires a transition from user mode to kernel mode, which is an expensive operation, so performance is significantly impacted.

To improve things, either reduce the number of threads contending for the lock or increase the number of cores available. In your example you're using a std::atomic number, so you don't need a lock in order to call ++ on it: the increment is already thread-safe.
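To see the effect of dropping the lock on your own machine, here is a minimal sketch that counts to the same total twice, once with a mutex wrapped around the atomic increment and once without. The helper names (`run_workers`, `count_with_mutex`, `count_lock_free`) are made up for this illustration, and the workers run a fixed number of iterations so that the program terminates.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper (not from the question): start `threads` workers that each
// call `work` `iters` times, wait for them, and return the elapsed milliseconds.
template <typename Work>
long long run_workers(int threads, int iters, Work work) {
    assert(threads > 0 && iters >= 0);
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([work, iters] {
            for (int i = 0; i < iters; ++i) work();
        });
    for (auto& th : pool) th.join();
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}

// Variant 1: the atomic increment is wrapped in a mutex it does not need.
std::uint64_t count_with_mutex(int threads, int iters, long long* ms = nullptr) {
    std::atomic<std::uint64_t> counter{0};
    std::mutex m;
    const long long t = run_workers(threads, iters, [&] {
        std::lock_guard<std::mutex> guard(m);
        ++counter;
    });
    if (ms) *ms = t;
    return counter.load();
}

// Variant 2: the same increment with the lock removed.
std::uint64_t count_lock_free(int threads, int iters, long long* ms = nullptr) {
    std::atomic<std::uint64_t> counter{0};
    const long long t = run_workers(threads, iters, [&] { ++counter; });
    if (ms) *ms = t;
    return counter.load();
}
```

Calling both variants with a thread count at and then above your core count, and passing a `long long` to receive the elapsed time, should show how much of the slowdown comes from the lock. The actual numbers depend on the compiler, the standard library and the machine.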


The mutex causes contention between the threads anyway. However, if you try to use more threads than you have cores, not all of them can run at once even when they are ready, so they have to keep stopping and starting. This is known as context switching.

The only way you can "solve" this is to use fewer threads or get more cores.
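For the "fewer threads" option, one possible approach is to cap the worker count at whatever the runtime reports as the number of hardware threads. The sketch below assumes a hypothetical helper named `pick_thread_count`; note that `std::thread::hardware_concurrency()` is only a hint and may return 0.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: cap a requested worker count at the number of hardware
// threads the runtime reports, and never return less than 1.
unsigned pick_thread_count(unsigned requested) {
    // hardware_concurrency() is only a hint and may return 0 when unknown.
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;  // assumption: fall back to a single worker
    const unsigned chosen = std::max(1u, std::min(requested, cores));
    assert(chosen >= 1 && chosen <= cores);  // sanity check of the invariant
    return chosen;
}
```

With this, `int threads = 9;` in the example would become something like `unsigned threads = pick_thread_count(9);`.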


Your problem is that 8 threads are storing to a shared resource. (Loads are a different matter: loading from a shared resource that cannot be modified is safe and needs no lock.)

  1. 8 threads > number of cores means
    • not every thread can have a CPU to itself
    • the scheduler has to switch between tasks more often
  2. mutex
    • a thread that can't acquire the mutex goes to sleep and is put on a wait queue. (It seems the Windows mutex implementation spins briefly first, then puts the thread on the wait queue if it still hasn't acquired the mutex?)
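The "short spin, then queue" behaviour guessed at above can be illustrated in portable C++. This is only a sketch of the general pattern, not the way any real `std::mutex` is implemented; the function name `lock_spin_then_block` and the `spin_limit` parameter are invented for the example.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Illustration only: retry the lock a bounded number of times, then block.
// `spin_limit` is a made-up tuning knob, not a value taken from any real
// std::mutex implementation.
std::unique_lock<std::mutex> lock_spin_then_block(std::mutex& m, int spin_limit) {
    assert(spin_limit >= 0);
    std::unique_lock<std::mutex> lock(m, std::defer_lock);
    for (int i = 0; i < spin_limit; ++i) {
        if (lock.try_lock()) return lock;  // acquired without going to sleep
    }
    lock.lock();  // give up spinning; the OS may put this thread to sleep here
    return lock;
}
```

While contention is low, most acquisitions succeed inside the loop and never reach the blocking call. As contention grows, more threads fall through to `lock()`, which is where the sleeping and waking described above comes in.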

Writing lock-free algorithms is hard, but for your problem there is a way.

  1. If you can get more cores, get them.
  2. Use std::atomic<uint64_t> and delete the mutex. Incrementing an atomic is atomic by default (no special memory order is needed).
  3. If the number of threads is not fixed, set it to the number of cores, then bind each thread to a core.

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <thread>
    #include <vector>

    using namespace std::chrono;

    // Each worker increments the shared atomic counter forever; no mutex is needed.
    void thread_loop(std::atomic<std::uint64_t>* counter)
    {
        while (true)
        {
            (*counter)++;
        }
    }

    int main(int argc, char* argv[])
    {
        int threads = 9;
        std::atomic<std::uint64_t> counter(0);

        std::cout << "Starting " << threads << " threads.." << std::endl;
        // Keep the std::thread objects in a vector instead of leaking them with new.
        std::vector<std::thread> workers;
        for (int i = 0; i < threads; ++i)
            workers.emplace_back(&thread_loop, &counter);
        std::cout << "Started " << threads << " threads.." << std::endl;

        // Report the counter once per second; the program runs until it is killed.
        while (1)
        {
            std::this_thread::sleep_for(seconds(1));
            std::cout << "Counter = " << counter.load() << std::endl;
        }
    }

This may be faster. Enjoy ;-)
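The third suggestion (one thread per core, each bound to a core) has no portable standard C++ API, so the following is only a rough sketch under stated assumptions. It assumes `std::thread::native_handle()` yields a Win32 `HANDLE` under MSVC and a `pthread_t` on Linux, and it treats pinning as best effort. The names `pin_to_core` and `count_pinned` are hypothetical, and the workers run a fixed number of iterations so the program terminates.

```cpp
#if defined(_MSC_VER)
#define NOMINMAX
#include <windows.h>
#elif defined(__linux__)
#include <pthread.h>
#include <sched.h>
#endif

#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Best-effort pinning of `t` to logical core `core`. Returns false when the
// platform is not handled here or the OS rejects the request.
bool pin_to_core(std::thread& t, unsigned core) {
#if defined(_MSC_VER)
    if (core >= sizeof(DWORD_PTR) * 8) return false;  // mask cannot express it
    const DWORD_PTR mask = static_cast<DWORD_PTR>(1) << core;
    return SetThreadAffinityMask(static_cast<HANDLE>(t.native_handle()), mask) != 0;
#elif defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(t.native_handle(), sizeof(set), &set) == 0;
#else
    (void)t;
    (void)core;
    return false;
#endif
}

// One worker per reported core, each pinned before it starts counting.
// Every worker adds `iters` to a shared atomic; the final total is returned.
std::uint64_t count_pinned(int iters) {
    assert(iters >= 0);
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;  // assumption: fall back to a single worker

    std::atomic<std::uint64_t> counter{0};
    std::atomic<bool> go{false};  // start gate so pinning happens before the work
    std::vector<std::thread> pool;
    for (unsigned c = 0; c < cores; ++c) {
        pool.emplace_back([&counter, &go, iters] {
            while (!go.load()) std::this_thread::yield();
            for (int i = 0; i < iters; ++i) counter.fetch_add(1);
        });
        pin_to_core(pool.back(), c);  // result ignored: pinning is optional here
    }
    go.store(true);
    for (auto& th : pool) th.join();
    return counter.load();
}
```

Whether pinning helps at all depends on the scheduler and on what else is running on the machine, so it is worth measuring with and without the `pin_to_core` call.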