VC++: Performance drop x20 when more threads than cpus but not under g++

c++ multithreading visual-c++ c++11 visual-studio-2013

With 8 threads your code is spinning, but each thread gets the lock before it loses its time slice, so it never has to be suspended.

As you add more threads, the contention level increases, and with it the chance that a thread will not be able to acquire the lock within its time slice. When this happens the thread is suspended and a context switch occurs to another thread, which the scheduler checks to see whether it can be woken up.

All this switching, suspending and waking up requires a transition from user mode to kernel mode, which is an expensive operation, so performance is significantly impacted.

To improve things, either reduce the number of threads contending for the lock or increase the number of cores available. In your example you're using a std::atomic number, so you don't need to lock in order to call ++ on it, since it's already thread safe.
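
To illustrate that last point, here is a minimal sketch, not the code from the question. The helper names count_with_mutex and count_with_atomic are made up for this example. Both versions should produce the same total, but the atomic one takes no lock at all:

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: each thread takes a mutex around a plain increment.
std::uint64_t count_with_mutex(int threads, int iterations)
{
    std::uint64_t counter = 0;
    std::mutex m;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<std::mutex> lock(m);
                ++counter;
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}

// Hypothetical helper: no mutex, the increment on std::atomic is itself thread safe.
std::uint64_t count_with_atomic(int threads, int iterations)
{
    std::atomic<std::uint64_t> counter(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i)
                ++counter;
        });
    for (auto& th : pool) th.join();
    return counter.load();
}
```

Calling either function with, say, 4 threads and 10000 iterations should return 40000. Timing the two on your own machine is the only way to see how much the lock costs there.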


The mutex causes contention between the threads anyway. However, if you use more threads than you have cores, not all of them can run at once even when they are ready, so they have to keep stopping and starting. This is known as context switching.

The only way you can "solve" this is to use fewer threads or get more cores.
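
One way to apply the "fewer threads" advice is to cap the worker count at the number of hardware threads. This is only a sketch: pick_thread_count is a hypothetical helper, and falling back to 1 when the core count is unknown is an assumption made for the example.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: never start more workers than the hardware can run at once.
unsigned pick_thread_count(unsigned requested)
{
    // hardware_concurrency() is only a hint and may return 0 if it cannot be determined.
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0)
        cores = 1; // assumed conservative fallback
    // Clamp the request into the range [1, cores].
    return std::min(std::max(requested, 1u), cores);
}
```

With this, asking for 9 threads on an 8-core machine would start 8, so every worker can stay on a core instead of being switched in and out.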


Your problem is that 8 threads store to a shared resource. (Loads are not the issue: loading a shared resource that can't be modified is safe and needs no lock.)

Thread count > core count means:
- not every thread can run on its own CPU
- there is more task scheduling
The mutex means:
- a thread that can't acquire the mutex goes to sleep and is put on a wait queue. (It seems the mutex implementation on Windows spins briefly, then puts the thread on the wait queue if it still hasn't acquired the mutex?)
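
The spin-then-wait idea guessed at above can be sketched like this. It is an illustration only, not how the Windows or VC++ runtime mutex is actually implemented, and lock_with_spin and spin_limit are hypothetical names.

```cpp
#include <mutex>

// Hypothetical helper: make a bounded number of non-blocking attempts,
// then fall back to a blocking lock(), which may put the thread to sleep.
// Returns true if the lock was obtained while spinning.
// Either way, the caller owns the mutex on return and must unlock it.
bool lock_with_spin(std::mutex& m, int spin_limit)
{
    for (int i = 0; i < spin_limit; ++i)
        if (m.try_lock())
            return true;  // acquired without blocking
    m.lock();             // give up spinning and wait for the mutex
    return false;
}
```

Under light contention the spin usually succeeds and the thread stays in user mode. Under heavy contention more calls fall through to the blocking path, which is where the sleep and wake-up cost comes in.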

Writing a lock-free algorithm is hard, but for your problem there is a way.

- If you can get more cores, get them.
- Use std::atomic<uint64_t> and delete the mutex; incrementing an atomic number is atomic by default (no explicit memory order is needed).
- If the thread count is not fixed, set it to the number of cores, and then bind each thread to a core.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

using namespace std::chrono;

// Each worker increments the shared counter forever. No mutex is needed
// because the increment on std::atomic is itself atomic.
void thread_loop(std::atomic<std::uint64_t>* counter)
{
    while (true)
    {
        (*counter)++;
    }
}

int main()
{
    int threads = 9;
    std::atomic<std::uint64_t> counter(0);
    std::cout << "Starting " << threads << " threads.." << std::endl;
    for (int i = 0; i < threads; ++i)
        // The workers never exit, so detach them instead of leaking with new.
        std::thread(&thread_loop, &counter).detach();
    std::cout << "Started " << threads << " threads.." << std::endl;
    while (1)
    {
        // Report the running total once per second.
        std::this_thread::sleep_for(seconds(1));
        std::cout << "Counter = " << counter.load() << std::endl;
    }
}

This may be faster. Enjoy ;-)
