
Windows 10 poor performance compared to Windows 7 (page fault handling is not scalable, severe lock contention when no of threads > 16)


Microsoft seems to have fixed this issue with the Windows 10 Fall Creators Update and Windows 10 Pro for Workstations.

Here is the updated graph.

[updated benchmark graph]

Win 10 FCU and WKS have lower overhead than Win 7. In exchange, VirtualLock now seems to have higher overhead.
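For reference, the "VirtualLock" strategy being compared here is the pre-faulting pattern: commit the memory, then lock and immediately unlock it so the kernel pulls in all pages with a single call instead of taking one soft fault per page on first touch. A minimal sketch of that pattern (the buffer size and working-set limits are arbitrary placeholders, not the exact benchmark code):

#include <Windows.h>
#include <cstring>

int main()
{
    const size_t size = 64 * 1024 * 1024;  // arbitrary example size

    // VirtualLock is capped by the minimum working set, so raise it first.
    // These limits are placeholders; the call may fail without sufficient quota.
    SetProcessWorkingSetSize(GetCurrentProcess(),
                             size_t(256) * 1024 * 1024,
                             size_t(1024) * 1024 * 1024);

    void* p = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (!p) return 1;

    // Lock/unlock forces every page into the working set in one kernel call,
    // so the subsequent write pass takes no per-page soft faults.
    VirtualLock(p, size);
    VirtualUnlock(p, size);

    std::memset(p, 0xaa, size);  // little to no KiPageFault activity expected here

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}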


Unfortunately not an answer, just some additional insight.

A little experiment with a different allocation strategy:

#include <Windows.h>
#include <thread>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <atomic>
#include <iostream>
#include <chrono>
#include <functional>   // std::function
#include <vector>       // std::vector
#include <cstring>      // std::memset, std::memcpy

class AllocTest
{
public:
    virtual void* Alloc(size_t size) = 0;
    virtual void Free(void* allocation) = 0;
};

// Allocates on the calling thread; the pages are soft-faulted in on first touch.
class BasicAlloc : public AllocTest
{
public:
    void* Alloc(size_t size) override {
        return VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    }
    void Free(void* allocation) override {
        VirtualFree(allocation, 0, MEM_RELEASE);
    }
};

// Funnels all VirtualAlloc/VirtualFree calls through one dedicated thread and
// pre-faults the pages with VirtualLock/VirtualUnlock before handing them out.
class ThreadAlloc : public AllocTest
{
public:
    ThreadAlloc() {
        t = std::thread([this]() {
            std::unique_lock<std::mutex> qlock(this->qm);
            do {
                this->qcv.wait(qlock, [this]() {
                    return shutdown || !q.empty();
                });
                {
                    std::unique_lock<std::mutex> rlock(this->rm);
                    while (!q.empty())
                    {
                        q.front()();
                        q.pop();
                    }
                }
                rcv.notify_all();
            } while (!shutdown);
        });
    }
    ~ThreadAlloc() {
        {
            std::unique_lock<std::mutex> lock1(this->rm);
            std::unique_lock<std::mutex> lock2(this->qm);
            shutdown = true;
        }
        qcv.notify_all();
        rcv.notify_all();
        t.join();
    }
    void* Alloc(size_t size) override {
        void* target = nullptr;
        {
            std::unique_lock<std::mutex> lock(this->qm);
            q.emplace([this, &target, size]() {
                target = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
                VirtualLock(target, size);
                VirtualUnlock(target, size);
            });
        }
        qcv.notify_one();
        {
            std::unique_lock<std::mutex> lock(this->rm);
            rcv.wait(lock, [&target]() {
                return target != nullptr;
            });
        }
        return target;
    }
    void Free(void* allocation) override {
        {
            std::unique_lock<std::mutex> lock(this->qm);
            q.emplace([allocation]() {
                VirtualFree(allocation, 0, MEM_RELEASE);
            });
        }
        qcv.notify_one();
    }
private:
    std::queue<std::function<void()>> q;
    std::condition_variable qcv;
    std::condition_variable rcv;
    std::mutex qm;
    std::mutex rm;
    std::thread t;
    std::atomic_bool shutdown{ false };
};

int main()
{
    // Raise the working set limits so VirtualLock can pin large allocations.
    SetProcessWorkingSetSize(GetCurrentProcess(),
                             size_t(4) * 1024 * 1024 * 1024,
                             size_t(16) * 1024 * 1024 * 1024);

    BasicAlloc alloc1;
    ThreadAlloc alloc2;
    AllocTest* allocator = &alloc2;   // switch to &alloc1 to test BasicAlloc
    const size_t buffer_size = 1 * 1024 * 1024;
    const size_t buffer_count = 10 * 1024;
    const unsigned int thread_count = 32;

    std::vector<void*> buffers;
    buffers.resize(buffer_count);
    std::vector<std::thread> threads;
    threads.resize(thread_count);

    void* reference = allocator->Alloc(buffer_size);
    std::memset(reference, 0xaa, buffer_size);

    // Each worker allocates, copies the 1 MB reference buffer in, and frees again,
    // so the page faults land either in memcpy (BasicAlloc) or on the allocator
    // thread (ThreadAlloc).
    auto func = [&buffers, allocator, buffer_size, buffer_count, reference, thread_count](int thread_id) {
        for (int i = thread_id; i < buffer_count; i += thread_count) {
            buffers[i] = allocator->Alloc(buffer_size);
            std::memcpy(buffers[i], reference, buffer_size);
            allocator->Free(buffers[i]);
        }
    };

    for (int i = 0; i < 10; i++)
    {
        std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
        for (int t = 0; t < thread_count; t++) {
            threads[t] = std::thread(func, t);
        }
        for (int t = 0; t < thread_count; t++) {
            threads[t].join();
        }
        std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
        std::cout << duration << std::endl;
    }

    DebugBreak();
    return 0;
}

Under all sane conditions, BasicAlloc is faster, just as it should be. In fact, on a quad-core CPU (no HT), there is no configuration in which ThreadAlloc could outperform it; ThreadAlloc is consistently around 30% slower. (Which is actually surprisingly little, and it holds true even for tiny 1 kB allocations!)

However, once the CPU has around 8-12 virtual cores, it eventually reaches the point where BasicAlloc actually scales negatively, while ThreadAlloc just "stalls" at the baseline overhead of soft faults.

If you profile the two allocation strategies, you can see that at low thread counts the time spent in KiPageFault shifts from the memcpy in BasicAlloc to the VirtualLock call in ThreadAlloc.
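A profiler is not strictly required to see where the faults land; the process page-fault counter should show the same shift. A minimal sketch using GetProcessMemoryInfo (separate from the benchmark above, with an arbitrary 64 MB buffer; link against Psapi.lib):

#include <Windows.h>
#include <Psapi.h>   // GetProcessMemoryInfo
#include <cstring>
#include <iostream>

// Cumulative page-fault count of the current process (soft and hard faults).
static DWORD FaultCount()
{
    PROCESS_MEMORY_COUNTERS pmc = {};
    pmc.cb = sizeof(pmc);
    GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc));
    return pmc.PageFaultCount;
}

int main()
{
    const size_t size = 64 * 1024 * 1024;  // arbitrary

    // Needed so VirtualLock can pin the 64 MB region below; values are placeholders.
    SetProcessWorkingSetSize(GetCurrentProcess(),
                             size_t(256) * 1024 * 1024,
                             size_t(1024) * 1024 * 1024);

    // BasicAlloc-style path: first touch happens in the write pass.
    void* p = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (!p) return 1;
    DWORD before = FaultCount();
    std::memset(p, 0xaa, size);
    std::cout << "faults taken inside the write pass: " << (FaultCount() - before) << std::endl;

    // Pre-faulted path: after VirtualLock/VirtualUnlock the write pass should
    // take (almost) no faults; they were taken inside VirtualLock instead.
    void* q = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (!q) return 1;
    VirtualLock(q, size);
    VirtualUnlock(q, size);
    before = FaultCount();
    std::memset(q, 0xaa, size);
    std::cout << "faults in the write pass after pre-faulting: " << (FaultCount() - before) << std::endl;

    VirtualFree(p, 0, MEM_RELEASE);
    VirtualFree(q, 0, MEM_RELEASE);
    return 0;
}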

At higher thread and core counts, ExpWaitForSpinLockExclusiveAndAcquire eventually climbs from virtually zero load to as much as 50% with BasicAlloc, while ThreadAlloc only keeps the constant overhead of KiPageFault itself.

The stall with ThreadAlloc is also pretty bad, though. No matter how many cores or NUMA nodes your system has, new allocations are currently hard-capped to around 5-8 GB/s across all processes in the system, limited solely by single-thread performance. All the dedicated memory-management thread achieves is not wasting CPU cycles on a contended critical section.
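For anyone who wants to put a number on that cap for their own machine, touching each page of a large fresh allocation on a single thread gives a rough estimate (a minimal sketch; the 1 GiB size is arbitrary, and the figure includes the kernel's demand-zeroing of the pages):

#include <Windows.h>
#include <chrono>
#include <iostream>

int main()
{
    const size_t size = size_t(1) * 1024 * 1024 * 1024;  // 1 GiB, arbitrary
    const size_t page = 4096;

    char* p = static_cast<char*>(
        VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
    if (!p) return 1;

    auto t1 = std::chrono::high_resolution_clock::now();
    // The first write to each page takes one demand-zero soft fault (KiPageFault).
    for (size_t offset = 0; offset < size; offset += page)
        p[offset] = 1;
    auto t2 = std::chrono::high_resolution_clock::now();

    double seconds = std::chrono::duration<double>(t2 - t1).count();
    std::cout << (size / (1024.0 * 1024.0 * 1024.0)) / seconds
              << " GB/s of freshly committed memory touched on one thread" << std::endl;

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}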

You would have expected Microsoft to have a lock-free strategy for assigning pages on different cores, but apparently that's not even remotely the case.


The spin-lock was already present in the Windows 7 and earlier implementations of KiPageFault. So what did change?

Simple answer: KiPageFault itself became much slower. No clue what exactly caused the slowdown, but the spin-lock never showed up as an obvious limit before, simply because 100% contention was never possible.

If someone wishes to disassemble KiPageFault to find the most expensive part - be my guest.