
C++ Producer consumer queue with (very) fast and reliable handover


One way to amortise the overhead of locking and thread wakeup is to add a second queue and implement a double-buffering approach. This enables batch processing on the consumer side:

    template<typename F>
    std::size_t consume_all(F&& f)
    {
        // minimize the scope of the lock
        {
            std::lock_guard<std::mutex> lock(the_mutex);
            std::swap(the_queue, the_queue2);
        }
        // process all items from the_queue2 in batch
        for (auto& item : the_queue2)
        {
            f(item);
        }
        auto result = the_queue2.size();
        the_queue2.clear(); // clears the queue and preserves the memory. perfect!
        return result;
    }

Working sample code.
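In case the linked sample is unavailable, here is a self-contained sketch of the same double-buffering idea; the class and member names (`DoubleBufferQueue`, `produce`) are illustrative, not taken from the original sample:

```cpp
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Double-buffered producer/consumer queue: producers append to the_queue
// under the lock; the consumer swaps the two buffers and then processes
// the whole batch without holding the lock.
template <typename T>
class DoubleBufferQueue {
    std::mutex the_mutex;
    std::vector<T> the_queue;   // filled by producers
    std::vector<T> the_queue2;  // drained by the consumer
public:
    void produce(T item) {
        std::lock_guard<std::mutex> lock(the_mutex);
        the_queue.push_back(std::move(item));
    }

    template <typename F>
    std::size_t consume_all(F&& f) {
        {   // minimize the scope of the lock
            std::lock_guard<std::mutex> lock(the_mutex);
            std::swap(the_queue, the_queue2);
        }
        for (auto& item : the_queue2)
            f(item);
        auto result = the_queue2.size();
        the_queue2.clear(); // clears the batch but keeps the capacity
        return result;
    }
};
```

A consumer thread would simply call `consume_all` in a loop; each call drains everything produced since the previous call while holding the mutex only for the duration of a `swap`.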

This does not fix the latency issue, but it can improve throughput. If a hiccup occurs then the consumer will be presented with a large batch which can then be processed at full speed without any locking overhead. This allows the consumer to quickly catch up with the producer.


There are lots of things that might cause problems.

You probably need to try to profile the app and see where slowdowns might be occurring.

Some notes:

  • Are the consumer and producer in the same process? If so, a Windows CRITICAL_SECTION is much faster than a Mutex (a kernel mutex can be shared across processes; a critical section cannot).
  • Try to ensure all the queue memory is resident in physical memory. If pages have to be swapped in, that will slow things right down.
  • Be very careful setting your process to real-time priority; that is supposed to be for system processes. If the process does too much work, it can prevent a critical system process from getting CPU time, which can end very badly. Unless you absolutely need real time, just use HIGH_PRIORITY_CLASS.


The short answer is yes: from there it really is down to operating system management and thread scheduling. Real-time systems (RTSs) can bring those 50 microseconds down to about 15, and more importantly they can get rid of the outliers. Otherwise, spinning is the only answer. If there are more queues than cores, one idea is to have a fixed number of threads spinning so they can react immediately, with the remainder blocking. That would involve some kind of "master" queue thread constantly spinning to check all queues, which either processes items itself or hands them over to worker threads, some of which could also spin to save those 50 microseconds. It gets complicated, though.
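One common compromise between pure spinning and pure blocking is to spin for a bounded number of iterations before falling back to a condition-variable wait. The sketch below illustrates this; the class name, the atomic size counter, and the spin budget of 10000 iterations are all illustrative choices, not from the original answer:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Hybrid wait: the consumer spins briefly for low latency, then falls
// back to a condition-variable wait so an idle consumer does not burn
// a core forever.
template <typename T>
class SpinThenBlockQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<T> q;
    std::atomic<int> size{0};
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(m);
            q.push(std::move(item));
        }
        size.fetch_add(1, std::memory_order_release);
        cv.notify_one();
    }

    T pop() {
        // Spin phase: poll the size counter without taking the lock.
        for (int i = 0; i < 10000; ++i) {
            if (size.load(std::memory_order_acquire) > 0)
                break;
        }
        // Blocking phase: the predicate makes this correct even if the
        // spin phase saw nothing.
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !q.empty(); });
        T item = std::move(q.front());
        q.pop();
        size.fetch_sub(1, std::memory_order_relaxed);
        return item;
    }
};
```

Tuning the spin budget is workload-dependent: long enough to cover the typical producer inter-arrival gap, short enough not to waste a core when the queue goes quiet.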

Probably the best approach would be to use a single lock-free multiple-producer-single-consumer (MPSC) queue with a spinning consumer thread. All items that go into the queue would then probably need to derive from a common base type, and would need to carry some meta-information describing what to do with them.
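A minimal sketch of such a queue, based on Dmitry Vyukov's intrusive MPSC algorithm, where every item derives from a common node base that can carry the dispatch meta-info. All names here are illustrative, and this is a sketch rather than production code:

```cpp
#include <atomic>

// Intrusive node: every queued item derives from this common base and
// can carry whatever meta-info the consumer needs to dispatch it.
struct Node {
    std::atomic<Node*> next{nullptr};
    virtual ~Node() = default;
};

// Vyukov-style lock-free multiple-producer-single-consumer queue.
// push() may be called from any thread; pop() only from the single
// consumer thread.
class MpscQueue {
    std::atomic<Node*> head; // producers swing this with an exchange
    Node* tail;              // touched only by the consumer
    Node stub;               // dummy node so the queue is never empty
public:
    MpscQueue() : head(&stub), tail(&stub) {}

    void push(Node* n) {
        n->next.store(nullptr, std::memory_order_relaxed);
        Node* prev = head.exchange(n, std::memory_order_acq_rel);
        prev->next.store(n, std::memory_order_release);
    }

    // Returns nullptr when the queue is empty (or a push is mid-flight,
    // in which case the spinning consumer simply retries).
    Node* pop() {
        Node* t = tail;
        Node* next = t->next.load(std::memory_order_acquire);
        if (t == &stub) {            // skip over the dummy node
            if (!next) return nullptr;
            tail = next;
            t = next;
            next = t->next.load(std::memory_order_acquire);
        }
        if (next) { tail = next; return t; }
        if (t != head.load(std::memory_order_acquire))
            return nullptr;          // producer between its two stores
        push(&stub);                 // re-insert the stub to unlink t
        next = t->next.load(std::memory_order_acquire);
        if (next) { tail = next; return t; }
        return nullptr;
    }
};
```

The spinning consumer would loop on `pop()`, and either a virtual call on the base type or a type tag in the meta-info decides what to do with each item.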

Complicated, but possible. If I ever set it up, I might post some code as well.