Even faster inexpensive thread-safe counter? Even faster inexpensive thread-safe counter? multithreading multithreading

Even faster inexpensive thread-safe counter?


Operations from multiple cores on the same cache line contend in hardware. This is true for locked and for regular memory access. This is a real problem. Contending accesses do not scale at all when more cores are added. Scaling typically is hard negative.

You need to use multiple cache lines with each core using its own most of the time.

You could use a ThreadLocal<Holder> and class Holder { public int I; } for that. ThreadLocal supports enumerating all instances that have been created so that you can sum them. You also could use a struct that is padded to the cache line size. That's safer.

Note, that it is not important to use one counter per core. Per-thread is good enough because time quantums are incredibly long compared to increment operations. A few bad accesses are not a performance problem.

A faster option would be to use a Holder[]. Each thread draws a random array index once and then accesses that holder object. Array indexing is faster than thread local access. If the number of holder instances you use is much bigger (10x) than the number of threads there will be little contention. Most writes will go the same already cached line.

Instead of a random index you can use a List<Holder> and add items as more threads join the processing.


I thought I would offer some clarification about cache coherence and what the LOCK prefix does in Intel architectures. Since it's too long for a comment and also answers some points your raised I think it's appropriate to post as an answer.

In the MESI cache coherence protocol, any write to a cache line will cause the state to change to exclusive, regardless of whether you are using the LOCK prefix or not. So if two processors are both accessing the same cache line repeatedly, and at least 1 of the processors is doing writes, then the processors will experience cache line misses when accessing the line that they are sharing. Whereas if they were both only reading from the line then they would have cache line hits because they could both keep the line in their private L1 cache in the shared state.

What the LOCK prefix does is restrict the amount of speculative work that a processor can do while waiting for the the locked instruction to finish executing. Section 8.1.2 of the Intel 64 and IA-32 Architectures Software Developer’s Manual says:

Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor.

Under normal circumstances the processor is able to speculatively execute instructions while waiting for a cache miss to be resolved. But the LOCK prefix prevents this and essentially stalls the pipeline until the locked instruction finished execution.