Memory barrier vs Interlocked impact on memory caches coherency timing


To understand C# interlocked operations, you need to understand Win32 interlocked operations.

The "pure" interlocked operations themselves only affect the freshness of the data directly referenced by the operation.

But in Win32, the classic interlocked operations imply a full memory barrier. I believe this is mostly to avoid breaking old programs on newer hardware. So InterlockedAdd does two things: an interlocked add (comparatively cheap, it only touches the cache line holding the target) and a full memory barrier (a rather heavy operation).
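One way to picture that two-part behaviour is the portable C++ sketch below. It is an analogy, not the Win32 implementation: `add_with_full_barrier` is a hypothetical helper name, and `std::atomic` with explicit fences stands in for the interlocked add and the barrier. On some architectures (x86 in particular) the atomic instruction already acts as a full barrier, so the split is conceptual there.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper (not a Win32 API): models "atomic add + full barrier"
// as two separate pieces so each cost is visible.
inline long add_with_full_barrier(std::atomic<long>& target, long value) {
    // Full fence: earlier memory operations may not move below this point.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    // The "pure" atomic read-modify-write: only 'target' is involved.
    long result = target.fetch_add(value, std::memory_order_relaxed) + value;

    // Full fence: later memory operations may not move above this point.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    // Like InterlockedAdd, return the value after the addition.
    return result;
}
```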

Later, Microsoft realized this is expensive, and added versions of each operation that perform either no memory barrier or only a partial one.

So there are now (in the Win32 world) four versions of almost everything: e.g. InterlockedAdd (full fence), InterlockedAddAcquire (acquire semantics, a read fence), InterlockedAddRelease (release semantics, a write fence), and the pure InterlockedAddNoFence (no fence).
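The four flavours map fairly naturally onto the memory orderings of C++ `std::atomic`. The sketch below is only an analogy under that assumption: the `add_*` helper names are hypothetical, and the C++ orderings are roughly, not exactly, equivalent to the Win32 fence semantics.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helpers; each returns the value after the addition,
// as the InterlockedAdd family does.

// Roughly InterlockedAdd: sequentially consistent (full fence).
inline long add_full_fence(std::atomic<long>& t, long v) {
    return t.fetch_add(v, std::memory_order_seq_cst) + v;
}

// Roughly InterlockedAddAcquire: later accesses cannot move before it.
inline long add_acquire(std::atomic<long>& t, long v) {
    return t.fetch_add(v, std::memory_order_acquire) + v;
}

// Roughly InterlockedAddRelease: earlier accesses cannot move after it.
inline long add_release(std::atomic<long>& t, long v) {
    return t.fetch_add(v, std::memory_order_release) + v;
}

// Roughly InterlockedAddNoFence: atomic, but no ordering of other accesses.
inline long add_no_fence(std::atomic<long>& t, long v) {
    return t.fetch_add(v, std::memory_order_relaxed) + v;
}
```

All four are equally atomic with respect to the target variable; they differ only in how much reordering of the surrounding memory accesses they forbid.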

In the C# world, there is only one version, and it matches the "classic" InterlockedAdd: Interlocked.Add also performs a full memory fence.


Short answer: CAS (Interlocked) operations have been, and most likely will remain, the quickest cache flusher.

Background:
- CAS operations are supported in hardware by a single uninterruptible instruction. Compare that with a thread calling a memory barrier, which can be swapped out right after placing the barrier but just before performing any reads/writes (so the consistency guaranteed by the barrier is still met).
- CAS operations are the foundation for the majority (if not all) of high-level synchronization constructs (mutexes, semaphores, locks; look at their implementations and you will find CAS operations). They would not likely be used if they did not guarantee immediate cross-thread state consistency, or if there were other, faster mechanisms.
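To illustrate the second point, here is a minimal sketch of a lock built on a single CAS, written in portable C++. It is not how any particular runtime implements its mutexes; the `SpinLock` class is a hypothetical example, and real locks add back-off and OS-level waiting on top.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical minimal spinlock built on compare-and-swap.
class SpinLock {
public:
    // One CAS: succeeds only if the lock was free (false -> true).
    bool try_lock() {
        bool expected = false;
        return locked_.compare_exchange_strong(
            expected, true,
            std::memory_order_acquire,   // ordering when the CAS succeeds
            std::memory_order_relaxed);  // ordering when it fails
    }

    // Busy-wait until the CAS succeeds.
    void lock() {
        while (!try_lock()) {
            // spin; a production lock would pause or yield here
        }
    }

    // Release the lock and publish writes made while holding it.
    void unlock() {
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};
```

Because it exposes `lock`, `unlock` and `try_lock`, the class can be used with `std::lock_guard<SpinLock>` like any other lockable type.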


At least on Intel devices, a number of machine-code instructions can be prefixed with LOCK, which ensures that the following read-modify-write operation on memory is treated as atomic, even if the underlying data type won't fit on the data bus in one go. For example, LOCK CMPXCHG compares and swaps a memory operand without other processors observing an intermediate state. (LOCK is only valid on a limited set of read-modify-write instructions such as ADD, XADD, XCHG and CMPXCHG; it cannot be applied to string instructions such as REPNE SCASB.)

As far as I am aware, a barrier in the thread-synchronization sense is basically a CAS-based spinlock that causes a thread to wait for some condition to be met, such as no other threads having any work to do. This is clearly a higher-level construct than a hardware fence instruction (MFENCE on x86, which only orders memory accesses), but make no mistake: there's a condition check in there, it's likely to be atomic, and it's also likely to be CAS-protected. You're still going to pay the cache-line price when you reach such a barrier.
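As a small illustration of a LOCK-style read-modify-write, the sketch below implements an atomic "store the maximum" with a CAS loop in portable C++. `atomic_store_max` is a hypothetical helper name; on x86, compilers typically lower `compare_exchange_weak` to a LOCK CMPXCHG instruction, though that is a property of the compiler and target rather than something the C++ standard promises.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: raise 'target' to 'candidate' if candidate is larger.
// Returns the value observed just before our update, or the current value
// if it was already at least as large as 'candidate'.
inline int atomic_store_max(std::atomic<int>& target, int candidate) {
    int observed = target.load(std::memory_order_relaxed);
    // Retry while our candidate is still larger and the CAS keeps failing.
    // On failure, compare_exchange_weak refreshes 'observed' with the
    // value currently stored in 'target'.
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate)) {
        // nothing to do: 'observed' was reloaded by the failed CAS
    }
    return observed;
}
```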