What's the point of cache coherency? What's the point of cache coherency? multithreading multithreading

What's the point of cache coherency?


The short story is, non-cache coherent system are exceptionally difficult to program especially if you want to maintain efficiency - which is also the main reason even most NUMA systems today are cache-coherent.

If the caches wern't coherent, the "explicit steps" would have to enforce the coherency - explicit steps are usually things like critical sections/mutexes(e.g. volatile in C/C++ is rarly enough) . It's quite hard, if not impossible for services such as mutexes to keep track of only the memory that have changes and needs to be updated in all the caches -it would probably have to update all the memory, and that is if it could even track which cores have what pieces of that memory in their caches.

Presumable the hardware can do a much better and efficient job at tracking the memory addresses/ranges that have been changed, and keep them in sync.

And, imagine a process running on core 1 and gets preempted. When it gets scheduled again, it got scheduled on core 2.

This would be pretty fatal if the caches weren't choerent as otherwise there might be remnants of the process data in the cache of core 1, which doesn't exist in core 2's cache. Though, for systems working that way, the OS would have to enforce the cache coherency as threads are scheduled - which would probably be an "update all the memory in caches between all the cores" operation, or perhaps it could track dirty pages vith the help of the MMU and only sync the memory pages that have been changed - again, the hardware likely keep the caches coherent in a more finegrainded and effcient way.


There are some nuances not covered by the great responses from the other authors.

First off, consider that a CPU doesn't deal with memory byte-by-byte, but with cache lines. A line might have 64 bytes. Now, if I allocate a 2 byte piece of memory at location P, and another CPU allocates an 8 byte piece of memory at location P + 8, and both P and P + 8 live on the same cache line, observe that without cache coherence the two CPUs can't concurrently update P and P + 8 without clobbering each others changes! Because each CPU does read-modify-write on the cache line, they might both write out a copy of the line that doesn't include the other CPU's changes! The last writer would win, and one of your modifications to memory would have "disappeared"!

The other thing to bear in mind is the distinction between coherency and consistency. Because even x86 derived CPUs use store buffers, there aren't the guarantees you might expect that instructions that have already finished have modified memory in such a way that other CPUs can see those modifications, even if the compiler has decided to write the value back to memory (maybe because of volatile?). Instead the mods may be sitting around in store buffers. Pretty much all CPUs in general use are cache coherent, but very few CPUs have a consistency model that is as forgiving as the x86's. Check out, for example, http://www.cs.nmsu.edu/~pfeiffer/classes/573/notes/consistency.html for more information on this topic.

Hope this helps, and BTW, I work at Corensic, a company that's building a concurrency debugger that you may want to check out. It helps pick up the pieces when assumptions about concurrency, coherence, and consistency prove unfounded :)


Imagine you do this:

lock(); //some synchronization primitive e.g. a semaphore/mutexglobalint = somevalue;unlock();

If there were no cache coherence, that last unlock() would have to assure that globalint are now visible everywhere, with cache coherance all you need to do is to write it to memory and let the hardware do the magic. A software solution would have keep tack of which memory exists in which caches, on which cores, and somehow make sure they're atomically in sync.

You'd win an award if you can find a software solution that keeps track of all the pieces of memory that exist in the caches that needs to be keept in sync, that's more efficient than a current hardware solution.