Does the Java Memory Model (JSR-133) imply that entering a monitor flushes the CPU data cache(s)?


the absolute need to synchronize, even if cache coherency is guaranteed in hardware

Yes, but then you only have to reason against the Java Memory Model, not against whatever hardware architecture your program happens to run on. Plus, it's not only about the hardware: the compiler and the JIT may themselves reorder instructions, causing visibility issues. Synchronization constructs in Java address visibility and atomicity consistently at every level of code transformation (e.g. compiler/JIT/CPU/cache).
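To illustrate the visibility point, here is a minimal sketch (the class and method names are made up for this example). A worker spins on a flag that is only ever read and written while holding the same monitor. Under the JMM, releasing the monitor in stop() happens-before the next acquisition in isStopped(), so the worker is guaranteed to see the update no matter what the compiler, JIT or CPU would otherwise do. Without the synchronized keyword, the JIT would be allowed to hoist the read out of the loop.

```java
final class StopFlagDemo {
    private boolean stopped = false; // guarded by the monitor of `this`

    synchronized void stop() {
        stopped = true;
    }

    synchronized boolean isStopped() {
        return stopped;
    }

    /** Starts a spinning worker, asks it to stop, and reports whether it terminated in time. */
    static boolean workerStops(long timeoutMillis) {
        StopFlagDemo flag = new StopFlagDemo();
        Thread worker = new Thread(() -> {
            while (!flag.isStopped()) {
                // spin: every iteration re-reads the flag under the lock
            }
        });
        worker.setDaemon(true); // do not keep the JVM alive if something goes wrong
        worker.start();
        try {
            Thread.sleep(50);   // let the worker spin for a while
            flag.stop();        // write under the same monitor
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + workerStops(10_000));
    }
}
```

Nothing in this code mentions caches or fences; the reasoning is done purely in terms of the monitor.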

and on the other hand bad performance on incoherent architectures (full cache flushes)

Maybe I misunderstood something, but on incoherent architectures you have to synchronize critical sections anyway. Otherwise, you'll run into all sorts of race conditions due to reordering. I don't see why the Java Memory Model makes matters any worse.

shouldn't it be more strict (require information what is guarded by a monitor)

I don't think it's possible to tell the CPU to flush any particular part of the cache at all. The best the compiler can do is emit memory fences and let the CPU decide which parts of the cache need flushing - that is still more coarse-grained than what you're looking for, I suppose. Even if more fine-grained control were possible, I think it would make concurrent programming even more difficult (it's difficult enough already).

AFAIK, the Java 5 MM (just like the .NET CLR MM) is more "strict" than the memory models of common architectures like x86 and IA64, which makes reasoning about it relatively simpler. Yet it obviously shouldn't offer something closer to sequential consistency, because that would hurt performance significantly: fewer compiler/JIT/CPU/cache optimizations could be applied.


Existing architectures guarantee cache coherency, but they do not guarantee sequential consistency - the two things are different. Since sequential consistency is not guaranteed, the hardware is allowed to perform some reorderings, and you need critical sections to limit them. Critical sections make sure that what one thread writes becomes visible to another (i.e., they prevent data races), and they also prevent the classical race conditions (if two threads increment the same variable, each thread's read of the current value and its write of the new value must be indivisible).
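The increment case can be sketched as follows (the class name and the iteration count are arbitrary choices for this example). Two threads increment a shared counter; because increment() is synchronized, each read-modify-write is indivisible and no update is lost. The final read after join() is also well-defined, since a thread's actions happen-before another thread's return from join() on it.

```java
final class SyncCounterDemo {
    private long count = 0; // guarded by the monitor of `this`

    synchronized void increment() {
        count++; // read + add + write, indivisible while the lock is held
    }

    synchronized long get() {
        return count;
    }

    /** Runs two threads that each increment a shared counter perThread times. */
    static long run(int perThread) {
        SyncCounterDemo counter = new SyncCounterDemo();
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                counter.increment();
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(run(100_000)); // prints 200000
    }
}
```

If the synchronized keyword is removed from increment(), the result may come out lower than 200000 even on fully cache-coherent hardware, because coherency alone does not make the read and the write a single step.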

Moreover, the execution model isn't as expensive as you describe. On most existing architectures, which are cache-coherent but not sequentially consistent, when you release a lock you must flush pending writes to memory, and when you acquire one you might need to do something to make sure that future reads will not see stale values. Since the cache is kept coherent, that mostly means just preventing reads from being moved ahead of the lock acquisition.

Finally, you seem to think that Java's Memory Model (JMM) is peculiar, whereas its foundations are nowadays fairly state-of-the-art and similar to those of Ada, POSIX locks (depending on the interpretation of the standard), and the C/C++ memory model. You might want to read the JSR-133 cookbook, which explains how the JMM is implemented on existing architectures: http://g.oswego.edu/dl/jmm/cookbook.html.


The answer would be that most multiprocessors are cache-coherent, including big NUMA systems, which are (as far as I know) almost always ccNUMA.

I think you are somewhat confused as to how cache coherency is accomplished in practice. First, caches may be coherent/incoherent with respect to several other things on the system:

  • Devices
  • (Memory modified by) DMA
  • Data caches vs instruction caches
  • Caches on other cores/processors (the one this question is about)
  • ...

Something has to be done to maintain coherency. When working with devices and DMA on architectures whose caches are incoherent with respect to DMA/devices, you would either bypass the cache (and possibly the write buffer), or invalidate/flush the cache around operations involving DMA/devices.

Similarly, when dynamically generating code, you may need to flush the instruction cache.

When it comes to CPU caches, coherency is achieved using some coherency protocol, such as MESI, MOESI, ... These protocols define messages to be sent between caches in response to certain events (e.g. invalidate requests to other caches when a non-exclusive cache line is modified, ...).

While this is sufficient to maintain (eventual) coherency, it doesn't guarantee ordering, or that changes are immediately visible to other CPUs. Then, there are also write buffers, which delay writes.

So, each CPU architecture provides ordering guarantees (e.g. accesses before an aligned store cannot be reordered after the store) and/or provides instructions (memory barriers/fences) to request such guarantees. In the end, entering/exiting a monitor doesn't entail flushing the cache, but may entail draining the write buffer and/or stalling until pending reads complete.
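At the Java level you never issue these barriers yourself; you request the ordering and the JVM emits whatever the target CPU needs. A small sketch using a volatile flag rather than a monitor (the class and method names are invented for this example): the write to the plain field payload cannot be reordered after the volatile write to ready, and the read of payload cannot be reordered before the volatile read that observed ready == true.

```java
final class PublishDemo {
    private int payload = 0;                // plain field, no lock
    private volatile boolean ready = false; // accesses to this flag carry the ordering

    void publish(int value) {
        payload = value; // ordinary write
        ready = true;    // volatile write: the write above may not move below it
    }

    int awaitPayload() {
        while (!ready) { // volatile read: the read below may not move above it
            Thread.yield();
        }
        return payload;  // guaranteed to see the value written before ready = true
    }

    /** Publishes value from a second thread and returns what the calling thread observes. */
    static int run(int value) {
        PublishDemo box = new PublishDemo();
        Thread writer = new Thread(() -> box.publish(value));
        writer.start();
        int seen = box.awaitPayload();
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run(42)); // prints 42
    }
}
```

How this maps to machine instructions depends on the platform, and on a cache-coherent machine none of it amounts to flushing a cache.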