Which CPU architectures support Compare And Swap (CAS)? Which CPU architectures support Compare And Swap (CAS)? multithreading multithreading

Which CPU architectures support Compare And Swap (CAS)?


Powerpc has more powerful primitives available: "lwarx" and "stwcx"

lwarx loads a value from memory but remembers the location. Any other thread or cpu that touches that location will cause the "stwcx", a conditional store instruction, to fail.

So the lwarx /stwcx combo allows you to implement atomic increment / decrement, compare and swap, and more powerful atomic operations like "atomic increment circular buffer index"


Sorry for a lot of letters. :(

Almost all instructions in the x86 ISA (except so called string instructions, and maybe few others), including CMPXCHG, are atomic in the context of unicore CPU. This is because according to the x86 architecture, CPU checks for arrived interrupts after each instruction execution completion and never in the middle. As a result, interrupt request can be detected and it handling be launched only on boundary between execution of two consecutive instructions. Due to this all memory references taken by CPU during execution of single instruction are isolated and can't be interleaved by any other activities. That behavior is common for unicore and multicore CPUs. But if in the context of unicore CPU there is only one unit of the system that performs access to the memory, in the context of multicore CPU there are more then one unit of the system which performs access to the memory simultaneously. Instruction isolation isn't enough for consistency in such environment, because memory accesses made by different CPUs in the same time can interleave each other. Due to this additional protection layer must be applied to the data changing protocol. For the x86 this layer is lock prefix, that initiates atomic transaction on the system bus.

enter image description here

Summary: It is safe and less costly to use sync instructions like CMPXCHG, XADD, BTS, etc. without lock prefix if you assured, that the data accessed by this instruction can be accessed only by one core. If you are not assured in this, apply lock prefix to provide safety by trading off performance.

There are two major approach for hardware synchronization support by CPU:

  1. Atomic transaction based.
  2. Cache coherence protocol based.

No one is silver bullet. Both approaches have they advantages and disadvantages.

Atomic transactions based approach relies to the supporting of the special type of transactions on the memory bus. During such transaction only one agent (CPU core) connected to the bus is eligible to access memory. As result, on the one hand, all memory references made by the bus owner during atomic transaction are assured to be made as a single uninterruptible transaction. On the another hand all other bus agents (CPU cores) will be enforced to wait the atomic transaction completion, to get back the ability to access memory. It doesn't matter, what memory cells they want to access, even if they want to access the memory region that is not referenced by bus owner during atomic transaction. As result extensive use of lock prefixed instructions will slow down the system significantly. On the other hand, due to the fact that the bus arbiter gives access to the bus for each bus agent according to the round robin scheduling, there is a guarantee that each bus agent will have relatively fair access to the memory and all agents will be able to made progress and made it with the same speed. In addition, ABA problem come into the play in case of atomic transactions, because by its nature, atomic transactions is very short (few memory references made by single instruction) and all actions taken on memory during transaction rely only to the value of memory region, without taking into the account, is that memory region was accessed by some one else between two transactions.Good example of atomic transaction based sync support is x86 architecture, in which lock prefixed instructions enforce CPU execute them in atomic transactions.

Cache coherence protocol based approach rely to the fact that the memory line can be cached only in the one L1 cache in the one instant of time. The memory access protocol in cache coherence system is similar to next sequence of actions:

  1. CPU A store the memory line X in L1 cache. In the same time CPU B desire to access memory line X. (X --> CPU A L1)
  2. CPU B issue memory line X access transaction on the bus. (X --> CPU A L1)
  3. All bus agents (CPU cores) have a so called snooping agent that listen all transactions on the bus and check if memory line access to which was requested by transaction is stored in its owner CPU L1 cache. So, CPU A snooping agent detects that CPU A owns the memory line requested by CPU B. (X --> CPU A L1)
  4. CPU A suspend memory access transaction issued by CPU B. (X --> CPU A L1)
  5. CPU A flush the memory line requested by B from its L1 cache. (X --> memory)
  6. CPU A resume previously suspended transaction. (X --> memory)
  7. CPU B fetch memory line X from the memory. (X --> CPU B L1)

Thank to that protocol CPU core always access the actual data in memory, and accesses to the memory are serialized in strict order, one access in time. Cache coherence protocol based sync support rely to the fact, that CPU can easily detect, that the particular memory line was accessed between two time points. During the first memory access to the line X that must open transaction, CPU can mark that memory line in L1 cache must be controlled by snooping agent. In its turn snooping agent can during cache line flush in addition perform check to identify is that line is marked for control, and raise internal flag if controlled line flushed. As result, if CPU will check the internal flag during memory access that close the transaction, it will know is controlled memory line was able to be changed by someone else and conclude is transaction must be accomplished with success or must be considered as failed.This is the way of LL\SC instruction class implementation. This approach more simple that atomic transaction and provides much more flexibility in synchronization, because much more number of different sync primitives can be build on it base in comparison with atomic transactions approach. This approach is more scalable and efficient, because it doesn't block access to the memory for all other parts of the system. And as you can see it solves the ABA problem, because it base on the fact of memory region access detection, but not on value of memory region change detection. Any access to the memory region participating in ongoing transaction will be considered as an transaction fail. And this can be good and bad in the same time, because particular algorithm can be interested only in the value of memory region and doesn't take in the account is that location was accessed by someone in the middle, until that access change the memory. In that case read of memory value in the middle will lead to false negative transaction fail. In addition that approach can lead to huge performance degradation of control flows contenting on the same memory line, because they are able to constantly steel memory line from each other, and by this preventing each other from completion transaction with success. That is really significant problem because in terminal case it can turn system in livelock.Cache coherence protocol based sync support usually used in RISC CPU, because of it simplicity and flexibility. But it must be noted that Intel decided to support such approach for synchronization support in x86 architecture too. At last year Intel announced the Transactional Synchronization Extensions to x86 architecture that will be implemented in Haswell generation of Intel processors. In result, it looks like, the x86 will have most powerful support of synchronization and allow system developers to use advantages of both approaches.


A different and easier way to answer this question may be to list multiprocessor platforms that do NOT support a compare and swap (or a load-link/store-conditional that can be used to write one).

The only one I know of is PARISC, which only has an atomic clear word instruction. This can be used to construct a mutex (provided one aligns the word on a 16 byte boundary). There is no CAS on this archetecture (unlike x86, ia64, ppc, sparc, mips, s390, ...)