Synchronization

Synchronization mechanisms are constructed from user-level software routines that rely on hardware-supplied synchronization instructions. The fundamental capability required is an uninterruptible instruction, or sequence of instructions, that can atomically read and modify a memory value. These primitives are used to build mutual exclusion operations such as lock and unlock.

Implementing these software-level synchronization routines requires hardware-level atomic operations that can safely update memory shared among multiple processors.
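As a high-level illustration, a lock/unlock pair can be sketched in C11 using <stdatomic.h>, whose atomic_flag test-and-set maps onto exactly this kind of hardware atomic read-modify-write. This is a minimal sketch, not the text's code; the names lock_var, lock, and unlock are illustrative.

#include <stdatomic.h>

static atomic_flag lock_var = ATOMIC_FLAG_INIT;

/* Spin until test_and_set returns false, i.e., the flag was clear
   and we are the processor that set it. */
void lock(void)   { while (atomic_flag_test_and_set(&lock_var)) { } }

/* Clearing the flag releases the lock. */
void unlock(void) { atomic_flag_clear(&lock_var); }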

Hardware Primitives

Hardware synchronization primitives must execute a memory read and a memory write as a single, indivisible operation. Without hardware support, the cost of constructing basic synchronization primitives scales poorly as processor counts increase.

  • Atomic Exchange (Swap): Interchanges a value in a register with a value in memory.
    • Creates a simple mutual exclusion lock where 0 indicates a free state and 1 indicates an unavailable state.
    • A processor attempts to acquire the lock by executing an atomic exchange with a register containing 1.
    • If the exchange returns 1, another processor holds the lock; if it returns 0, the lock was free and is now successfully set to 1.
    • Simultaneous exchange attempts are strictly ordered by write serialization, breaking the race condition and ensuring exactly one processor receives the 0.
  • Test-and-Set: Tests a memory value and updates that value only if the test condition is met (e.g., test for 0, set to 1).
  • Fetch-and-Increment: Returns the current value of a memory location and atomically increments it. (All three primitives are illustrated in the C sketch after this list.)
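All three primitives have direct counterparts in C11's <stdatomic.h>. The sketch below assumes that mapping; test-and-set is expressed with a compare-exchange, and the variable names are illustrative.

#include <stdatomic.h>
#include <stdio.h>

int main(void) {
    /* Atomic exchange: swap in 1 and get the old value back. */
    atomic_int lock = 0;
    int old = atomic_exchange(&lock, 1);       /* old == 0: lock was free */

    /* Test-and-set via compare-exchange: store 1 only if the value is 0. */
    atomic_int lock2 = 0;
    int expected = 0;
    _Bool acquired = atomic_compare_exchange_strong(&lock2, &expected, 1);

    /* Fetch-and-increment: return the current value, then add 1. */
    atomic_int counter = 0;
    int prev = atomic_fetch_add(&counter, 1);  /* prev == 0, counter == 1 */

    printf("%d %d %d\n", old, (int)acquired, prev);
    return 0;
}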

Implementing monolithic atomic operations is complex because the hardware must prevent any intervening operations between the read and the write without causing system deadlock.

LR/SC

RISC-V implements atomicity using two instructions rather than a single atomic swap.

  • lr (load reserved): Loads a value from the address in rs1 into rd, and records that address in a hidden hardware register called the reserved register.
  • sc (store conditional): Stores the value in rs2 to the address in rs1, then checks whether that address still matches the reserved register. If it matches, the store succeeds and writes 0 to rd. If not, it fails and writes nonzero to rd.

The reserved register is cleared by any of the following, causing the next sc to fail:

  • Another processor writes to the reserved address (the write invalidates the cache block holding it).
  • A cache block invalidation for that address arrives from elsewhere.
  • A context switch or interrupt occurs.

Only register-register instructions may safely appear between lr and sc; a memory access, branch, or OS call in between risks deadlock or repeated spurious failures. The sequence should be kept short to minimize the chance of losing the reservation to unrelated events.

The following atomically exchanges x4 with the memory location addressed by x1:

try: mov  x3, x4        ; copy exchange value
     lr   x2, 0(x1)     ; load reserved from x1
     sc   x3, 0(x1), x5 ; store x3; x5 = 0 if success
     bnez x5, try       ; retry if sc failed
     mov  x4, x2        ; put old memory value into x4

If any processor writes to the location addressed by x1 between the lr and the sc, the sc fails, x5 is nonzero, and the loop retries. Write serialization guarantees that only one processor wins the race.
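The same retry pattern generalizes beyond exchange. As a sketch (fetch_and_increment is an illustrative name), fetch-and-increment can be written with C11's weak compare-exchange, which compilers targeting RISC-V typically lower to an lr/sc pair, so a lost reservation simply repeats the loop:

#include <stdatomic.h>

int fetch_and_increment(atomic_int *p) {
    int old = atomic_load_explicit(p, memory_order_relaxed);
    /* On failure, old is refreshed with the current value and we retry,
       just as a failed sc branches back to the lr. */
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + 1,
               memory_order_acq_rel, memory_order_relaxed)) { }
    return old;
}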

Spin Locks

A spin lock is a lock that a processor repeatedly attempts to acquire in a tight loop until it succeeds. Spin locks work best when the lock is held for a very short duration, so spinning wastes fewer cycles than suspending and resuming the waiting thread would. Assume an atomic EXCH rd, 0(rs) instruction exists that atomically swaps rd with the memory word addressed by rs, returning the old value in rd.

A naive implementation loops directly on EXCH, retrying immediately on failure:

lockit: addi x2, x0, 1    ; x2 = 1 (locked value)
        EXCH x2, 0(x1)    ; atomic swap; x2 = old lock value
        bnez x2, lockit   ; if old value was 1, still locked — retry

Every EXCH requests exclusive access to the cache block, generating a write miss and an invalidation broadcast to all other processors. Under contention, this floods the interconnect with coherence traffic even though the lock is still held and no progress is possible.

The fix is to spin on a plain read first, which hits the local cache with no bus traffic. Only when the lock looks free does the processor attempt EXCH:

lockit: ld   x2, 0(x1)    ; read lock — cached, no bus traffic
        bnez x2, lockit   ; spin locally while lock = 1
        addi x2, x0, 1    ; x2 = 1
        EXCH x2, 0(x1)    ; attempt to claim lock
        bnez x2, lockit   ; another processor won the race — retry

The first branch spins on a cached read with no bus traffic. The second branch handles the race when multiple processors see the lock free simultaneously — only one EXCH returns 0.
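The same test-and-test-and-set structure can be sketched in C11 (spinlock_t, spin_lock, and spin_unlock are illustrative names): the relaxed load spins in the local cache, and only the exchange generates coherence traffic.

#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

void spin_lock(spinlock_t *l) {
    for (;;) {
        /* First branch: spin on a plain read that hits the local cache. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0) { }
        /* Second branch: attempt the swap; an old value of 0 means we won. */
        if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
            return;
    }
}

void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}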

The LR/SC equivalent eliminates the need for a separate read instruction: lr itself performs the plain read, hitting in the local cache with no bus traffic while the lock is held, so the two branches map naturally:

lockit: lr   x2, 0(x1)     ; load reserved — no bus traffic
        bnez x2, lockit    ; spin locally while lock = 1
        addi x2, x0, 1     ; x2 = 1 (locked value)
        sc   x2, 0(x1), x3 ; attempt to claim lock
        bnez x3, lockit    ; sc failed — another processor won, retry

The table below traces cache coherence steps for three processors contending on a spin lock under write-invalidate coherence. P0 starts holding the lock (exclusive, value = 1). P0 releases it at step 2; P1 and P2 race to claim it in steps 3–5; P2 wins and enters the critical section, while P1 fails and returns to spin waiting. In a real system each step spans many more clock cycles due to bus arbitration and miss penalties.

| Step | P0 | P1 | P2 | Coherence state | Bus/directory activity |
|------|----|----|----|-----------------|------------------------|
| 1 | Has lock | Begins spin, testing if lock = 0 | Begins spin, testing if lock = 0 | Shared | Cache misses for P1 and P2 satisfied in either order; lock state becomes shared |
| 2 | Sets lock to 0 | (Invalidate received) | (Invalidate received) | Exclusive (P0) | Write invalidate of lock variable from P0 |
| 3 | | Cache miss | Cache miss | Shared | Bus/directory services P2 cache miss; write-back from P0; state shared |
| 4 | | (Waits while bus/directory busy) | Lock = 0 test succeeds | Shared | Cache miss for P2 satisfied |
| 5 | | Lock = 0 | Executes swap, gets cache miss | Shared | Cache miss for P1 satisfied |
| 6 | | Executes swap, gets cache miss | Completes swap: returns 0 and sets lock = 1 | Exclusive (P2) | Bus/directory services P2 cache miss; generates invalidate; lock is exclusive |
| 7 | | Swap completes and returns 1, sets lock = 1 | Enters critical section | Exclusive (P1) | Bus/directory services P1 cache miss; sends invalidate and generates write-back from P2 |
| 8 | | Spins, testing if lock = 0 | | | None |