Synchronization: The Basics

Synchronization mechanisms are constructed using user-level software routines that rely on hardware-supplied synchronization instructions. The fundamental capability required is an uninterruptible instruction, or sequence of instructions, that can atomically retrieve and modify a memory value. These mechanisms are used to build mutual exclusion operations, such as lock and unlock. Under high contention, synchronization frequently becomes a performance bottleneck because of the latency and communication delays inherent in multiprocessor systems.

The implementation of these software-level synchronization routines necessitates specific hardware-level atomic operations to safely manage shared memory across multiple processors.

Basic Hardware Primitives

Hardware synchronization primitives must execute a memory read and a memory write as a single, indivisible operation. Without hardware support, the cost of constructing basic synchronization primitives scales poorly as processor counts increase.

  • Atomic Exchange (Swap): Interchanges a value in a register with a value in memory.
    • Creates a simple mutual exclusion lock where 0 indicates a free state and 1 indicates an unavailable state.
    • A processor attempts to acquire the lock by executing an atomic exchange with a register containing 1.
    • If the exchange returns 1, another processor holds the lock; if it returns 0, the lock was free and is now successfully set to 1.
    • Simultaneous exchange attempts are strictly ordered by write serialization, breaking the race condition and ensuring only one processor receives the 0.
  • Test-and-Set: Tests a memory value and updates that value only if the test condition is met (e.g., test for 0, set to 1).
  • Fetch-and-Increment: Returns the current value of a memory location and atomically increments it.
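
These three primitives map directly onto C11's `<stdatomic.h>` operations. The sketch below is illustrative (function names are not from the text): each wrapper performs the corresponding read-modify-write as a single atomic operation.

```c
#include <stdatomic.h>

/* Atomic exchange (swap): store new_val into *loc, return the old value. */
static int exchange(atomic_int *loc, int new_val) {
    return atomic_exchange(loc, new_val);
}

/* Test-and-set: atomically set the flag, returning its previous state
 * (0 meant clear/free, 1 meant already set). */
static int test_and_set(atomic_flag *f) {
    return atomic_flag_test_and_set(f);
}

/* Fetch-and-increment: return the current value and atomically add 1. */
static int fetch_and_increment(atomic_int *loc) {
    return atomic_fetch_add(loc, 1);
}
```

On a single thread these behave like ordinary read-modify-write sequences; their value is that the hardware guarantees no other processor's access can intervene between the read and the write.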

Implementing monolithic atomic operations is complex because the hardware must prevent any intervening operations between the read and the write without causing system deadlock.

To resolve the cache coherence complexities of monolithic atomic memory operations, modern architectures often decompose atomicity into a verifiable two-instruction sequence.

Load Reserved and Store Conditional (LR/SC)

Architectures such as MIPS and RISC-V implement effective atomicity using a pair of instructions that verify no other processor modified the memory location between the read and the write. If the sequence appears to execute atomically, the write succeeds; otherwise, it fails.

  • Load Reserved (lr): Loads the initial value from memory and tracks the target address in a hardware register, often called the link or reserved register.
  • Store Conditional (sc): Attempts to write a new value to the tracked memory address.
    • Returns 0 (typically to a specified register) if the operation succeeds.
    • Returns a non-zero value if the operation fails.
  • Failure Conditions for sc:
    • An intervening modification to the target memory location by another processor.
    • A context switch or interrupt event that clears the link register.
    • A cache block invalidation corresponding to the tracked address.

Software implements synchronization by wrapping the lr and sc instructions in a loop, continually retrying the sequence if the sc instruction returns a failure code. The instruction sequence between lr and sc must be strictly limited to register-register instructions to prevent persistent sc failures and deadlock. Memory accesses, branches, or operating system calls are strictly prohibited within the sequence.
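
Portable C cannot express lr/sc directly, but C11's weak compare-and-swap is a close analogue: like sc, `atomic_compare_exchange_weak` is permitted to fail spuriously, and on RISC-V it typically compiles down to exactly this kind of lr/sc loop. The sketch below (an assumption-laden illustration, not the text's code) shows the retry structure by building fetch-and-increment from it.

```c
#include <stdatomic.h>

/* Fetch-and-increment built as a retry loop, mirroring the lr/sc idiom:
 * read the value, compute the update with register-only work, and retry
 * whenever the conditional store (the weak CAS here) fails. */
static int fai_via_cas(atomic_int *loc) {
    int old = atomic_load(loc);          /* plays the role of lr */
    /* The weak CAS may fail spuriously, just as sc may; on failure it
     * reloads the current value into `old` and the loop simply retries. */
    while (!atomic_compare_exchange_weak(loc, &old, old + 1))
        ;                                /* plays the role of sc + branch */
    return old;                          /* value observed before the add */
}
```

The key property is shared with lr/sc: the update is applied only if no other processor modified the location between the read and the write.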

These atomic hardware primitives directly enable the implementation of continuous polling mechanisms, which processors use to acquire exclusive access to shared resources.

Implementing Locks Using Coherence

Spin locks are synchronization structures where a processor continuously executes a tight loop (spins) attempting to acquire a lock. They are optimal in environments where the lock is held for a very short duration and low-latency acquisition is required when the lock frees.

  • Uncached Spin Locks: Processors continually execute atomic exchanges directly to main memory. This generates excessive interconnect traffic and limits scalability.
  • Cached Spin Locks: Utilize the cache coherence mechanism to maintain the lock value locally.
    • Allows the processor to spin on a local cached copy, drastically reducing global memory access latency and bandwidth consumption.
    • Capitalizes on temporal locality, as the processor that last held the lock is likely to acquire it again.
  • The Read-Spin Optimization: Naively looping an atomic exchange on a cached lock generates continuous write misses and invalidation traffic, because each atomic exchange requests exclusive access.
    • Instead, processors spin using standard read instructions on the local cached copy until the value indicates the lock is free (0).
    • Upon reading a free state, the processor attempts the atomic exchange.
    • The first processor to successfully arbitrate the bus executes the swap, updates the value to 1, and generates an invalidation message.
    • Competing processors receive a cache miss on their pending swap, fetch the newly updated value (1), fail the lock acquisition, and return to the read-only spin loop.
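
The read-spin protocol above is commonly known as test-and-test-and-set. A minimal C11 sketch (type and function names are illustrative):

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

static void spin_lock(spinlock_t *l) {
    for (;;) {
        /* Phase 1: spin with plain reads on the locally cached copy;
         * this generates no invalidation traffic while the lock is held. */
        while (atomic_load(&l->locked) != 0)
            ;
        /* Phase 2: lock looks free, so attempt the atomic exchange.
         * Only the winner sees the old value 0; losers observe 1 and
         * fall back to read-only spinning. */
        if (atomic_exchange(&l->locked, 1) == 0)
            return;
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0);   /* release; invalidates waiters' copies */
}
```

On a coherent multiprocessor, the release invalidates every waiter's cached copy, producing one burst of misses followed by a single successful exchange, rather than continuous invalidation traffic.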

The optimized read-spin protocol functions effectively with atomic exchanges, but the distinct read and write phases of the LR/SC primitives streamline local cache spinning even further.

LR/SC in Spin Locks

The explicit separation of read and write actions in the lr/sc sequence eliminates the need for an independent read-only loop before attempting an atomic operation.

  • The lr instruction executes without generating bus invalidation traffic, functioning natively as the initial local read check.
  • If the loaded value indicates the lock is unavailable (1), the code immediately branches back to the lr instruction to continue spinning locally.
  • If the value is available (0), the sc instruction executes to claim the lock; if the sc fails due to a race condition with another processor, the code branches back to the lr spin loop.
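
In C11 this same structure falls out of a plain load followed by a weak compare-and-swap, which on RISC-V typically lowers to exactly the lr/branch/sc loop described above. A hedged sketch (names are illustrative, not the book's code):

```c
#include <stdatomic.h>

static void lrsc_style_lock(atomic_int *lock) {      /* 0 = free, 1 = held */
    for (;;) {
        int v = atomic_load(lock);   /* lr: local read, no invalidations  */
        if (v != 0)
            continue;                /* lock held: keep spinning on the lr */
        /* Lock observed free: attempt the conditional store. A weak CAS,
         * like sc, may fail if another processor raced us or the
         * reservation was lost; on failure, branch back and retry. */
        if (atomic_compare_exchange_weak(lock, &v, 1))
            return;                  /* sc succeeded: lock acquired */
    }
}

static void lrsc_style_unlock(atomic_int *lock) {
    atomic_store(lock, 0);
}
```

Because the initial load spins without requesting exclusive access, waiters stay in their local caches until the holder's release invalidates their copies, at which point at most one sc-style store succeeds.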