Multiprocessor Cache Coherence
The Core Coherence Problem
A memory system is coherent if any read of a data item returns the most recently written value of that data item.
- The coherence problem arises from the dichotomy between a global state (defined by main memory or a shared cache) and local states (defined by private caches, such as L1 and L2).
- Without coherence management, multiple processors can hold and read different, stale values for a single memory location after it has been modified by one processor.
- Memory system behavior is divided into two distinct aspects:
- Coherence: Defines what values can be returned by a read to a specific memory location.
- Consistency: Defines when a written value will be returned by a read, relative to accesses to other memory locations.
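The stale-value hazard described above can be made concrete with a toy model: two processors with private write-through caches and no coherence actions at all. This is a minimal sketch, not any real hardware interface; the class and variable names are illustrative.

```python
# Toy model of two processors with private caches and NO coherence
# protocol: a write by one processor never invalidates or updates
# copies held in the other processor's cache.

memory = {"X": 0}            # shared main memory

class Cache:
    def __init__(self):
        self.lines = {}      # private cache: address -> value

    def read(self, addr):
        if addr not in self.lines:       # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]          # hit: value may be stale!

    def write(self, addr, value):
        self.lines[addr] = value         # write-through to memory,
        memory[addr] = value             # but no invalidation is sent

p1, p2 = Cache(), Cache()

p2.read("X")          # P2 caches X == 0
p1.write("X", 1)      # P1 writes 1; memory is updated, P2 is not
stale = p2.read("X")
print(stale)          # -> 0: P2 still reads the old value
```

Even though memory holds the new value, P2's cache hit returns 0 indefinitely; this is exactly the incoherence that the properties below rule out.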
To guarantee that a read always returns the correct value, the memory system must enforce three strict behavioral properties.
Properties of Coherent Memory
A coherent memory system strictly enforces three properties:
- Uniprocessor Read Coherency: A read by a processor P to location X following a write by P to X must return the value written by P, assuming no intervening writes to X by another processor. This preserves basic sequential program order.
- Multiprocessor Read Coherency: A read by any processor to location X following a write by a different processor to X must return the written value, provided the read and write are sufficiently separated in time and no other writes to X intervene. This ensures a processor cannot read an outdated value indefinitely.
- Multiprocessor Write Serialization: All writes to the same memory location must be serialized. Any two writes to location X by any two processors must be observed in the same order by all processors in the system.
These properties assume a baseline consistency model: a write does not complete until all processors have observed its effect, and the processor does not change the order of its writes with respect to any other memory access.
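The write-serialization property can be stated as a check over observation traces. Below is a minimal sketch, assuming hypothetical per-processor logs of the values each processor observed at one location; the helper name is illustrative, not a standard API.

```python
# Write serialization: all processors must observe writes to a given
# location in the SAME order. Each trace below is a hypothetical log
# of the write values one processor observed at location X.

def writes_serialized(observations):
    """True if every processor saw the writes in the same order."""
    distinct_orders = {tuple(obs) for obs in observations}
    return len(distinct_orders) <= 1

# Both processors saw write A (value 1) before write B (value 2): OK.
print(writes_serialized([[1, 2], [1, 2]]))   # -> True

# P1 saw A then B, but P2 saw B then A: serialization is violated.
print(writes_serialized([[1, 2], [2, 1]]))   # -> False
```

A violation of this check is what hardware arbitration (described in the invalidate protocol below) is designed to prevent.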
Enforcing these theoretical properties allows the cache hierarchy to safely exploit the physical benefits of caching shared data.
Migration and Replication in Caches
Coherent caches natively support two mechanisms that optimize the handling of shared data:
- Migration: Moving a data item into a local private cache.
- Reduces access latency for remotely allocated shared data.
- Decreases bandwidth demand on the shared main memory.
- Improves overall energy efficiency compared to executing off-chip memory accesses.
- Replication: Creating multiple copies of a shared data item across different local caches for simultaneous reading.
- Significantly reduces read latency.
- Prevents access contention at a single shared data source.
To safely maintain migration and replication without violating write serialization and read coherency, the architecture must implement a hardware-level coherence protocol.
Coherence Enforcement Protocols
Hardware protocols maintain coherence through two primary strategies for handling writes to replicated data:
- Write Invalidate Protocol: The writing processor must acquire exclusive access to a data item before writing it.
- Invalidates all other cached copies of the item across the system.
- If two processors attempt to write simultaneously, hardware arbitration selects a single winner, inherently enforcing multiprocessor write serialization.
- Subsequent reads by other processors will miss in their local caches, forcing a fetch of the newly updated value.
- This is the default protocol in virtually all modern multiprocessors because of its bandwidth efficiency.
- Write Update (Write Broadcast) Protocol: The writing processor broadcasts the new data to update all other cached copies simultaneously.
- Requires significantly higher bandwidth compared to invalidation, making it impractical for modern architectures.
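The invalidate strategy can be sketched with a simplified Modified/Shared/Invalid (MSI) state machine. This is an assumption-level model, not a production protocol: it uses write-through for brevity and omits write-back, bus arbitration, and transient race states.

```python
# Simplified write-invalidate sketch with MSI states. Write-through
# keeps memory current so read misses can always fetch from memory.

from enum import Enum

class State(Enum):
    INVALID = 0
    SHARED = 1
    MODIFIED = 2

class Cache:
    def __init__(self, system):
        self.system = system
        self.state = State.INVALID
        self.value = None

    def read(self, memory):
        if self.state is State.INVALID:          # read miss
            self.system.downgrade_others(self)   # Modified -> Shared
            self.value = memory["X"]
            self.state = State.SHARED
        return self.value

    def write(self, value, memory):
        self.system.invalidate_others(self)      # gain exclusive access
        self.state = State.MODIFIED
        self.value = value
        memory["X"] = value                      # write-through, for brevity

class System:
    def __init__(self):
        self.caches = []

    def invalidate_others(self, writer):
        for c in self.caches:
            if c is not writer:
                c.state = State.INVALID          # drop all other copies

    def downgrade_others(self, reader):
        for c in self.caches:
            if c is not reader and c.state is State.MODIFIED:
                c.state = State.SHARED           # owner shares the block

memory = {"X": 0}
system = System()
p1, p2 = Cache(system), Cache(system)
system.caches = [p1, p2]

p2.read(memory)          # P2 caches X == 0 in Shared state
p1.write(1, memory)      # P1 invalidates P2's copy, then writes 1
print(p2.read(memory))   # -> 1: P2 misses and refetches the new value
```

Contrast this with the toy incoherent model earlier: because the write first invalidates P2's copy, P2's next read misses and fetches the current value, satisfying multiprocessor read coherency.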
Implementing an invalidation protocol requires a mechanism that tracks the sharing status and physical location of every cached block.
Tracking Sharing Status: Snooping and Directory Protocols
State tracking utilizes status bits associated with each cache block, analogous to standard uniprocessor valid and dirty bits. Architectures track sharing status using two primary classes of coherence protocols:
- Snooping Protocols: Sharing status is tracked locally by every cache holding a copy of the physical memory block.
- Requires a broadcast medium (e.g., a shared bus) to transmit coherence transactions to every cache.
- Each cache controller “snoops” (monitors) the medium to check if it holds the requested block and applies invalidations locally.
- Directory-Based Protocols: Sharing status for a block of physical memory is maintained in a centralized or distributed location called a directory.
- The directory is typically located at a shared Last Level Cache (LLC) or at the memory interface.
- Distributed directories are utilized to effectively scale multiprocessors beyond a single chip.
- Hybrid Protocols: Modern multicore designs frequently combine both approaches.
- Snooping maintains coherence among local nonshared caches (L1 and L2) within a tightly coupled group or socket.
- A directory at the LLC or memory interface limits broadcast scope by dictating which specific processors must be snooped to maintain global coherence.
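The directory's role in limiting broadcast scope can be sketched with a per-block sharer bit vector. This is a simplified, hypothetical model: a real directory entry also tracks block state, the owning cache, and in-flight transactions.

```python
# Sketch of a directory entry that tracks sharers with a bit vector.
# On a write, only the cores recorded as sharers need to be snooped
# and invalidated, instead of broadcasting to every core.

class DirectoryEntry:
    def __init__(self, num_cores):
        self.sharers = [False] * num_cores   # which cores hold a copy

    def record_read(self, core):
        self.sharers[core] = True            # core fetched a shared copy

    def handle_write(self, writer):
        """Return the cores whose copies must be invalidated."""
        targets = [c for c, has_copy in enumerate(self.sharers)
                   if has_copy and c != writer]
        for c in targets:
            self.sharers[c] = False          # their copies are dropped
        self.sharers[writer] = True          # writer is now the sole holder
        return targets

entry = DirectoryEntry(num_cores=4)
entry.record_read(0)
entry.record_read(2)
print(entry.handle_write(3))   # -> [0, 2]: only these cores are snooped
```

With four cores, the write by core 3 generates invalidations for cores 0 and 2 only, which is the bandwidth saving that makes the hybrid snooping-plus-directory organization scale.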