Cache Coherence

The Coherence Problem

A memory system is coherent if any read of a data item returns the most recently written value of that data item.

  • The coherence problem arises from the dichotomy between a global state (defined by main memory or a shared cache) and local states (defined by private caches, such as L1 and L2).
  • Without coherence management, multiple processors can hold and read different, stale values for a single memory location after it has been modified by one processor.
| Time | Event                       | Cache A | Cache B   | Memory X |
|------|-----------------------------|---------|-----------|----------|
| 0    |                             |         |           | 1        |
| 1    | Processor A reads X         | 1       |           | 1        |
| 2    | Processor B reads X         | 1       | 1         | 1        |
| 3    | Processor A stores 0 into X | 0       | 1 (stale) | 0        |
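The sequence above is easy to reproduce in a toy model. The following sketch (all names hypothetical) simulates two private write-through caches with no invalidation mechanism, so processor B's cached copy of X goes stale at time 3:

```cpp
#include <cstdio>
#include <map>

// Toy model: write-through private caches with NO coherence actions.
struct Cache {
    std::map<int, int> lines;                         // address -> cached value
    int read(int addr, int& memory) {
        if (!lines.count(addr)) lines[addr] = memory; // miss: fill from memory
        return lines[addr];                           // hit: may be stale!
    }
    void write(int addr, int value, int& memory) {
        lines[addr] = value;                          // update local copy
        memory = value;                               // write through to memory
    }
};

int main() {
    int memX = 1;                                     // time 0: X = 1 in memory
    Cache a, b;
    a.read(0, memX);                                  // time 1: A reads X -> 1
    b.read(0, memX);                                  // time 2: B reads X -> 1
    a.write(0, 0, memX);                              // time 3: A stores 0 into X
    // B still sees the stale value: nothing invalidated its copy.
    std::printf("B reads X = %d (memory holds %d)\n", b.read(0, memX), memX);
    // Prints: B reads X = 1 (memory holds 0)
}
```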
  • Memory system behavior is divided into two distinct aspects:
    • Coherence: Defines what values can be returned by a read to a specific memory location.
    • Consistency: Defines when a written value will be returned by a read, relative to accesses to other memory locations.

To guarantee that a read always returns the correct value, the memory system must enforce three strict behavioral properties.

Coherence Properties

A coherent memory system strictly enforces three properties:

  • Uniprocessor Read Coherency: A read by processor P to location X following a write by P to X must return the value written by P, assuming no intervening writes to X by other processors. This preserves basic sequential program order.
  • Multiprocessor Read Coherency: A read by any processor to location X following a write by a different processor to X must return the written value, provided the read and write are sufficiently separated in time and no other writes to X intervene. This ensures a processor cannot read an outdated value indefinitely.
  • Multiprocessor Write Serialization: All writes to the same memory location must be serialized. Any two writes to location X by any two processors must be observed in the same sequence by all processors in the system.

The baseline consistency assumption for these properties requires that a write does not complete until all processors have observed its effect, and the processor finishes writes in program order without reordering them relative to other memory accesses.

Enforcing Coherence

Coherent caches natively support two mechanisms that optimize the handling of shared data:

  • Migration: Moving a data item into a local private cache.
    • Reduces access latency for remotely allocated shared data.
    • Decreases bandwidth demand on the shared main memory.
    • Improves overall energy efficiency compared to executing off-chip memory accesses.
  • Replication: Creating multiple copies of a shared data item across different local caches for simultaneous reading.
    • Significantly reduces read latency.
    • Prevents access contention at a single shared data source.

Hardware protocols maintain coherence through two primary strategies for handling writes to replicated data:

  • Write Invalidate Protocol: The writing processor must acquire exclusive access to a data item before writing it.
    • Invalidates all other cached copies of the item across the system.
    • If two processors attempt to write simultaneously, hardware arbitration selects a single winner, inherently enforcing multiprocessor write serialization.
    • Subsequent reads by other processors will miss in their local caches, forcing a fetch of the newly updated value.
    • This is the default protocol used in virtually all modern multiprocessors due to bandwidth efficiency.
| Processor activity        | Bus activity       | Cache A | Cache B | Memory X |
|---------------------------|--------------------|---------|---------|----------|
|                           |                    |         |         | 0        |
| Processor A reads X       | Cache miss for X   | 0       |         | 0        |
| Processor B reads X       | Cache miss for X   | 0       | 0       | 0        |
| Processor A writes 1 to X | Invalidation for X | 1       | invalid | 0        |
| Processor B reads X       | Cache miss for X   | 1       | 1       | 1        |

When A writes, it broadcasts an invalidation on the bus, and B's copy is invalidated. On B's subsequent read, it takes a cache miss; A responds with the updated value (canceling the memory response), and memory is updated to 1. Updating memory when the block becomes shared simplifies the protocol; alternatively, ownership bits can defer the write-back until the block is replaced. If the multicore uses a shared LLC, coherence applies only to the private L1/L2 caches, and the LLC acts as the memory in this example, which requires the LLC to be inclusive.
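The invalidate-on-write step itself is simple; the key point is that the broadcast happens before the write becomes visible. A minimal sketch, assuming an abstract bus whose arbitration has already selected this writer (all identifiers are illustrative):

```cpp
#include <vector>

enum class LineState { Invalid, Valid };

struct CacheLine { LineState state = LineState::Invalid; int value = 0; };

struct Core {
    CacheLine line;                                  // one line, for block X
    void snoopInvalidate() { line.state = LineState::Invalid; }
};

// The writer must first win bus arbitration (not modeled), then broadcast
// the invalidation before its write is visible: this serializes writes.
void writeX(Core& writer, std::vector<Core*>& others, int newValue) {
    for (Core* c : others) c->snoopInvalidate();     // invalidate all copies
    writer.line = {LineState::Valid, newValue};      // now safe to write locally
}
```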

  • Write Update (Write Broadcast) Protocol: The writing processor broadcasts the new data to update all other cached copies simultaneously.
    • Requires significantly higher bandwidth compared to invalidation, making it impractical for modern architectures.

Executing an invalidation protocol requires a mechanism to actively track the sharing status and physical location of every cache block.

Tracking Sharing Status

State tracking utilizes status bits associated with each cache block, analogous to standard uniprocessor valid and dirty bits. Architectures track sharing status using two primary classes of coherence protocols:

  • Snooping Protocols: Sharing status is tracked locally by every cache holding a copy of the physical memory block.
    • Requires a broadcast medium (e.g., a shared bus) to transmit coherence transactions to every cache.
    • Each cache controller “snoops” (monitors) the medium to check if it holds the requested block and applies invalidations locally.
  • Directory-Based Protocols: Sharing status for a block of physical memory is maintained in a centralized or distributed location called a directory.
    • The directory is typically located at a shared Last Level Cache (LLC) or at the memory interface.
    • Distributed directories are utilized to effectively scale multiprocessors beyond a single chip.
  • Hybrid Protocols: Modern multicore designs frequently combine both approaches.
    • Snooping maintains coherence among local nonshared caches (L1 and L2) within a tightly coupled group or socket.
    • A directory at the LLC or memory interface limits broadcast scope by dictating which specific processors must be snooped to maintain global coherence.

Snooping

Snooping maintains cache coherence within a single multiprocessor chip or a small group of processors connected by a broadcast medium, such as a bus.

  • Invalidation Process:
    • To write to a shared block, a processor acquires access to the broadcast medium and broadcasts the target address.
    • All other processors continuously snoop the medium, compare the broadcast address against their local cache tags, and invalidate the block if a match is found.
  • Write Serialization:
    • Simultaneous writes to the same block arbitrate for access to the broadcast medium; the winner executes its broadcast and invalidates competing copies.
    • The losing processor must fetch a new, updated copy of the data before it can complete its write.
    • Writes to shared data fundamentally stall and cannot complete until broadcast medium access is secured.
  • Locating Data:
    • In systems with write-through caches, the most recent value is always retrievable from main memory.
    • In systems with write-back caches, the most recent value may reside in a private cache; the processor holding the modified (dirty) copy detects the read miss on the broadcast medium, provides the block to the requester, and forces the main memory access to abort.
    • Because write-back caches significantly reduce global memory bandwidth demands, they are standard across multicore architectures for L2 and L3 caches.

The diagram has three states — Invalid, Shared (read only), Exclusive (read/write) — and two sets of transitions: left side driven by the local CPU, right side driven by bus transactions observed while snooping.

CPU-side transitions:

| From      | Stimulus                     | To        | Bus action                                |
|-----------|------------------------------|-----------|-------------------------------------------|
| Invalid   | CPU read miss                | Shared    | Place read miss on bus                    |
| Invalid   | CPU write miss               | Exclusive | Place write miss on bus                   |
| Shared    | CPU write                    | Exclusive | Place invalidate on bus                   |
| Exclusive | CPU read miss (replacement)  | Shared    | Write back block; place read miss on bus  |
| Exclusive | CPU write miss (replacement) | Exclusive | Write back block; place write miss on bus |

CPU read/write hits in Shared and Exclusive require no bus action and no state change.

Bus-side transitions: Whenever a bus transaction occurs, every private cache holding the specified block takes the action below:

| From      | Bus stimulus                            | To      | Action                                  |
|-----------|-----------------------------------------|---------|-----------------------------------------|
| Shared    | Invalidate or write miss for this block | Invalid | Discard local copy                      |
| Exclusive | Read miss for this block                | Shared  | Write back block; abort memory response |
| Exclusive | Write miss for this block               | Invalid | Write back block; discard               |

In a multicore with a shared inclusive LLC, the LLC acts as memory in this protocol and the bus is the interconnect between private L1/L2 caches and the LLC. Inclusivity is required so the LLC always knows which private caches hold a given block.
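The two tables above fold into one per-line finite-state controller. A minimal sketch, keeping one state per cache line and leaving the bus and data movement abstract (all identifiers hypothetical):

```cpp
enum class State { Invalid, Shared, Exclusive };
enum class CpuOp { ReadMiss, WriteMiss, WriteHit };
enum class BusMsg { ReadMiss, WriteMiss, Invalidate };

// CPU-side transitions (left half of the diagram).
State onCpu(State s, CpuOp op) {
    switch (op) {
    case CpuOp::ReadMiss:   // place read miss on bus; write back first if Exclusive
        return State::Shared;
    case CpuOp::WriteMiss:  // place write miss on bus; write back first if Exclusive
        return State::Exclusive;
    case CpuOp::WriteHit:   // Shared -> Exclusive requires an invalidate broadcast
        return State::Exclusive;
    }
    return s;
}

// Bus-side transitions (right half), applied by every snooping cache
// that holds the addressed block.
State onBus(State s, BusMsg m) {
    if (s == State::Shared &&
        (m == BusMsg::Invalidate || m == BusMsg::WriteMiss))
        return State::Invalid;            // discard local copy
    if (s == State::Exclusive && m == BusMsg::ReadMiss)
        return State::Shared;             // write back block, abort memory response
    if (s == State::Exclusive && m == BusMsg::WriteMiss)
        return State::Invalid;            // write back block, then discard
    return s;                             // hits and irrelevant traffic: no change
}
```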

MSI Protocol

Coherence is managed by a finite-state controller in each core that responds to both local processor operations and external broadcast requests. The baseline protocol utilizes three states (MSI):

  • Modified (M): The cache block has been updated in the private cache and is exclusive to that specific cache.
  • Shared (S): The cache block is unmodified, is up-to-date in main memory, and is potentially cached by multiple processors.
  • Invalid (I): The cache block does not contain valid data.
  • State Transitions:
    • Any valid memory block exists either in the S state across one or more caches or in the M state in exactly one cache.
    • A local write operation requires transitioning the block to the M state, which mandates placing an invalidate or write miss on the broadcast medium to force all remote caches to transition their matching blocks to the I state.
    • If a remote read miss is snooped for a block currently in the M state, the owning cache writes back the data to memory and downgrades its local state to S.
  • Tag Management:
    • Every broadcast transaction requires checking cache-address tags, which can interfere with the processor’s own cache accesses.
    • Hardware often duplicates cache tags to allow background snoop accesses without halting processor execution.

Protocol Extensions

To optimize network traffic and maintain physical correctness, the basic MSI protocol is augmented with additional states and serialization guarantees.

  • State Extensions:
    • MESI Protocol: Adds an Exclusive (E) state denoting a block that is resident in only a single cache but remains unmodified.
      • A block in the E state can be written to (transitioning to M) without generating an invalidation broadcast, heavily optimizing data that is read and then subsequently written by the same processor.
      • If another processor reads the block, the state downgrades to S.
      • Intel architectures utilize a variant called MESIF, which adds a Forward (F) state to designate exactly which sharing processor is responsible for responding to a miss.
    • MOESI Protocol: Adds an Owned (O) state to the MESI protocol.
      • Allows a block to transition from M to O without writing back to main memory, indicating that the cache is the owner and the main memory copy is stale.
      • Other caches reading the block receive it in the S state, and the owner (in the O state) is responsible for supplying the value on a miss and writing it back upon replacement.
  • Atomicity and Serializability Constraints:
    • Serializability: All coherence events must pass through a single common serialization point (such as a shared bus or shared cache) to ensure every processor observes events in the exact same order.
    • Atomicity: Coherence events span multiple clock cycles but must logically appear atomic to the system.
      • Achieved by introducing transient (hidden) states within the cache controller while it waits for a memory or bus response.
      • If a cache block in a transient state receives an external write miss or invalidate, it immediately reverts to the I state, forcing the local processor to retry the operation upon completion of the current interference.
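The payoff of the E state is easiest to see in code. A sketch of the write-hit path under MESI, with bus interactions and transient states omitted (identifiers are illustrative):

```cpp
enum class Mesi { Modified, Exclusive, Shared, Invalid };

// Key MESI optimization: a write hit in E upgrades silently to M, because
// no other cache can hold a copy, so no broadcast is needed.
Mesi onWrite(Mesi s, bool& needBroadcast) {
    switch (s) {
    case Mesi::Modified:  needBroadcast = false; return Mesi::Modified;
    case Mesi::Exclusive: needBroadcast = false; return Mesi::Modified; // silent upgrade
    case Mesi::Shared:    needBroadcast = true;  return Mesi::Modified; // invalidate sharers
    case Mesi::Invalid:   needBroadcast = true;  return Mesi::Modified; // write miss on bus
    }
    return s;
}
```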

Scalability Limits

As multiprocessor systems grow, centralized broadcast resources and the snoop bandwidth at individual caches become severe physical bottlenecks.

  • Snoop Bandwidth Contention:
    • In a pure snooping system, every cache must process every miss in the system, threatening to consume all available cache bandwidth.
    • For a theoretical 16-processor system (4.0 GHz clock, CPI of 0.5, 40% load/store frequency, 15-cycle L2 request), keeping coherence traffic below 50% of total L2 bandwidth requires the coherence miss rate to remain below roughly 0.3% (a back-of-envelope derivation follows this list).
  • Bandwidth Scaling Techniques:
    • Inclusive Shared LLC: A shared Last Level Cache (LLC) acts as a primary snoop filter. Snoops interrogate the LLC first; L2 caches are only snooped if the LLC confirms a hit, isolating private caches from irrelevant coherence traffic.
    • Point-to-Point Snooping: Architectures like AMD Opteron replace the shared bus with point-to-point links. They broadcast to connected chips and use explicit acknowledgment messages to determine when an invalidation has physically completed across the network.
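For the 16-processor bound quoted above, one back-of-envelope derivation, assuming every coherence miss is broadcast to the 15 other caches and each snoop occupies an L2 for the full 15 cycles:

```latex
\begin{align*}
\text{refs/s per core} &= \frac{4.0\,\text{GHz}}{0.5\ \text{CPI}} \times 0.4
                        = 3.2\times10^{9} \\
\text{snoop capacity per L2} &= \frac{4.0\times10^{9}\ \text{cycles/s}}{15\ \text{cycles/snoop}}
                        \approx 2.67\times10^{8} \\
15 \times 3.2\times10^{9} \times m &\le 0.5 \times 2.67\times10^{8}
\quad\Rightarrow\quad m \lesssim 0.28\%
\end{align*}
```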

Cache Performance

Overall cache performance in a multiprocessor is an amalgamation of standard uniprocessor misses (compulsory, capacity, conflict) and coherence misses.

  • Classification of Coherence Misses:
    • True Sharing Misses: Occur when a processor writes to a shared block (invalidating it globally), and another processor subsequently misses when attempting to read that newly updated word.
    • False Sharing Misses: Occur when a processor invalidates a block by writing to one word, causing a miss for another processor that is reading a completely different word within the same cache block. This miss is purely an artifact of the block size and would not occur if the block size were a single word (see the code sketch after this list).
  • Workload Scaling Dynamics (OLTP Workload Example):
    • Impact of Cache Size: Increasing the L3 cache size systematically eliminates uniprocessor capacity and conflict misses. However, true and false sharing misses remain completely static regardless of cache size, causing coherence misses to dominate overall memory stall time in large caches (e.g., 4–8 MiB).
    • Impact of Block Size:
      • Increasing block size (e.g., from 32 bytes to 256 bytes) effectively reduces compulsory and capacity/conflict misses due to increased spatial locality.
      • However, false sharing misses nearly double with larger block sizes, as a larger block encompasses more independent data structures, greatly increasing the probability of false invalidations.
      • While larger blocks lower the absolute miss rate, they significantly amplify total data traffic to memory, creating contention that can easily negate the performance gains achieved by the lowered miss rate.
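False sharing is easy to provoke deliberately. In the sketch below (a minimal demo, assuming 64-byte cache blocks), two threads increment logically independent counters; when the counters share a block, every increment invalidates the other core's copy:

```cpp
#include <atomic>
#include <thread>

// False sharing: two counters in the same cache block. Each writer
// invalidates the other's copy even though the data is disjoint.
struct SharedBlock {
    std::atomic<long> a{0};   // written only by thread 1
    std::atomic<long> b{0};   // written only by thread 2, same block as 'a'
};

// Fix: align each counter to its own cache block (64 B is an assumption;
// prefer std::hardware_destructive_interference_size where available).
struct PaddedBlock {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Block>
void hammer(Block& blk) {
    std::thread t1([&] { for (int i = 0; i < 1'000'000; ++i) blk.a++; });
    std::thread t2([&] { for (int i = 0; i < 1'000'000; ++i) blk.b++; });
    t1.join(); t2.join();
}

int main() {
    SharedBlock s; hammer(s);   // coherence misses ping-pong the block
    PaddedBlock p; hammer(p);   // typically much faster: no false sharing
}
```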

Directories

Snooping protocols require broadcast communication on every cache miss, which limits scalability in distributed-memory multiprocessors. Directory protocols resolve this by maintaining the sharing status of every cached physical memory block in a single, known location called a directory.

  • Single Multicore (Inclusive LLC): The directory is implemented as a bit vector per Last Level Cache (LLC) block, indicating which private L2 caches hold copies.
  • Single Multicore (Non-inclusive LLC): Directory information is maintained in a separate hardware structure or via a copied set of the L2 tags.
  • Multichip NUMA Systems: Directories are distributed alongside physical memory, ensuring that different coherence requests route to different memory interfaces.

Each node is a multicore chip with its own portion of memory and an associated directory. The directory tracks which caches across all nodes share each memory block in that node’s memory. Directory information may reside on or off the chip. Both directory maintenance and coherence actions within a node are handled by the coherence mechanism.

Directory Protocol Basics

The directory tracks both the state of each block and the specific nodes holding copies to eliminate broadcast requirements.

  • Node Classifications:
    • Local node: The node originating the memory request.
    • Home node: The node housing the physical memory location and its corresponding directory entry.
    • Remote node: A node currently holding a cached copy of the block, either in a shared or exclusive state.
  • Message Types:
    • Requests (Local Home): Read miss, Write miss.
    • Data responses (Home Local, Remote Home): Data value replies, Data write-backs.
    • Interventions (Home Remote): Invalidate, Fetch, Fetch/invalidate.
    • Acknowledgments: Essential for tracking the completion of invalidations before allowing a write to proceed, thereby enforcing memory consistency.
  • Core Directory States:
    • Shared: One or more nodes cache the block, and the main memory copy is up to date.
    • Uncached: No nodes currently cache the block.
    • Modified (Exclusive): Exactly one node (the owner) caches the block and has modified it, rendering main memory out of date.
  • Sharers Set: A data structure, typically a bit vector with one bit per node in an N-node system, recording which specific nodes currently cache the block.
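A directory entry therefore needs only the state and the sharers set. A minimal sketch for an N-node system (kNodes and all names are illustrative):

```cpp
#include <bitset>
#include <cstdint>

constexpr int kNodes = 64;               // N nodes in the system (illustrative)

enum class DirState : std::uint8_t { Uncached, Shared, Modified };

// One entry per memory block homed at this node.
struct DirectoryEntry {
    DirState state = DirState::Uncached;
    std::bitset<kNodes> sharers;         // bit i set => node i caches the block
};
```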

Directory State Transitions

Directory actions are triggered by external messages, specifically read misses, write misses, and data write-backs.

  • Uncached State Transitions:
    • Read miss: The directory sends data to the requesting node, adds the node to the Sharers set, and transitions to the Shared state.
    • Write miss: The directory sends data to the requesting node, sets the Sharers set to identify this node as the exclusive owner, and transitions to the Exclusive state.
  • Shared State Transitions:
    • Read miss: The directory sends data to the requesting node and adds the node to the Sharers set.
    • Write miss: The directory sends data to the requesting node, transmits invalidate messages to all nodes currently in the Sharers set, updates Sharers to contain only the requesting node, and transitions to the Exclusive state.
  • Exclusive State Transitions:
    • Read miss: The directory sends a data fetch message to the owner. The owner transitions its local cache state to Shared and sends the data back to the directory. The directory updates main memory, sends the data to the requester, adds the requester to the Sharers set, and transitions to the Shared state.
    • Write miss: The directory sends a fetch/invalidate message to the old owner. The old owner invalidates its local copy and sends the data to the directory, which routes it to the new requester. The Sharers set is updated to the new owner, and the directory state remains Exclusive.
    • Data write-back: Triggered when the owner replaces the block. The directory updates main memory, clears the Sharers set, and transitions to the Uncached state.
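These transitions condense into a single home-node handler. The sketch below reuses the DirectoryEntry type from the earlier sketch and elides all data movement and acknowledgment tracking (names are hypothetical):

```cpp
enum class Msg { ReadMiss, WriteMiss, WriteBack };

// Home-node directory handler for one block (data replies elided).
void onMessage(DirectoryEntry& e, Msg msg, int requester) {
    switch (e.state) {
    case DirState::Uncached:
        e.sharers.reset(); e.sharers.set(requester);    // send data to requester
        e.state = (msg == Msg::ReadMiss) ? DirState::Shared : DirState::Modified;
        break;
    case DirState::Shared:
        if (msg == Msg::ReadMiss) {
            e.sharers.set(requester);                   // send data, add sharer
        } else {                                        // write miss:
            // invalidate every current sharer, then grant exclusive ownership
            e.sharers.reset(); e.sharers.set(requester);
            e.state = DirState::Modified;
        }
        break;
    case DirState::Modified:
        if (msg == Msg::ReadMiss) {                     // fetch from owner
            e.sharers.set(requester);                   // owner remains a sharer
            e.state = DirState::Shared;
        } else if (msg == Msg::WriteMiss) {             // fetch/invalidate old owner
            e.sharers.reset(); e.sharers.set(requester);
        } else {                                        // write-back on replacement
            e.sharers.reset();
            e.state = DirState::Uncached;
        }
        break;
    }
}
```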

The diagram has three states — Invalid, Shared (read only), Exclusive (read/write) — identical to the snooping controller. Local processor requests are shown in black; messages from the home directory in gray. Explicit invalidates and write-backs replace the bus-broadcast write misses of snooping.

Processor-side transitions (black):

| From      | Stimulus    | To        | Action sent to home       |
|-----------|-------------|-----------|---------------------------|
| Invalid   | Read miss   | Shared    | Send read miss            |
| Invalid   | Write miss  | Exclusive | Send write miss           |
| Shared    | Write       | Exclusive | Send invalidate (upgrade) |
| Exclusive | Replacement | Invalid   | Send data write-back      |

Read/write hits in Shared or Exclusive require no message.

Directory-side transitions (gray):

| From      | Message from home | To      | Action                          |
|-----------|-------------------|---------|---------------------------------|
| Shared    | Invalidate        | Invalid | Discard local copy              |
| Exclusive | Fetch             | Shared  | Send data to directory          |
| Exclusive | Fetch/invalidate  | Invalid | Send data to directory; discard |

Hybrid Implementations

Modern multichip architectures utilize a hybrid of snooping and directory-based coherence to manage complexity and bandwidth.

  • Intra-chip Coherence: Hardware at the LLC level maintains coherence among the private core caches. If the LLC is inclusive, directory information associates directly with the LLC block tags.
  • Inter-chip Coherence: Directories are located at the memory interface to track which socket LLCs hold copies of a block. Directory state per block is small — 1 to 16 bits.
  • Directory Caching: Directory information can be cached to reduce physical storage overhead. A directory cache miss is handled conservatively by assuming the block is present and broadcasting invalidates to all clusters.
  • Snoop-Directory Reversal: In architectures like AMD EPYC, snooping occurs at the LLC directories. An LLC directory hit triggers explicit invalidates only to the specific L2 caches holding copies.

NUMA Performance

NUMA architectures utilizing distributed directories are evaluated using highly parallel workloads, such as WWW search indexing.

  • Workload Characteristics (WWW Search):
    • Exhibits high request-level parallelism with minimal interprocess communication, yielding near-linear speedup with increasing core counts.
    • System performance is strictly bottlenecked by capacity misses in the LLC, as true sharing and conflict misses are negligible. Switching to a fully associative cache reduces the miss rate by only 7%, confirming capacity misses dominate.
    • L3 misses must access off-chip DRAM, consuming significantly more latency and energy than L3 hits.
  • Block Size Optimizations:
    • Increasing cache block size decreases the capacity miss rate by exploiting spatial locality, but simultaneously increases interconnect traffic.
    • L1 instruction and L3 caches see significant miss rate reductions as block size grows.
    • L1 data cache miss rates drop initially as block size increases but rise again at the largest sizes.
    • For the L3 specifically, larger blocks yield only a modest traffic increase alongside a significant miss rate reduction, making the tradeoff compelling.
    • Because larger blocks heighten coherence traffic, selecting an optimal block size requires balancing miss rate reduction against the interconnect traffic generated by data transfers and directory messages.
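The last point can be made quantitative: data traffic per reference scales with the product of miss rate and block size, so a larger block only wins if the miss rate falls faster than the block grows. With illustrative numbers:

```latex
\text{traffic per reference} \;\propto\; \text{miss rate} \times \text{block size},
\qquad
\frac{0.007 \times 128\,\text{B}}{0.010 \times 64\,\text{B}} = 1.4
```

That is, doubling the block from 64 B to 128 B while the miss rate falls from 1.0% to 0.7% still raises traffic by 40%.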