Maintaining Cache Coherence with Directories

Snooping protocols require broadcast communication on every cache miss, which limits scalability in distributed-memory multiprocessors. Directory protocols resolve this by maintaining the sharing status of every cached physical memory block in a single, known location called a directory.

  • Single Multicore (Inclusive LLC): The directory is implemented as a bit vector per Last Level Cache (LLC) block, indicating which private L2 caches hold copies.
  • Single Multicore (Non-inclusive LLC): Directory information is maintained in a separate hardware structure or via a duplicate copy of the L2 tags.
  • Multichip NUMA Systems: Directories are distributed alongside physical memory, ensuring that different coherence requests route to different memory interfaces.

To manipulate these directory structures without broadcasting, systems rely on a targeted, message-based communication protocol.

Directory Protocol Basics

The directory tracks both the state of each block and the specific nodes holding copies to eliminate broadcast requirements.

  • Node Classifications:
    • Local node: The node originating the memory request.
    • Home node: The node housing the physical memory location and its corresponding directory entry.
    • Remote node: A node currently holding a cached copy of the block, either in a shared or exclusive state.
  • Message Types:
    • Requests (Local → Home): Read miss, Write miss.
    • Data responses (Home → Local, Remote → Home): Data value replies, Data write-backs.
    • Interventions (Home → Remote): Invalidate, Fetch, Fetch/invalidate.
    • Acknowledgments: Essential for tracking the completion of invalidations before allowing a write to proceed, thereby enforcing memory consistency.
  • Core Directory States:
    • Shared: One or more nodes cache the block, and the main memory copy is up to date.
    • Uncached: No nodes currently cache the block.
    • Modified (Exclusive): Exactly one node (the owner) caches the block and has modified it, rendering main memory out of date.
  • Sharers Set: A data structure—typically a bit vector with one bit per node—recording which specific nodes currently cache the block.
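The per-block directory record described above (a state plus a sharers bit vector) can be sketched directly in code. This is a minimal illustration under our own naming; `DirectoryEntry` and its methods are not drawn from any specific machine:

```python
from enum import Enum

class State(Enum):
    UNCACHED = 0    # no cached copies; memory is up to date
    SHARED = 1      # one or more read-only copies; memory up to date
    EXCLUSIVE = 2   # exactly one modified copy; memory is stale

class DirectoryEntry:
    """Sketch of one directory entry, kept at the block's home node."""
    def __init__(self, num_nodes: int):
        self.num_nodes = num_nodes
        self.state = State.UNCACHED
        self.sharers = 0            # bit i set => node i caches the block

    def add_sharer(self, node: int) -> None:
        self.sharers |= 1 << node

    def sharer_list(self) -> list[int]:
        return [i for i in range(self.num_nodes) if (self.sharers >> i) & 1]
```

The bit-vector representation is what makes the scheme precise: on a write miss, the directory can enumerate exactly the nodes that need invalidates instead of broadcasting.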

The interaction between these node types and message classes defines the state machine governing block transitions at the directory.

Directory State Transitions

Directory actions are triggered by external messages, specifically read misses, write misses, and data write-backs.

  • Uncached State Transitions:
    • Read miss: The directory sends data to the requesting node, adds the node to the Sharers set, and transitions to the Shared state.
    • Write miss: The directory sends data to the requesting node, sets the Sharers set to identify this node as the exclusive owner, and transitions to the Exclusive state.
  • Shared State Transitions:
    • Read miss: The directory sends data to the requesting node and adds the node to the Sharers set.
    • Write miss: The directory sends data to the requesting node, transmits invalidate messages to all nodes currently in the Sharers set, updates Sharers to contain only the requesting node, and transitions to the Exclusive state.
  • Exclusive State Transitions:
    • Read miss: The directory sends a data fetch message to the owner. The owner transitions its local cache state to Shared and sends the data back to the directory. The directory updates main memory, sends the data to the requester, adds the requester to the Sharers set, and transitions to the Shared state.
    • Write miss: The directory sends a fetch/invalidate message to the old owner. The old owner invalidates its local copy and sends the data to the directory, which routes it to the new requester. The Sharers set is updated to the new owner, and the directory state remains Exclusive.
    • Data write-back: Triggered when the owner replaces the block. The directory updates main memory, clears the Sharers set, and transitions to the Uncached state.

While these state transitions form the theoretical foundation, physical implementation in modern systems often requires hybrid approaches scaling across multiple cache levels.

Multicore and Hybrid Implementations

Modern multichip architectures utilize a hybrid of snooping and directory-based coherence to manage complexity and bandwidth.

  • Intra-chip Coherence: Hardware at the LLC level maintains coherence among the private core caches. If the LLC is inclusive, directory information associates directly with the LLC block tags.
  • Inter-chip Coherence: Directories are located at the memory interface to track which socket LLCs hold copies of a block.
  • Directory Caching: Directory information can be cached to reduce physical storage overhead. A directory cache miss is handled conservatively by assuming the block is present and broadcasting invalidates to all clusters.
  • Snoop-Directory Reversal: In architectures like AMD EPYC, snooping occurs at the LLC directories. An LLC directory hit triggers explicit invalidates only to the specific L2 caches holding copies.
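A back-of-the-envelope calculation shows why caching directory information (rather than storing a full entry per memory block) is attractive. Every parameter below is assumed purely for illustration, not taken from any specific system:

```python
# All values are assumed for illustration only.
mem_bytes   = 64 * 2**30   # physical memory behind one home node: 64 GiB
block_bytes = 64           # coherence block size
nodes       = 64           # node count tracked by the full bit vector
state_bits  = 2            # Uncached / Shared / Exclusive

blocks     = mem_bytes // block_bytes          # directory entries required
entry_bits = nodes + state_bits                # bit vector + state per entry
dir_gib    = blocks * entry_bits / 8 / 2**30   # total directory storage

print(f"{dir_gib:.2f} GiB of directory state")        # 8.25 GiB
print(f"{100 * dir_gib / 64:.1f}% of tracked memory") # ~12.9%
```

With these assumed numbers, a full bit-vector directory consumes over a tenth of the memory it tracks, and the overhead grows linearly with node count; hence the appeal of a small directory cache backed by the conservative broadcast fallback described above.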

The efficiency of these hybrid directory implementations heavily influences system performance, particularly under the memory demands of large-scale NUMA workloads.

Performance Characteristics in NUMA Systems

NUMA architectures utilizing distributed directories are evaluated using highly parallel workloads, such as WWW search indexing.

  • Workload Characteristics (WWW Search):
    • Exhibits high request-level parallelism with minimal interprocess communication, yielding near-linear speedup with increasing core counts.
    • System performance is bottlenecked primarily by capacity misses in the LLC, as true sharing and conflict misses are negligible.
    • L3 misses must access off-chip DRAM, consuming significantly more latency and energy than L3 hits.
  • Block Size Optimizations:
    • Increasing cache block size decreases the capacity miss rate by exploiting spatial locality, but simultaneously increases interconnect traffic.
    • L1 Instruction and L3 caches see significant miss rate reductions as block size grows.
    • L1 Data cache miss rates drop initially with increasing block size, but increase again at the largest sizes.
    • Because larger blocks heighten coherence traffic, selecting an optimal block size requires balancing miss rate reduction against the interconnect traffic generated by data transfers and directory messages.
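The tradeoff in the last point can be made concrete with a toy model: interconnect traffic per memory reference is roughly the miss rate times the bytes moved per miss (the block plus a fixed per-miss message overhead). The miss rates below are invented solely to show the shape of the curve, not measured from any workload:

```python
# Hypothetical miss rates per memory reference, invented for illustration.
miss_rate = {32: 0.020, 64: 0.012, 128: 0.008, 256: 0.006}
msg_bytes = 8   # assumed fixed per-miss coherence-message overhead

for block, mr in miss_rate.items():
    traffic = mr * (block + msg_bytes)   # bytes of interconnect traffic / ref
    print(f"{block:4d} B blocks: miss rate {mr:.3f}, traffic {traffic:.2f} B/ref")
```

Even with these made-up numbers the pattern emerges: growing blocks from 32 B to 256 B cuts the miss rate by more than 3x yet roughly doubles the traffic per reference, which is exactly the balance an optimal block size must strike.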