Advanced Optimizations for Memory Hierarchies

Advanced memory hierarchy optimizations address the processor-memory performance gap by targeting specific components of the average memory access time (AMAT) equation and system power consumption.
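The AMAT equation these optimizations target can be made concrete with a small calculation (the numbers below are illustrative examples, not from the text):

```c
#include <assert.h>

/* Average memory access time:
   AMAT = hit_time + miss_rate * miss_penalty (times in clock cycles). */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

With a 4-cycle hit time, a 2% miss rate, and a 100-cycle miss penalty, the average access costs 6 cycles; each optimization category below attacks one of the three terms.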

Optimizations are classified into five core categories:

  1. Reducing hit time and power consumption: Smaller caches and multibanked structures.
  2. Increasing cache bandwidth: Pipelined caches, multibanked caches, and nonblocking caches.
  3. Reducing miss penalty: Critical word first, merging write buffers, and multibank/High Bandwidth Memory (HBM).
  4. Reducing miss rate: Compiler optimizations and optimized associativity.
  5. Reducing miss penalty or miss rate via parallelism: Hardware and compiler prefetching.

1. Pipelined L1 Caches with Virtual Indexing and Set Associativity

Pipelining the Level 1 (L1) cache enables higher processor clock rates at the cost of slightly increased cache latency (e.g., 4 cycles for an instruction cache access).

  • Standard Pipeline Stages:
    1. Address decode to select the word line.
    2. Access the SRAM array.
    3. Compare tags against the physical tag from the Translation Lookaside Buffer (TLB) to determine hit/miss.
    4. Multiplex the block and transmit data.
  • Way Prediction: Reduces the energy and hit time of set-associative caches to match direct-mapped caches. Extra bits predict the specific “way” (block within a set) of the next access. The multiplexor is set early, requiring only a single tag comparison per cycle. A misprediction requires checking the remaining blocks in the next cycle.
  • Victim Caches: Small caches that hold recently evicted lines to reduce the miss rate of direct-mapped L1 caches. Modern designs often rely on higher L1 associativity and prefetching instead.
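A minimal sketch of the way-prediction table described above (the cache geometry and the last-hit training policy are illustrative assumptions):

```c
#include <assert.h>
#include <stdint.h>

#define SETS 64   /* illustrative number of cache sets */

/* A few bits per set remember which way hit last time; only that
   way's tag is compared on the next access. */
static uint8_t predicted_way[SETS];

/* Way to probe first. On a misprediction, the remaining ways are
   checked in the following cycle (handled by the pipeline, not here). */
static uint8_t predict(uint32_t set) { return predicted_way[set % SETS]; }

/* Train the predictor on the way that actually hit. */
static void train(uint32_t set, uint8_t hit_way) {
    predicted_way[set % SETS] = hit_way;
}
```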

2. Multiple Banks and Ports to Increase L1 Data Cache Bandwidth

Multiple-issue processors require data caches to service multiple memory references per clock cycle.

  • Banking: The cache is divided into independent banks. Block addresses are interleaved across banks sequentially. This allows multiple parallel accesses provided they target different banks.
  • Dual-Porting vs. Banking: Dual-porting increases bandwidth directly but lengthens cache access time and enlarges the array area. A multibanked cache with dual ports on each bank reduces the probability of collisions (three concurrent references to the same bank are required to cause a stall).
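The sequential interleaving rule can be sketched as follows (the bank count and modulo mapping are illustrative assumptions):

```c
#include <assert.h>
#include <stdint.h>

/* Sequential interleaving: consecutive block addresses map to
   consecutive banks, so a streaming access pattern spreads evenly
   across all banks. NUM_BANKS is an illustrative power of two. */
#define NUM_BANKS 4u

static uint32_t bank_of(uint32_t block_addr) {
    return block_addr % NUM_BANKS;   /* low-order block-address bits */
}

/* Two single-ported accesses can proceed in parallel iff they target
   different banks. */
static int conflict(uint32_t a, uint32_t b) {
    return bank_of(a) == bank_of(b);
}
```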

3. Advanced Replacement Policies (Pseudo-LRU)

Higher associativity requires sophisticated replacement policies to minimize miss rates without the hardware overhead of strict Least Recently Used (LRU) tracking.

  • Not Recently Used (NRU): A 1-bit-per-block pseudo-LRU scheme. The bit is cleared when the block is accessed. On a miss, a block whose NRU bit is still set is replaced; if every bit in the set has been cleared, all bits are set again and a victim is chosen at random.
  • Recently Referenced Prediction Counter: Uses a saturating counter (e.g., 2 bits) per block. The counter is set to 0 on a hit. On a miss, if any block's counter equals the maximum value (e.g., 3), that block is replaced; otherwise, all counters are incremented and the search repeats. This explicitly distinguishes streaming accesses (used once) from data with high temporal locality.
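The NRU scheme above can be sketched as a small simulation (a deterministic victim choice stands in for the random pick a real design would use):

```c
#include <assert.h>

#define WAYS 4

/* One NRU bit per block: 1 = not recently used = replacement candidate. */
typedef struct { unsigned char nru[WAYS]; } nru_set_t;

static void nru_init(nru_set_t *s) {
    for (int i = 0; i < WAYS; i++) s->nru[i] = 1;
}

/* On an access, mark the block recently used. */
static void nru_touch(nru_set_t *s, int way) { s->nru[way] = 0; }

/* On a miss, pick a victim: the first block whose bit is still set.
   If every bit is clear, set all bits again and pick way 0 (a real
   design would pick randomly; the fixed choice keeps this deterministic). */
static int nru_victim(nru_set_t *s) {
    for (int i = 0; i < WAYS; i++)
        if (s->nru[i]) return i;
    for (int i = 0; i < WAYS; i++) s->nru[i] = 1;
    return 0;
}
```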

4. Multibanked L2 and L3 Caches

Extending banking to Last Level Caches (LLC, such as L2 and L3) improves performance and power characteristics.

  • Power Reduction: Accessing a single bank instead of the entire cache array reduces dynamic power consumption.
  • Bandwidth: Increases the refill bandwidth necessary to support nonblocking L1/L2 caches.

5. Nonblocking Caches to Increase Cache Bandwidth

Out-of-order (OOO) processors do not need to stall immediately on a data cache miss. Nonblocking (or lockup-free) caches continue to supply data for subsequent hits during a miss (“hit under miss”).

  • Miss Under Miss: Modern memory systems overlap multiple outstanding misses to further hide latency.
  • Miss Status Handling Registers (MSHR): Hardware structure used to track outstanding misses. MSHRs record the requesting instruction, the destination physical register, and the cache block location. When data returns, the MSHR routes the data, updates the cache, and notifies the scheduler to wake up dependent instructions.
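An MSHR file might be sketched as follows; the field names, entry count, and lookup-by-block merging are illustrative assumptions, not details from the text:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_MSHRS 8

/* One entry per outstanding miss. */
typedef struct {
    int      valid;
    uint64_t block_addr;  /* missing cache block */
    int      dest_reg;    /* destination physical register to write */
    int      rob_id;      /* requesting instruction, for scheduler wakeup */
} mshr_t;

static mshr_t mshrs[NUM_MSHRS];

/* Allocate an entry on a miss; -1 means all MSHRs are busy, so the
   processor must stall further misses. */
static int mshr_alloc(uint64_t block_addr, int dest_reg, int rob_id) {
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (!mshrs[i].valid) {
            mshrs[i] = (mshr_t){1, block_addr, dest_reg, rob_id};
            return i;
        }
    }
    return -1;
}

/* A later miss to the same block can merge into the existing entry
   instead of issuing a second memory request. */
static int mshr_lookup(uint64_t block_addr) {
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            return i;
    return -1;
}

/* When data returns, the entry is freed after the data is routed and
   dependent instructions are woken (omitted here). */
static void mshr_free(int idx) { mshrs[idx].valid = 0; }
```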

6. Critical Word First and Early Restart

Processors typically require only a single word from a block to continue execution. These optimizations avoid waiting for the full block transfer, reducing the effective miss penalty.

  • Critical Word First: The memory system prioritizes fetching the specifically requested word, passing it immediately to the processor, and filling the remainder of the block in the background.
  • Early Restart: The block is fetched in normal sequential order. Once the requested word arrives, it is forwarded to the processor to resume execution.
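The payoff of the two policies can be illustrated with rough timing arithmetic (the latency numbers in the test are made-up examples):

```c
#include <assert.h>

/* Cycles until the processor can restart after a miss, under a simple
   timing model: 'lat' cycles to the first word, then one word every
   'xfer' cycles, with 'words' words per block. */
static int full_block_penalty(int lat, int xfer, int words) {
    return lat + xfer * words;   /* wait for the whole block */
}

/* Critical word first: the requested word is transferred first. */
static int critical_word_first_penalty(int lat, int xfer) {
    return lat + xfer;
}

/* Early restart: the block arrives in order, so the wait depends on
   the requested word's position 'pos' (0-based) within the block. */
static int early_restart_penalty(int lat, int xfer, int pos) {
    return lat + xfer * (pos + 1);
}
```

Note that early restart matches critical word first only when the requested word happens to be the first word of the block.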

7. Compiler Optimizations to Reduce Miss Rate

Compilers reorder instructions and data accesses to explicitly improve spatial and temporal locality without hardware changes.

  • Loop Interchange: Swapping nested loops to align the access pattern with the memory layout. For example, changing a column-major access pattern to a row-major one ensures sequential words within the same cache block are accessed consecutively, reducing misses by improving spatial locality.
  • Cache Blocking: Operating on B × B submatrices (blocks) instead of entire rows or columns. This keeps a subset of the data continuously resident in the cache, maximizing temporal locality. For multiplying two N × N matrices, blocking reduces the total number of memory words accessed from roughly 2N³ + N² to 2N³/B + N².
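Both transformations are classic source-level rewrites; a blocked matrix multiply in C might look like this (the matrix size N and blocking factor B are illustrative, chosen so each B × B submatrix fits in cache):

```c
#include <string.h>

#define N 16
#define B 4   /* blocking factor */

/* Naive i-j-k matrix multiply, for reference. */
static void matmul(double x[N][N], double y[N][N], double z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0;
            for (int k = 0; k < N; k++) r += x[i][k] * y[k][j];
            z[i][j] = r;
        }
}

/* Blocked version: iterates over B x B submatrices of y so each
   submatrix stays resident in the cache while it is reused. */
static void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
    memset(z, 0, sizeof(double) * N * N);
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0;
                    for (int k = kk; k < kk + B; k++) r += x[i][k] * y[k][j];
                    z[i][j] += r;
                }
}
```

Both versions compute the same product; only the order of memory accesses, and hence the miss behavior, differs.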

8. Hardware Prefetching

Hardware predicts future data and instruction needs, fetching them into the cache or a separate buffer before they are requested.

  • Instruction Prefetch: Fetches the requested block and the subsequent sequential block. The sequential block is placed in an instruction stream buffer.
  • Data Prefetch: L2 data prefetchers often fetch data directly from L3 or memory into the L2 cache. Effective prefetching converts L3 misses into L2 hits, dramatically reducing latency.
  • Trade-offs: Prefetching increases overall memory traffic. Inaccurate prefetches displace useful data (cache pollution) and waste bandwidth, negatively impacting power consumption.
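A one-entry sequential stream buffer, as described for instruction prefetch above, can be simulated roughly as follows (the single-entry size and block-granularity model are simplifying assumptions):

```c
#include <assert.h>
#include <stdint.h>

/* On a miss to block b, block b goes into the cache and block b+1 is
   prefetched into the stream buffer. */
static uint64_t stream_buf;
static int      stream_buf_valid;

/* Returns 1 if the requested block is found in the stream buffer (the
   prefetch was useful); otherwise records a miss. Either way, the
   sequential successor is prefetched. */
static int fetch_block(uint64_t block) {
    int hit = stream_buf_valid && stream_buf == block;
    stream_buf = block + 1;
    stream_buf_valid = 1;
    return hit;
}
```

A sequential instruction stream thus misses only on the first block; a branch to a distant target misses again and restarts the stream.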

9. Compiler-Controlled Prefetching

Compilers explicitly insert prefetch instructions to load data ahead of use.

  • Prefetch Types: Can be directed to a register or directly into the cache.
  • Non-faulting (Non-binding): Modern implementations use non-faulting instructions; if the prefetch generates a virtual memory fault or protection violation, it safely executes as a no-op.
  • Implementation: The compiler unrolls loops and schedules prefetch instructions to overlap the memory latency with useful execution. Requires nonblocking caches.
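With GCC or Clang, the prefetch visible to the compiler (or to a programmer) is the `__builtin_prefetch` intrinsic; a sketch, where the prefetch distance is an illustrative tuning parameter:

```c
#include <stddef.h>

#define DIST 16   /* illustrative prefetch distance, in elements */

/* Sum an array, prefetching DIST elements ahead so the memory latency
   of later elements overlaps with the current arithmetic. The prefetch
   is a non-faulting hint, so running past the end of the array is
   harmless. */
static long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + DIST], 0 /* read */, 1 /* low locality */);
        s += a[i];
    }
    return s;
}
```

A production compiler would instead unroll the loop and issue one prefetch per cache block rather than per element, to limit instruction overhead.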

10. Multiple Memory Buses and HBM Caches

High-speed processors use multiple memory channels to increase bandwidth and prevent independent accesses from stalling one another.

  • Nonuniform Memory Access (NUMA): Systems mix standard SDRAM and High Bandwidth Memory (HBM). Software or domain-specific languages explicitly manage data placement in the faster HBM.
  • HBM as Last Level Cache (L4): Using HBM as a massive cache introduces a tag storage problem. Storing tags in SRAM requires enormous area (e.g., 96 MiB of tags for a 1 GiB cache).
  • Alloy Cache: Molds the tag and data together within the HBM using a direct-mapped structure, so a single HBM burst transfers both tag and data and a hit completes in one HBM access. A miss predictor bypasses the initial HBM tag access when a request is highly likely to miss, avoiding that latency.
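The alloyed tag-plus-data layout can be sketched as a direct-mapped structure (the set count and block size are illustrative; a real HBM L4 has millions of sets and indexes by physical address):

```c
#include <stdint.h>
#include <stddef.h>

#define SETS        1024   /* illustrative; far smaller than a real L4 */
#define BLOCK_WORDS 8

/* Tag and data share one entry, so a single HBM burst returns both
   and the hit/miss check needs no separate tag access. */
typedef struct {
    uint64_t tag;
    int      valid;
    uint64_t data[BLOCK_WORDS];
} alloy_entry_t;

static alloy_entry_t l4[SETS];

/* Returns a pointer to the block data on a hit, NULL on a miss. */
static uint64_t *alloy_lookup(uint64_t block_addr) {
    alloy_entry_t *e = &l4[block_addr % SETS];   /* direct-mapped index */
    if (e->valid && e->tag == block_addr / SETS)
        return e->data;
    return NULL;
}

static void alloy_fill(uint64_t block_addr, const uint64_t *data) {
    alloy_entry_t *e = &l4[block_addr % SETS];
    e->valid = 1;
    e->tag = block_addr / SETS;
    for (int i = 0; i < BLOCK_WORDS; i++) e->data[i] = data[i];
}
```

Direct mapping is what makes the single-burst trick work: there is exactly one candidate entry per address, so no multi-way tag search is needed.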