Core Cache Concepts
Caches buffer data to exploit the principle of locality, reducing average memory access time by keeping frequently used data closer to the processor. Requests that find their data in the cache are hits; requests that do not are misses. Cache operations are defined by four primary design questions:
Q1: Block Placement
Block placement determines where a fetched memory block can be stored in the cache.
- Direct-Mapped: Each memory block has exactly one valid cache location.
- Set Associative: The cache is divided into sets; a block maps to a single set but can be placed in any block frame within that set.
- Fully Associative: A block can be placed in any block frame within the cache.
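The three placement policies can be viewed as one scheme parameterized by associativity. The sketch below (hypothetical frame counts, not from the source) shows which frames of an 8-frame cache may hold a given block under each policy:

```python
# Sketch: which cache frame(s) may hold a given block address, assuming a
# hypothetical cache with num_frames frames and assoc ways per set.

def candidate_frames(block_address: int, num_frames: int, assoc: int) -> list[int]:
    """Return the frame indices where the block may be placed."""
    num_sets = num_frames // assoc          # one set => fully associative
    set_index = block_address % num_sets    # direct-mapped when assoc == 1
    return [set_index * assoc + way for way in range(assoc)]

# 8 frames total, block address 12:
print(candidate_frames(12, 8, 1))  # direct-mapped: [4]
print(candidate_frames(12, 8, 2))  # 2-way set associative: [0, 1]
print(candidate_frames(12, 8, 8))  # fully associative: [0, 1, 2, 3, 4, 5, 6, 7]
```

Direct-mapped is the `assoc == 1` extreme and fully associative the `assoc == num_frames` extreme of the same mapping.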
Q2: Block Identification
Caches identify stored blocks using physical or virtual address partitioning.
- Address Fields: The processor address is partitioned into a Block Address (Tag + Index) and a Block Offset.
- Tag: Used to search all blocks in the selected set in parallel.
- Index: Used to select the specific cache set, minimizing the tag size. Fully associative caches do not use an index field.
- Valid Bit: Indicates whether the matched cache entry contains valid data.
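The field partitioning can be sketched as a bit-slicing function; the cache parameters below (16 KiB direct-mapped, 64-byte blocks) match the example hierarchy at the end of this section, and the address value is an arbitrary illustration:

```python
# Sketch: splitting an address into tag, index, and block offset for a
# hypothetical cache; all sizes are assumed to be powers of two.

def split_address(addr: int, cache_bytes: int, block_bytes: int, assoc: int):
    num_sets = cache_bytes // (block_bytes * assoc)
    offset_bits = block_bytes.bit_length() - 1   # log2(block size)
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    offset = addr & (block_bytes - 1)            # byte within the block
    index = (addr >> offset_bits) & (num_sets - 1)  # selects the set
    tag = addr >> (offset_bits + index_bits)     # compared against stored tags
    return tag, index, offset

# 16 KiB direct-mapped cache, 64-byte blocks: 256 sets, 6 offset + 8 index bits.
print(split_address(0x12345, 16 * 1024, 64, 1))  # (4, 141, 5)
```

A fully associative cache would make `num_sets == 1`, so the index field disappears and the whole block address becomes the tag.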
Q3: Block Replacement
On a cache miss, if every frame in the target set is occupied, a resident block must be evicted to make room for the newly fetched block.
- Least Recently Used (LRU): Evicts the block that has gone the longest without being accessed. Expensive to compute for high associativity.
- Pseudo-LRU: Approximates LRU using a set of bits per set to track recently accessed ways, randomly replacing a block from an un-accessed way.
- First In, First Out (FIFO): Evicts the oldest block in the set.
- Random: Evicts a randomly chosen block; simple to implement in hardware.
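LRU within a single set can be sketched with a recency-ordered list; the tags and 2-way configuration below are hypothetical:

```python
# Sketch: LRU replacement within one cache set, tracked as a list with the
# most recently used tag last. Returns the sequence of evicted tags.

def lru_set(accesses, ways=2):
    resident = []       # most recently used tag is last
    evictions = []
    for tag in accesses:
        if tag in resident:
            resident.remove(tag)               # hit: refresh recency
        elif len(resident) == ways:
            evictions.append(resident.pop(0))  # miss, set full: evict LRU
        resident.append(tag)
    return evictions

print(lru_set(['A', 'B', 'A', 'C', 'B', 'D']))  # ['B', 'A', 'C']
```

Real hardware avoids this list walk: true LRU needs state that grows quickly with associativity, which is why pseudo-LRU approximations are preferred for highly associative caches.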
Q4: Write Strategy
Writes account for approximately 28% of data cache traffic and require tag verification before modification to prevent corruption.
- Write-Through: Data is written to both the cache block and the lower-level memory. Typically paired with a no-write allocate policy, where write misses update main memory directly without loading the block into the cache.
- Write-Back: Data is written only to the cache. The block is written to main memory only upon eviction, requiring a dirty bit to track modification. Typically paired with a write allocate policy, where write misses fetch the block into the cache before modification.
- Write Buffer: Used to prevent write stalls by overlapping processor execution with memory updates.
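The traffic difference between the two write policies can be sketched by counting writes that reach the next memory level; the function and numbers are a hypothetical single-block model:

```python
# Sketch: lower-level write traffic generated by repeated stores to one
# cached block under each policy (hypothetical model, hits only).

def lower_level_writes(stores: int, policy: str) -> int:
    if policy == "write-through":
        return stores                  # every store also updates lower memory
    if policy == "write-back":
        return 1 if stores else 0      # dirty block written once, on eviction
    raise ValueError(f"unknown policy: {policy}")

print(lower_level_writes(10, "write-through"))  # 10 writes reach memory
print(lower_level_writes(10, "write-back"))     # 1 write, at eviction
```

This is why write-back caches need the dirty bit: it records whether that single eviction-time write is necessary at all.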
Cache Performance Metrics
Evaluating cache performance relies on memory stall cycles and average access times rather than instruction count alone.
Execution Time and Memory Stalls
CPU execution time factors in stalls caused by the memory hierarchy:
CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time
Memory stall cycles are calculated from the miss rate and miss penalty per access:
Memory stall cycles = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
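A worked example with hypothetical numbers, plugging into the stall-cycle calculation:

```python
# Hypothetical workload: CPU time including memory stall cycles.
ic = 1_000_000            # instruction count
cpi_exec = 1.0            # CPI ignoring memory stalls
accesses_per_instr = 1.5  # instruction fetch + data references per instruction
miss_rate = 0.02          # 2% of accesses miss
miss_penalty = 100        # cycles per miss
clock_ns = 0.5            # clock cycle time in ns

stall_cycles = ic * accesses_per_instr * miss_rate * miss_penalty
cpu_time_ns = (ic * cpi_exec + stall_cycles) * clock_ns
print(stall_cycles, cpu_time_ns)
```

With these numbers the stalls (3 million cycles) dominate the 1 million execution cycles, which is the usual motivation for the optimizations below.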
Average Memory Access Time (AMAT)
AMAT provides a normalized metric for memory hierarchy efficiency:
AMAT = Hit time + Miss rate × Miss penalty
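A one-line numeric instance (hypothetical parameters):

```python
# Hypothetical single-level cache: 1-cycle hit, 5% miss rate, 100-cycle penalty.
hit_time, miss_rate, miss_penalty = 1.0, 0.05, 100
amat = hit_time + miss_rate * miss_penalty
print(amat)  # about 6 cycles per access on average
```

Even a 5% miss rate multiplies the effective access time sixfold here, which is why miss rate, miss penalty, and hit time each get their own optimization category below.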
Out-of-Order Execution Adjustments
Out-of-order processors can overlap memory latency with instruction execution. Miss penalty is redefined as the non-overlapped latency that stalls the processor:
Memory stall cycles / Instruction = (Misses / Instruction) × (Total miss latency − Overlapped miss latency)
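A small numeric sketch of the adjustment (hypothetical latencies):

```python
# Hypothetical out-of-order core: only the non-overlapped miss latency stalls.
misses_per_instr = 0.03
total_miss_latency = 100   # cycles from miss to data return
overlapped_latency = 30    # cycles hidden behind independent instructions

stalls_per_instr = misses_per_instr * (total_miss_latency - overlapped_latency)
print(stalls_per_instr)
```

An in-order core would charge the full 100 cycles per miss; the out-of-order model charges only the 70 cycles it actually waits.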
Miss Classification: The Three C’s Model
Cache misses are categorized into three fundamental types to guide optimization strategies:
- Compulsory (Cold-Start): The first access to a block that has never been in the cache. These occur even in infinite-sized caches.
- Capacity: Occurs when the cache cannot hold all the blocks needed during program execution, causing active blocks to be discarded and retrieved (thrashing).
- Conflict (Collision): Occurs in set-associative or direct-mapped caches when multiple required blocks compete for the same set, leading to evictions that would not occur in a fully associative cache.
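The three C's can be sketched as a trace-driven classifier, assuming LRU and comparing a direct-mapped cache against a fully associative cache of the same total size (all parameters hypothetical):

```python
# Sketch: three C's miss classification. A miss is compulsory if the block
# was never referenced before; capacity if a fully associative LRU cache of
# equal size would also miss; otherwise conflict (placement-induced).

def classify_misses(trace, num_blocks):
    dm = {}      # direct-mapped cache: set index -> resident block
    fa = []      # fully associative LRU model, most recent last
    seen = set()
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
    for block in trace:
        fa_hit = block in fa               # update the fully associative model
        if fa_hit:
            fa.remove(block)
        elif len(fa) == num_blocks:
            fa.pop(0)                      # evict LRU block
        fa.append(block)
        idx = block % num_blocks           # now the direct-mapped cache
        if dm.get(idx) == block:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1
        elif not fa_hit:
            counts["capacity"] += 1
        else:
            counts["conflict"] += 1        # full associativity would have hit
        dm[idx] = block
        seen.add(block)
    return counts

# Blocks 0 and 4 collide in a 4-block direct-mapped cache but coexist in the
# fully associative model, so their repeat misses classify as conflicts.
print(classify_misses([0, 4, 0, 4, 1, 2], num_blocks=4))
```

This is the standard way the categories are measured: the fully associative same-size cache separates capacity misses from the conflicts introduced by restricted placement.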
Six Basic Cache Optimizations
Reducing Miss Rate
- Larger Block Size: Exploits spatial locality to reduce compulsory misses. Trade-offs: Increases miss penalty and may increase conflict/capacity misses if the block size is too large relative to cache size.
- Larger Caches: Directly reduces capacity misses. Trade-offs: Increases hit time, manufacturing cost, and static/dynamic power consumption.
- Higher Associativity: Reduces conflict misses. Follows the 2:1 Cache Rule of Thumb: a direct-mapped cache of size N has roughly the same miss rate as a 2-way set-associative cache of size N/2. Trade-offs: Multiplexing logic increases hit time and may slow down the processor clock.
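The block-size trade-off is easy to see on a sequential scan, where a cold cache takes one compulsory miss per distinct block (hypothetical footprint and block sizes):

```python
# Sketch: larger blocks exploit spatial locality. Touching 4096 consecutive
# bytes one at a time, a cold cache misses once per distinct block.

def compulsory_misses(bytes_touched: int, block_size: int) -> int:
    return bytes_touched // block_size   # distinct blocks referenced

for block_size in (16, 32, 64, 128):
    print(block_size, compulsory_misses(4096, block_size))
```

Doubling the block size halves the compulsory misses for this access pattern, but each miss now transfers twice as much data (higher miss penalty), and fewer distinct blocks fit in the cache.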
Reducing Miss Penalty
- Multilevel Caches: Introduces an L2 (or L3) cache to bridge the gap between fast L1 caches and slow main memory.
- AMAT for Two Levels: AMAT = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Local miss rate(L2) × Miss penalty(L2))
- Global vs. Local Miss Rate: The local miss rate (L2 misses ÷ accesses that reach L2) is the fraction of misses in the L2 cache based only on L1 left-overs. The global miss rate (Miss rate(L1) × Local miss rate(L2)) measures the fraction of total processor memory accesses that must go to main memory.
- Inclusion vs. Exclusion: Multilevel inclusion forces all L1 data to reside in L2, simplifying consistency checks. Multilevel exclusion ensures L1 data is never duplicated in L2, optimizing total capacity when L2 is only slightly larger than L1.
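The two-level AMAT and the local/global distinction, with hypothetical rates and latencies:

```python
# Hypothetical two-level hierarchy: local vs. global L2 miss rate and AMAT.
hit_l1, miss_rate_l1 = 1.0, 0.04   # L1: 1-cycle hit, 4% miss rate
hit_l2, local_miss_l2 = 10.0, 0.5  # L2: 10-cycle hit, 50% local miss rate
penalty_l2 = 200                   # main-memory penalty in cycles

global_miss_l2 = miss_rate_l1 * local_miss_l2   # fraction reaching memory
amat = hit_l1 + miss_rate_l1 * (hit_l2 + local_miss_l2 * penalty_l2)
print(global_miss_l2, amat)
```

The local L2 miss rate looks alarming (50%) because L1 already filtered out the easy hits, but globally only 2% of all accesses go to main memory, and the average access costs about 5.4 cycles instead of a 100+ cycle trip per L1 miss.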
- Giving Priority to Read Misses Over Writes: Allows a read miss to bypass queued writes in the write buffer, provided there are no address conflicts. This significantly reduces the read miss penalty and prevents stalls.
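The conflict check that makes the bypass safe can be sketched as follows; the function names and memory model are hypothetical:

```python
# Sketch: a read miss services ahead of buffered writes, but must first check
# the write buffer for an address conflict (hypothetical model).

def service_read_miss(addr, write_buffer):
    """Return the value for the read: forwarded from the write buffer on a
    conflict, otherwise read from memory while the buffered writes wait."""
    for buf_addr, value in reversed(write_buffer):  # newest write wins
        if buf_addr == addr:
            return value              # conflict: forward the buffered data
    return read_from_memory(addr)     # no conflict: read bypasses the writes

def read_from_memory(addr):
    return memory.get(addr, 0)

memory = {0x100: 7}
print(service_read_miss(0x100, [(0x200, 1)]))  # no conflict: 7 from memory
print(service_read_miss(0x200, [(0x200, 1)]))  # conflict: forwarded 1
```

Forwarding from the buffer (rather than draining it) preserves correctness while still letting the read complete before the queued writes.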
Reducing Hit Time
- Avoiding Address Translation During Indexing: Uses the page offset (which is identical in both virtual and physical addresses) to index the cache directly.
- Virtually Indexed, Physically Tagged Caches: Tag comparison overlaps with address translation. Requires the direct-mapped cache to be no larger than the page size, so that the index bits come entirely from the untranslated page offset.
- Page Coloring: Software enforces constraints so that aliases share specific address bits, effectively increasing the page offset and allowing larger virtually indexed caches without physical address duplication (aliasing).
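The size constraint can be checked mechanically: the cache can be virtually indexed when its index and offset bits fit inside the page offset. The parameters below (16 KiB page, 16 KiB direct-mapped L1) match the example hierarchy described at the end of this section:

```python
# Sketch: can a cache be virtually indexed without aliasing problems?
# True when log2(bytes per way) <= log2(page size), i.e. the index and
# offset bits all lie within the untranslated page offset.

def virtually_indexable(cache_bytes: int, assoc: int, page_bytes: int) -> bool:
    index_plus_offset_bits = (cache_bytes // assoc).bit_length() - 1
    page_offset_bits = page_bytes.bit_length() - 1
    return index_plus_offset_bits <= page_offset_bits

print(virtually_indexable(16 * 1024, 1, 16 * 1024))  # True: 14 bits <= 14
print(virtually_indexable(64 * 1024, 1, 16 * 1024))  # False: 16 bits > 14
```

Raising associativity relaxes the limit (only bytes per way matter), and page coloring relaxes it further by pinning the extra index bits in software.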

Figure: The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 16 KiB. The TLB is two-way set associative with 256 entries. The L1 cache is a direct-mapped 16 KiB, and the L2 cache is a four-way set associative with a total of 4 MiB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 40 bits.