Large and Fast: Exploiting Memory Hierarchy
The Principle of Locality and Memory Hierarchy
Computer programs do not access all code or data simultaneously with equal probability. Memory design relies on the principle of locality to provide the illusion of unbounded fast memory:
- Temporal locality: If a data location is referenced, it will tend to be referenced again soon (e.g., loop variables).
- Spatial locality: If a data location is referenced, nearby addresses will tend to be referenced soon (e.g., sequential array elements).
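A simple array sum, sketched below in C (illustrative, not from the source), exhibits both kinds at once: the accumulator and loop index are reused on every iteration, while the array elements are touched at consecutive addresses.

```c
/* Illustrative sketch: sum and i are reused on every iteration (temporal
 * locality); a[0], a[1], ... sit at consecutive addresses, so one fetched
 * cache block services several subsequent accesses (spatial locality). */
long sum_array(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```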
To exploit locality, memory is organized as a memory hierarchy consisting of multiple levels.
- Higher levels (closer to the processor) are smaller, faster, and more expensive per bit.
- Lower levels are larger, slower, and cheaper per bit.
- Block (or line): The minimum unit of information transferred between two adjacent levels.
- Hit: Requested data is found in the upper level. Hit rate is the fraction of accesses found; hit time is the time to access the upper level and determine the hit/miss status.
- Miss: Requested data is not found in the upper level. Miss rate (1 - hit rate) is the fraction of accesses not found. Miss penalty is the time required to fetch the block from the lower level, insert it, and pass it to the requestor.
The effectiveness of this hierarchical structuring is strictly governed by the physical properties of the underlying storage hardware.
Memory Technologies
Memory hierarchies are built from four primary technologies, each offering distinct density, cost, and speed tradeoffs:
- Static Random Access Memory (SRAM): Utilizes 6 to 8 transistors per bit to store data on a pair of inverting gates. SRAM retains data indefinitely as long as power is applied, requires no refresh, and has access times close to the cycle time (0.5–2.5 ns). Used primarily for caches.
- Dynamic Random Access Memory (DRAM): Utilizes a single transistor and a capacitor per bit, making it highly dense and cheap but requiring periodic refreshing (reading and rewriting the charge) to prevent data loss.
- Organized via two-level decoding: a row access strobe (RAS) activates a word line and latches the entire row, after which a column access strobe (CAS) selects the desired bits from those latches.
- Synchronous DRAM (SDRAM) transfers bursts of sequential data synchronized to a clock signal. Double Data Rate (DDR) SDRAM transfers data on both the rising and falling clock edges.
- Typically packaged in Dual Inline Memory Modules (DIMMs), mapped into interconnected ranks and banks to sustain high bandwidth.
- Flash Memory: A nonvolatile Electrically Erasable Programmable Read-Only Memory (EEPROM). Writes wear out flash memory bits (100,000 to 1,000,000 writes), necessitating hardware controllers that perform wear leveling to dynamically remap frequently written blocks.
- Magnetic Disk: Mechanical nonvolatile storage using rotating platters and read-write heads. Access time involves seek time (moving the head to the track), rotational latency (waiting for the sector to pass under the head), and transfer time.
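Two quick worked examples with assumed parameters (the specific module and drive figures are illustrative, not from the source). A DDR4-3200 DIMM with the standard 8-byte data path has a peak transfer rate of:

Peak_bandwidth = 3200 \times 10^6 transfers/s \times 8 bytes/transfer = 25.6 GB/s

For a hypothetical disk spinning at 15,000 RPM with a 4 ms average seek and a 100 MB/s transfer rate, reading a 512-byte sector costs roughly:

Average_rotational_latency = 0.5 rotation / (15,000/60 rotations per second) = 2.0 ms
Transfer_time = 0.5 KB / (100 MB/s) = 0.005 ms
Average_access_time \approx 4 ms + 2.0 ms + 0.005 ms \approx 6 ms

The millisecond-scale disk access, roughly a million times slower than SRAM, is what makes the hierarchy worthwhile.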
SRAM’s exceptional speed makes it the foundation for the highest level of the memory hierarchy: the cache.
Cache Fundamentals and Addressing
A cache intercepts processor memory requests. In a direct-mapped cache, each memory block maps to exactly one cache location, computed using modulo arithmetic: Cache_location = (Block_address) modulo (Number_of_cache_blocks).
To verify cache contents, the memory address is divided into three fields:
- Cache Index: The low-order bits used to select the specific cache block.
- Tag: The high-order bits compared against the cache’s stored tag to determine if the resident block matches the requested memory address.
- Block Offset: The lowest bits used to select the specific byte or word within a multiword cache block.
In addition to these address fields, each cache entry stores a Valid Bit: a single hardware bit indicating whether the block contains meaningful, initialized data.
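A bit-level sketch of this division, assuming (purely for illustration) 64-bit addresses and a direct-mapped cache with 1024 blocks of 64 bytes; because the block count is a power of two, the index computation is the modulo operation above expressed as a mask.

```c
#include <stdint.h>

/* Assumed geometry: 1024 blocks of 64 bytes each, direct-mapped. */
#define BLOCK_BYTES  64u    /* -> 6 block-offset bits */
#define NUM_BLOCKS   1024u  /* -> 10 index bits       */
#define OFFSET_BITS  6u
#define INDEX_BITS   10u

static inline uint64_t block_offset(uint64_t addr) { return addr & (BLOCK_BYTES - 1); }

/* Equivalent to (block address) modulo (number of blocks) for power-of-two sizes. */
static inline uint64_t cache_index(uint64_t addr)  { return (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1); }

static inline uint64_t tag_bits(uint64_t addr)     { return addr >> (OFFSET_BITS + INDEX_BITS); }
```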
Handling writes introduces cache-memory inconsistency. Two primary write policies resolve this:
- Write-through: Data is written to both the cache and the lower-level memory. To avoid stalling on every store, a write buffer holds the data while it is written to memory, letting the processor continue; the processor stalls only when the buffer is full.
- Write-back: Data is written only to the cache. A dirty bit tracks modifications. The block is written to main memory only when it is evicted/replaced.
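A minimal sketch of the write-back bookkeeping, using a hypothetical cache_line struct and a stand-in write_to_memory() routine (neither is from the source): stores only mark the line dirty, and the block is flushed when it is evicted. A write-through design would instead call write_to_memory() on every store.

```c
#include <stdbool.h>
#include <stdint.h>

struct cache_line {
    bool     valid;
    bool     dirty;          /* write-back only: cache copy newer than memory */
    uint64_t tag;
    uint8_t  data[64];
};

/* Stand-in for the lower-level memory interface (reconstructing the full
 * block address from tag and index is omitted in this sketch). */
void write_to_memory(uint64_t tag, const uint8_t *data);

/* Write-back store hit: update only the cache and mark the line dirty. */
void writeback_store(struct cache_line *line, unsigned offset, uint8_t value)
{
    line->data[offset] = value;
    line->dirty = true;
}

/* On replacement, a dirty block must be written back before it is discarded. */
void evict(struct cache_line *line)
{
    if (line->valid && line->dirty)
        write_to_memory(line->tag, line->data);
    line->valid = false;
    line->dirty = false;
}
```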
While individual cache mechanics dictate base operating speed, evaluating overall processor efficiency requires integrating cache metrics into global performance equations.
Measuring and Improving Cache Performance
Cache misses stall the processor pipeline. Overall CPU time is calculated by factoring in these stalls:
CPU_time = (CPU_execution_cycles + Memory_stall_cycles) \times Clock_cycle_time
Memory_stall_cycles = Memory_accesses \times Miss_rate \times Miss_penalty
To evaluate alternative cache designs independently of the rest of the processor, designers use Average Memory Access Time (AMAT):
AMAT = Hit_time + Miss_rate \times Miss_penalty
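For example, with an assumed 1 ns hit time, a 5% miss rate, and a 100 ns miss penalty:
AMAT = 1 ns + 0.05 \times 100 ns = 6 ns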
Cache design manipulates placement flexibility (associativity) to reduce miss rates:
- Direct-mapped (1-way): One location per block. High collision risk, fast hit time.
- Set-associative (n-way): The cache is divided into sets; a block maps to a specific set (via the index) but can be placed in any of the blocks within that set. Requires parallel tag comparators.
- Fully associative: A block can be placed anywhere. Requires full parallel comparison of all cache tags, making it practical only for very small caches or TLBs.
When an associative cache is full, a replacement policy determines eviction. The Least Recently Used (LRU) policy evicts the block that has been unused for the longest time, exploiting temporal locality.
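A sketch of a 2-way set-associative lookup with one LRU bit per set (the way and cache_set structures are assumptions for illustration): a hit points the LRU bit at the other way, and a miss fills the way named by the LRU bit.

```c
#include <stdbool.h>
#include <stdint.h>

struct way       { bool valid; uint64_t tag; };
struct cache_set {
    struct way way[2];
    unsigned   lru;          /* index of the least recently used way */
};

/* Returns the hit way (0 or 1) or -1 on a miss; the data fetch from the
 * lower level on a miss is omitted. */
int lookup(struct cache_set *set, uint64_t tag)
{
    for (int w = 0; w < 2; w++) {
        if (set->way[w].valid && set->way[w].tag == tag) {
            set->lru = 1 - w;               /* the other way is now LRU */
            return w;
        }
    }
    int victim = set->lru;                  /* LRU replacement on a miss */
    set->way[victim].valid = true;
    set->way[victim].tag   = tag;
    set->lru = 1 - victim;
    return -1;
}
```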
Multilevel Caches decouple conflicting cache design goals:
- L1 Caches: Small, direct-mapped or low-associativity to minimize hit time and match fast processor clock cycles.
- L2/L3 Caches: Much larger and more highly associative to minimize miss rate and avoid the massive penalty of DRAM access. Evaluated via the global miss rate (the fraction of all processor memory accesses that miss in every cache level) rather than the local miss rate (misses divided by the accesses that reach that level).
Software Optimization (Blocking): Programmers improve cache performance by restructuring algorithms. Cache blocking operates on submatrices (blocks) rather than entire rows or columns, maximizing temporal and spatial locality before data is evicted, heavily benefiting dense matrix multiplication.
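A sketch of cache blocking for C = C + A*B on n-by-n row-major matrices (the block size BLK is an assumed tuning parameter, chosen so that three BLK-by-BLK submatrices fit in the cache):

```c
#define BLK 32  /* assumed block size: 3 * BLK * BLK doubles should fit in the cache */

/* Blocked matrix multiply: each BLK x BLK submatrix is reused many times
 * while it is cache-resident (temporal locality) and scanned along rows
 * (spatial locality), instead of streaming entire rows and columns. */
void dgemm_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BLK)
        for (int jj = 0; jj < n; jj += BLK)
            for (int kk = 0; kk < n; kk += BLK)
                for (int i = ii; i < ii + BLK && i < n; i++)
                    for (int j = jj; j < jj + BLK && j < n; j++) {
                        double cij = C[i * n + j];
                        for (int k = kk; k < kk + BLK && k < n; k++)
                            cij += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = cij;
                    }
}
```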
While caches accelerate main memory access, virtual machines and virtual memory abstract and protect the entire physical address space.
Virtual Machines and Protection
System Virtual Machines present the illusion that a user has an entire computer, allowing multiple distinct operating systems to share one physical hardware platform.
- The Virtual Machine Monitor (VMM) or hypervisor manages resource mapping, executing at a higher privilege level than the guest OS.
- Hardware must support at least two modes (user and supervisor/kernel), provide read-only access to protected state for user processes, and generate exceptions (traps) when a guest OS attempts privileged operations.
Virtual machines rely inherently on memory protection mechanisms, which are implemented via the virtual memory subsystem.
Virtual Memory and the TLB
Virtual memory uses main memory (DRAM) as a cache for secondary storage (disk/flash). It provides automatic relocation, isolates active processes from one another, and provides the illusion of memory exceeding physical hardware constraints.
- Pages and Page Faults: Virtual memory blocks are called pages. A miss is a page fault, transferring control to the OS to fetch the page from the disk’s swap space.
- Page Table: A data structure residing in main memory that maps Virtual Page Numbers to Physical Page Numbers. Because page faults are enormously expensive, virtual memory utilizes fully associative placement (a virtual page can map to any physical page) and a write-back policy.
- Hierarchical Page Tables: To avoid allocating massive contiguous page tables (e.g., 64 billion entries for a 48-bit address space), architectures like RISC-V use multi-level tree structures, paging the page tables themselves.
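A field-extraction sketch, loosely modeled on RISC-V's Sv39 layout (three 9-bit virtual page number fields above a 12-bit page offset); the constants are assumptions for illustration. Each field indexes one level of the tree, so only the tables actually touched by a process need to be allocated.

```c
#include <stdint.h>

#define PAGE_OFFSET_BITS 12u   /* 4 KiB pages */
#define VPN_BITS_PER_LVL 9u    /* 512 entries per page-table page */

static inline uint64_t page_offset(uint64_t va)
{
    return va & ((1u << PAGE_OFFSET_BITS) - 1);
}

/* Virtual page number field for one level of the walk (level 0 = leaf). */
static inline uint64_t vpn_field(uint64_t va, unsigned level)
{
    return (va >> (PAGE_OFFSET_BITS + level * VPN_BITS_PER_LVL))
           & ((1u << VPN_BITS_PER_LVL) - 1);
}
```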
Translation-Lookaside Buffer (TLB): Because every memory access requires an address translation, reading the page table in main memory on every reference would roughly double the cost of each access. The TLB is a dedicated hardware cache containing recently used virtual-to-physical translations.
- On a TLB hit, the physical address is yielded in a single cycle.
- On a TLB miss, the hardware or OS walks the page table. If the page table indicates the page is on disk, a page fault exception is generated.
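A sketch of the fast path, assuming a hypothetical 64-entry direct-mapped TLB and 4 KiB pages: a hit concatenates the cached physical page number with the untranslated page offset, while a miss falls through to the page-table walk.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES   64u
#define PAGE_OFF_BITS 12u

struct tlb_entry { bool valid; uint64_t vpn; uint64_t ppn; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true and fills *pa on a TLB hit; false means walk the page table. */
bool tlb_translate(uint64_t va, uint64_t *pa)
{
    uint64_t vpn = va >> PAGE_OFF_BITS;
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn) {
        *pa = (e->ppn << PAGE_OFF_BITS) | (va & ((1u << PAGE_OFF_BITS) - 1));
        return true;
    }
    return false;
}
```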
Virtually Addressed Caches vs. Physically Addressed Caches
- If a cache is physically indexed and physically tagged, the TLB must complete translation before cache lookup.
- A virtually addressed cache uses virtual indices and tags, moving the TLB out of the critical path. However, this introduces aliasing: two different virtual addresses can map to the same physical page, so the same data may reside in two cache entries, and an update through one virtual address can be invisible through the other.
- Modern compromise: Virtually indexed, physically tagged caches use the un-translated page offset to index the cache array in parallel with the TLB lookup, then use the TLB’s physical output to compare against the physical tag.
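A quick sizing check under assumed parameters (4 KiB pages, 64-byte blocks): the cache index and block offset together must fit within the 12 untranslated page-offset bits. A 64-byte block uses 6 offset bits, leaving at most 6 index bits (64 sets), so:
Max_cache_size = 64 sets \times 64 bytes \times Associativity (e.g., 32 KiB for an 8-way cache)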
The principles governing pages, cache blocks, and TLB entries can be generalized into a single framework for classifying misses.
The Three Cs Model
All misses in any level of a memory hierarchy can be classified into three intuitive categories:
- Compulsory misses (Cold-start): The very first access to a block that has never been loaded. Mitigated by larger block sizes or prefetching.
- Capacity misses: The cache cannot contain all the blocks needed during the execution of a program, forcing evictions of blocks that will be needed later. Mitigated by increasing total cache size.
- Conflict misses (Collision): Occur in direct-mapped or set-associative caches when multiple active blocks map to the exact same set, causing contention. Mitigated by increasing associativity.
This hierarchical caching logic breaks down when multiple processing cores attempt to mutate the same physical memory space independently.
Cache Coherence
In a multicore shared-memory multiprocessor, each processor has its own private L1/L2 cache. If Processor A writes to a memory location X and Processor B subsequently reads X, Processor B may read a stale value from its private cache. This is the cache coherence problem.
- Coherence ensures that any read to a data item returns the most recently written value. It requires write serialization: all writes to the same location must be observed in the same order by all processors.
- Consistency determines when a written value becomes visible to other processors, governed by the memory consistency model.
Snooping Protocols: The most common hardware solution. All cache controllers monitor (snoop) a broadcast medium (bus or network) to track the sharing status of blocks.
- When a processor writes to a shared block, it broadcasts an invalidation signal.
- Other caches snooping the address automatically invalidate their copies. The writing processor becomes the exclusive owner.
- If a processor encounters a read miss on an address modified exclusively in another processor’s cache, the owning cache intercepts the read, supplies the dirty data, and changes its state back to shared.
- False sharing occurs when two unrelated shared variables reside in the exact same cache block, causing the block to bounce back and forth between processors (thrashing) even though no actual data is shared.
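A minimal sketch of false sharing using POSIX threads and an assumed 64-byte cache block (the structures and iteration count are illustrative): the two counters are logically independent, but because they share one block, each increment invalidates the other core's copy; padding each counter onto its own block removes the ping-ponging.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NITERS 10000000L

/* Falsely shared: a and b occupy the same (assumed) 64-byte cache block,
 * so every write by one thread invalidates the block in the other core's
 * cache even though no datum is actually shared. */
struct { volatile uint64_t a, b; } counters;

/* The usual fix: pad so each counter sits in its own cache block. */
struct { volatile uint64_t a; char pad[56]; volatile uint64_t b; } padded;

static void *bump_a(void *arg) { (void)arg; for (long i = 0; i < NITERS; i++) counters.a++; return NULL; }
static void *bump_b(void *arg) { (void)arg; for (long i = 0; i < NITERS; i++) counters.b++; return NULL; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Switching the thread bodies to increment padded.a and padded.b instead typically restores near-ideal scaling, since the two counters then occupy separate blocks.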