Advanced Address Translation

Virtual memory gives programmers a flat, linear address space, enabling portability, isolation, security, and multiprogramming. The cost is that every instruction fetch and data access may require virtual-to-physical translation. The central design problem is: how do we preserve the software benefits of virtual memory while minimizing the performance and energy cost of address translation?

The answer in modern chips is a hierarchy of translation structures: split L1 iTLB/dTLBs, unified private L2 TLBs, hardware page-table walkers, MMU caches for upper page-table levels, nested TLBs for virtualization, and data caches that incidentally cache page-table entries.

Evolution

Older systems trapped TLB misses into the OS: the pipeline drained, architectural state was saved, OS code walked the page table in software, installed the translation, and resumed the application. Modern processors add hardware page-table walkers and deeper TLB hierarchies to eliminate this cost. Per-core MMUs now contain L1 iTLB/dTLBs, L2 TLBs, PTWs, MMU caches, and virtualization-specific nested TLBs.

L1 TLBs

Split by type (instruction vs. data): Instruction fetch and data access occur concurrently; separate structures eliminate port contention. Instruction misses are more critical because they stall the front end directly, while some data misses can be hidden by out-of-order execution.

Split by page size: On x86-64, page sizes are 4 KB, 2 MB, and 1 GB, with offset widths of 12, 21, and 30 bits respectively. Which bits of the virtual address form the page number depends on the page size, and the page size is unknown before translation completes. A single fast set-associative L1 supporting all sizes is therefore difficult. The solution is separate L1 TLBs per page size probed in parallel; a translation resides in exactly one.
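
As a minimal sketch (assuming 48-bit x86-64 virtual addresses and the offset widths above), the page number used to probe each per-size L1 TLB is simply the address shifted right by that size's offset width:

    /* Which bits form the virtual page number depends on the page size,
     * which is why a single L1 TLB indexed before the size is known is
     * awkward. The address below is a hypothetical example. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t va = 0x00007f12345678abULL;   /* hypothetical virtual address */
        uint64_t vpn_4k = va >> 12;            /* 12-bit offset -> 4 KB page  */
        uint64_t vpn_2m = va >> 21;            /* 21-bit offset -> 2 MB page  */
        uint64_t vpn_1g = va >> 30;            /* 30-bit offset -> 1 GB page  */
        printf("VPN(4K)=%#llx VPN(2M)=%#llx VPN(1G)=%#llx\n",
               (unsigned long long)vpn_4k,
               (unsigned long long)vpn_2m,
               (unsigned long long)vpn_1g);
        return 0;
    }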

TLB Reach: A 10 GB working set requires ~2.6 million 4 KB translations but only 10 translations using 1 GB pages. Larger pages drastically increase the memory covered per TLB entry.
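
The arithmetic behind this example, as a small worked snippet (the 10 GB working set is the one assumed above):

    /* Translations needed to cover a 10 GB working set at each x86-64 page size. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t ws = 10ULL << 30;                         /* 10 GB working set */
        uint64_t sizes[] = { 4096, 2ULL << 20, 1ULL << 30 };
        const char *names[] = { "4 KB", "2 MB", "1 GB" };
        for (int i = 0; i < 3; i++)                        /* 2621440, 5120, 10 */
            printf("%s pages: %llu translations\n", names[i],
                   (unsigned long long)((ws + sizes[i] - 1) / sizes[i]));
        return 0;
    }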

L2 TLBs

Unified across instruction and data, private to a core. Slower to access than the L1 TLBs but with much higher capacity. Key design dimensions: access time, hit rate, and multi-page-size support.

Inclusion policies between L1 and L2:

  • Mostly Inclusive: Both levels fill on a page-table-walk result; evictions are independent. A translation may remain in L1 after L2 evicts it.
  • Strictly Inclusive: L2 eviction sends a back-invalidation to L1 (see the sketch after this list). L2 acts as a filter for translation-coherence probes — absence in L2 proves absence in L1.
  • Exclusive: A filled translation goes only to L1; it moves to L2 on L1 eviction. Improves effective capacity but complicates movement policy.
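
A minimal sketch of the strictly inclusive policy, assuming tiny fully associative arrays; sizes and names are hypothetical, and victim selection is left to the caller:

    /* Whenever L2 evicts a translation, it back-invalidates L1 so that
     * "not in L2" provably means "not in L1". */
    #include <stdint.h>
    #include <stdio.h>

    #define L1_ENTRIES 4
    #define L2_ENTRIES 8
    static uint64_t l1[L1_ENTRIES], l2[L2_ENTRIES];    /* cached VPNs; 0 = invalid */

    static void l1_invalidate(uint64_t vpn) {
        for (int i = 0; i < L1_ENTRIES; i++)
            if (l1[i] == vpn) l1[i] = 0;
    }

    static void l2_evict(int victim) {
        if (l2[victim]) l1_invalidate(l2[victim]);      /* back-invalidation */
        l2[victim] = 0;
    }

    int main(void) {
        l1[0] = l2[0] = 0x1234;        /* both levels fill on a walk result */
        l2_evict(0);                   /* L2 eviction also removes the L1 copy */
        printf("L1 copy after L2 evict: %#llx\n", (unsigned long long)l1[0]);
        return 0;
    }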

Translation Coherence: When the OS modifies a page-table entry, all stale TLB copies across cores must be invalidated. Strict inclusion reduces L1 probe traffic because L2 absence implies L1 absence.

Multi-Page-Size Support in L2 TLBs

  • Hash-Rehashing: Probes the TLB assuming one page size and retries with another on a miss (sketched after this list). Simple, but produces variable hit latency and slow miss detection.
  • Page-Size Prediction: Predicts the page size before lookup using PC or address bits. Reduces hit latency but costs area and can suffer aliasing.
  • Parallel Lookup: Probes all page-size interpretations simultaneously. Uniform hit latency and fast miss detection, but wastes energy since only one page size can be correct.
  • Parallel Page-Table Walks: Speculatively starts a walk while rehash probes continue, partially hiding miss-detection delay.
  • Skewed TLBs: Uses different hash functions per way to reduce conflicts; subsets of ways can be assigned to different page sizes. More complex hashing and reduced effective associativity as more page sizes are added.
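
A minimal sketch of hash-and-rehash lookup, assuming a direct-mapped 128-set L2 TLB and only the 4 KB and 2 MB sizes; names and sizes are illustrative:

    /* The TLB is probed first assuming a 4 KB page and re-probed assuming
     * 2 MB on a miss, so hit latency varies with page size and a true miss
     * is only known after the last probe. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdbool.h>

    #define SETS 128
    struct entry { uint64_t vpn; int shift; bool valid; };   /* shift: 12 or 21 */
    static struct entry tlb[SETS];                           /* direct-mapped for brevity */

    static bool probe(uint64_t va, int shift) {
        uint64_t vpn = va >> shift;
        struct entry *e = &tlb[vpn % SETS];
        return e->valid && e->shift == shift && e->vpn == vpn;
    }

    static bool lookup(uint64_t va) {
        if (probe(va, 12)) return true;    /* first probe: assume 4 KB page */
        if (probe(va, 21)) return true;    /* rehash: assume 2 MB page      */
        return false;                      /* miss is known only now        */
    }

    int main(void) {
        uint64_t va = 0x40212345;          /* hypothetical address in a 2 MB page */
        tlb[(va >> 21) % SETS] = (struct entry){ va >> 21, 21, true };
        printf("hit=%d\n", lookup(va));    /* hits on the second (rehash) probe */
        return 0;
    }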

Page-Table Walks

Software-Managed: TLB miss traps into the OS. Pipeline drains, architectural state is saved, OS searches the page table, fills the TLB, and resumes. Flexible (supports any page-table organization) but expensive — hundreds of cycles due to context switching, cache pollution, and branch predictor pollution.

Hardware-Managed: A per-core hardware page-table walker (PTW) contains a state machine for the architecture’s page-table format and MSHR-like buffers tracking outstanding translation misses. On x86-64, the walker uses CR3 as the root and extracts 9-bit indices through L4/L3/L2/L1 page-table levels.
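
A minimal sketch of the index extraction such a walker performs, assuming 48-bit virtual addresses and a 4 KB leaf page; the actual memory accesses and CR3 handling are omitted:

    /* Four 9-bit indices (L4..L1) plus a 12-bit offset are sliced out of the
     * virtual address; each index selects one entry in the next table. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t va = 0x00007f00dead1000ULL;     /* hypothetical virtual address */
        unsigned l4  = (va >> 39) & 0x1ff;       /* bits 47..39 */
        unsigned l3  = (va >> 30) & 0x1ff;       /* bits 38..30 */
        unsigned l2  = (va >> 21) & 0x1ff;       /* bits 29..21 */
        unsigned l1  = (va >> 12) & 0x1ff;       /* bits 20..12 */
        unsigned off = va & 0xfff;               /* bits 11..0  */
        printf("L4=%u L3=%u L2=%u L1=%u offset=%#x\n", l4, l3, l2, l1, off);
        return 0;
    }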

Hardware walkers improve performance by:

  • Avoiding OS context switches on ordinary TLB misses.
  • Allowing out-of-order cores to execute independent instructions during a walk.
  • Supporting multiple concurrent TLB misses (important for large ROBs, SMT, and accelerators).

The cost is reduced flexibility: the walker is built for a specific page-table format.

Virtualization

In a virtualized system, a guest virtual address must first translate to a guest physical address, and then that guest physical address must translate to a system physical address. With four-level x86-64-style guest and nested page tables, the walker may need up to 24 memory references per translation — this is why virtualization receives specialized hardware support such as nested TLBs and MMU caches.
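
The count can be reproduced with a short formula: each of the guest's page-table pointers must itself be translated by the nested tables. With g guest levels and h nested levels this gives g*(h+1) + h = (g+1)*(h+1) - 1 references:

    #include <stdio.h>

    int main(void) {
        int g = 4, h = 4;             /* guest and nested page-table levels */
        /* Each of the g guest levels costs h nested references to translate the
         * guest-physical pointer plus 1 reference to read the guest entry, and
         * the final guest-physical address needs one more h-level nested walk. */
        int refs = g * (h + 1) + h;   /* = (g+1)*(h+1) - 1 = 24 */
        printf("memory references per 2D walk: %d\n", refs);
        return 0;
    }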

MMU Caches

TLBs cache leaf translations, but upper page-table levels also have high temporal locality (a single L4 entry covers 512 GB of virtual address space). MMU caches exploit this by caching non-leaf entries close to the walker.

  • Page Walk Caches: Physically tagged; they cache page-table entries by the physical address of the page-table location. Simple, but lookup is sequential: the L4 result is needed to form the L3 lookup, L3 for L2, and so on.
  • Paging Structure Caches: Virtually indexed/tagged by page-table indices from the virtual address. L4, L3, and L2 entries can be searched independently or in parallel; the walker selects the longest matching prefix (Intel-style; sketched after this list).
  • Translation Path Caches: Compress a whole upper-level path into one entry, storing multiple intermediate physical page numbers together. Reduces redundancy compared to storing separate L4/L3/L2 entries that share prefixes.
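
A minimal sketch of a paging-structure-cache lookup, assuming a small fully associative structure and a serialized longest-prefix search (a real design probes the per-level caches in parallel; all names and sizes are illustrative):

    /* Entries are tagged by upper-level page-table indices taken from the
     * virtual address; a hit lets the walker skip the levels it covers. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdbool.h>

    struct psc_entry { int depth; unsigned idx[3]; uint64_t next_table_pa; bool valid; };
    #define PSC_ENTRIES 16
    static struct psc_entry psc[PSC_ENTRIES];

    /* Returns how many upper levels (1..3) are covered; 0 means start at CR3. */
    static int psc_lookup(uint64_t va, uint64_t *next_table_pa) {
        unsigned idx[3] = { (unsigned)(va >> 39) & 0x1ff,     /* L4 index */
                            (unsigned)(va >> 30) & 0x1ff,     /* L3 index */
                            (unsigned)(va >> 21) & 0x1ff };   /* L2 index */
        for (int depth = 3; depth >= 1; depth--)              /* longest prefix first */
            for (int i = 0; i < PSC_ENTRIES; i++) {
                if (!psc[i].valid || psc[i].depth != depth) continue;
                bool match = true;
                for (int d = 0; d < depth; d++)
                    if (psc[i].idx[d] != idx[d]) { match = false; break; }
                if (match) { *next_table_pa = psc[i].next_table_pa; return depth; }
            }
        return 0;
    }

    int main(void) {
        /* Hypothetical entry covering the L4+L3 indices; points at the L2 table. */
        psc[0] = (struct psc_entry){ 2, { 0xfe, 0x3, 0 }, 0x42000, true };
        uint64_t va = ((uint64_t)0xfe << 39) | ((uint64_t)0x3 << 30) | (0x7 << 21);
        uint64_t pa = 0;
        int skipped = psc_lookup(va, &pa);
        printf("levels skipped=%d, walk resumes at table %#llx\n",
               skipped, (unsigned long long)pa);
        return 0;
    }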

MMU Integration with the Memory Hierarchy

Load/Store Queues: Page offset bits are identical before and after translation and can be used early to compare loads against older stores. Full physical-address comparison waits until TLB results return, enabling store-to-load forwarding and detection of mis-speculated memory dependencies.
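
A minimal sketch of the two-step check, assuming 4 KB pages and 64-byte lines; the early comparison uses only untranslated offset bits, and the names and structure are illustrative:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdbool.h>

    /* Early, pre-translation check: identical page offsets mean the accesses
     * might alias, so the load must wait or be re-checked later. */
    static bool offsets_may_alias(uint64_t va_load, uint64_t va_store) {
        return (va_load & 0xfff) == (va_store & 0xfff);
    }

    /* Final check once both translations are available (line granularity here). */
    static bool same_physical_line(uint64_t pa_load, uint64_t pa_store) {
        return (pa_load >> 6) == (pa_store >> 6);
    }

    int main(void) {
        uint64_t va_ld = 0x7000123, va_st = 0x9000123;    /* same page offset */
        printf("early match=%d\n", offsets_may_alias(va_ld, va_st));            /* 1 */
        /* hypothetical translations that land on different physical lines */
        printf("final match=%d\n", same_physical_line(0x11000123, 0x22000123)); /* 0 */
        return 0;
    }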

VIPT L1 Caches: TLB lookup overlaps with cache set selection using page-offset bits. On x86-64 with 4 KB pages and 64-byte lines: 6 bits for block offset + at most 6 bits for index = 12 page-offset bits. This caps a straightforward VIPT L1 at 64 sets unless associativity increases or more advanced techniques are used.
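
The constraint as a small worked check (the 32 KB, 8-way configuration in the comment is only an example):

    /* Index bits + block-offset bits must fit inside the 12-bit page offset,
     * which caps a straightforward VIPT L1 at 64 sets with 64-byte lines. */
    #include <stdio.h>

    int main(void) {
        int page_offset_bits  = 12;     /* 4 KB pages    */
        int block_offset_bits = 6;      /* 64-byte lines */
        int max_index_bits = page_offset_bits - block_offset_bits;   /* 6  */
        int max_sets = 1 << max_index_bits;                          /* 64 */
        /* e.g. a 32 KB, 8-way L1 with 64 B lines: 32768 / (8 * 64) = 64 sets */
        printf("max sets = %d\n", max_sets);
        return 0;
    }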

TLB Miss Flow:

  1. On an L1 TLB miss, the overlapping L1 cache access may abort while the L2 TLB is probed.
  2. L2 TLB hit → fills L1 TLB, memory instruction replays.
  3. L2 TLB miss → PTW consults MMU caches, then the data-cache hierarchy (page tables are data structures).

Fault Types:

  • Protection Fault: Permissions disallow access. OS mechanisms such as copy-on-write may intentionally use write protection to trigger controlled page copying.
  • Minor Page Fault: Page is resident in memory but not yet mapped in the process page table.
  • Major Page Fault: Page must be fetched from secondary storage — much more expensive.

Translation Prefetching

  • Cache-Line Prefetching: A 64-byte cache line holds eight 8-byte PTEs on x86-64. A walk for one virtual page implicitly brings neighboring PTEs into cache. Standard cache prefetchers can also prefetch page-table cache lines.
  • Sequential TLB Prefetching: Fetches translations for adjacent virtual pages, often into a separate prefetch buffer to avoid polluting the TLB. Aggressive prefetching improves hit rate but creates bandwidth pressure.
  • Arbitrary-Stride Prefetching: Generalizes beyond ±1 page strides by tracking the stride behavior of memory instructions indexed by PC (sketched after this list).
  • Recency-Based Prefetching: Maintains an LRU-like stack of translations across the TLB and page table. Page-table entries gain previous/next pointers so that a miss can prefetch translations adjacent in recency rather than virtual-address order. Captures temporal patterns but increases page-table storage and cache footprint.
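
A minimal sketch of the arbitrary-stride scheme, assuming a small PC-indexed table and a confidence threshold of two equal strides; the table size, threshold, and names are illustrative:

    /* Each static memory instruction tracks the last virtual page it touched
     * and the last page stride; repeated strides trigger a prefetch of the
     * next page's translation (e.g., into a separate prefetch buffer). */
    #include <stdint.h>
    #include <stdio.h>

    #define TABLE 64
    struct stride_entry { uint64_t last_vpn; int64_t stride; int confidence; };
    static struct stride_entry table_[TABLE];

    /* Returns the VPN to prefetch, or 0 if no confident stride yet. */
    static uint64_t observe(uint64_t pc, uint64_t vpn) {
        struct stride_entry *e = &table_[(pc >> 2) % TABLE];
        int64_t stride = (int64_t)(vpn - e->last_vpn);
        e->confidence = (stride == e->stride) ? e->confidence + 1 : 0;
        e->stride = stride;
        e->last_vpn = vpn;
        return (e->confidence >= 2) ? vpn + stride : 0;
    }

    int main(void) {
        uint64_t pc = 0x401000;                              /* hypothetical load PC */
        for (uint64_t vpn = 100; vpn <= 112; vpn += 4) {     /* stride of 4 pages    */
            uint64_t pf = observe(pc, vpn);
            if (pf) printf("access VPN %llu -> prefetch translation for VPN %llu\n",
                           (unsigned long long)vpn, (unsigned long long)pf);
        }
        return 0;
    }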

Replay (Translation-Triggered) Prefetching

When a page-table walk has to go to the LLC or DRAM, the replayed memory instruction often also misses in the LLC or goes to DRAM — a cold translation usually points to cold data. Replay prefetching exploits the gap between finding the PTE and replaying the instruction.

Hardware is added at the memory controller and the PTW: when the controller reads a PTE from DRAM, it learns the physical frame number, and the PTW supplies the cache-line offset. Combined, these prefetch the data line the replay will need into the row buffer and LLC.

  • Not speculative about the target address: the replay address is known once the translation is found, reducing wrong-address pollution and wasted bandwidth.
  • Potential savings: ~100–150+ cycles when the replay becomes an LLC hit instead of a DRAM access.
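
A minimal sketch of how the prefetch address is formed, assuming 4 KB pages and 64-byte lines; the values are placeholders:

    /* The controller learns the physical frame number when it reads the PTE,
     * and the walker already knows which line within the page the stalled
     * instruction needs, so the two combine without address speculation. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t pfn          = 0x1a2b3c;      /* from the PTE just read in DRAM */
        uint64_t line_in_page = 0x7;           /* from the walker / faulting VA  */
        uint64_t prefetch_addr = (pfn << 12) | (line_in_page << 6);
        printf("prefetch data line at %#llx\n", (unsigned long long)prefetch_addr);
        return 0;
    }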

Coalesced TLBs

Adjacent virtual pages often map to adjacent physical frames (translation contiguity). Superpages exploit maximum contiguity, but OSes struggle to create large contiguous regions on fragmented long-running systems. Intermediate contiguity — tens to hundreds of pages — often exists due to buddy allocators and compaction.

A coalesced TLB entry stores a base translation plus a count or bitmap for adjacent translations. On lookup, the offset from the base virtual page is applied to the base physical frame. Unlike superpages, strict 2 MB/1 GB alignment is not required.

  • Complete Sub-Blocked TLBs: Store multiple physical frame numbers per entry for a group of virtual pages. Higher reach, more storage cost.
  • Partial Sub-Blocked TLBs: Use alignment/offset restrictions and a bit vector for clustered mappings with imperfect sequential ordering.

Coalescing is done on the fill path, not the lookup path, to preserve lookup latency. When a walk fetches a cache line of PTEs, combinational logic scans for contiguous translations and fills a coalesced entry — simultaneously acting as translation prefetching.
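
A minimal sketch of the fill-path scan and the resulting coalesced lookup, assuming one PTE cache line of eight frames and a simple base-plus-count entry format; names and formats are illustrative:

    /* The fill logic scans the fetched line for frames contiguous with the
     * requested one and records the run as a single entry; a later lookup
     * applies the virtual-page delta from the base to the base frame. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdbool.h>

    struct coalesced { uint64_t base_vpn, base_pfn; unsigned count; };

    /* Scan the 8 PTEs of one page-table cache line starting at the requested slot. */
    static struct coalesced coalesce(uint64_t base_vpn, const uint64_t pfns[8], unsigned slot) {
        struct coalesced e = { base_vpn, pfns[slot], 1 };
        for (unsigned i = slot + 1; i < 8 && pfns[i] == pfns[i - 1] + 1; i++)
            e.count++;                           /* extend while frames are contiguous */
        return e;
    }

    static bool lookup(const struct coalesced *e, uint64_t vpn, uint64_t *pfn) {
        if (vpn < e->base_vpn || vpn >= e->base_vpn + e->count) return false;
        *pfn = e->base_pfn + (vpn - e->base_vpn); /* apply the delta to the base frame */
        return true;
    }

    int main(void) {
        uint64_t pfns[8] = { 900, 300, 301, 302, 303, 77, 78, 500 };
        struct coalesced e = coalesce(0x1001, pfns, 1);    /* miss hit slot 1 of the line */
        uint64_t pfn;
        printf("coalesced %u pages\n", e.count);           /* 4 */
        if (lookup(&e, 0x1003, &pfn))
            printf("VPN 0x1003 -> PFN %llu\n", (unsigned long long)pfn);   /* 302 */
        return 0;
    }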

MIX TLBs

MIX TLBs build set-associative structures that concurrently cache multiple page sizes. Small-page index bits are used for all translations, including superpages. Since those bits lie inside a superpage’s offset, a superpage must be mirrored into multiple sets.
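
A minimal sketch of the mirroring effect, assuming a hypothetical 16-set TLB indexed by the low bits of the 4 KB virtual page number:

    /* A 2 MB superpage contains 512 small pages, so its small-page index bits
     * take every value and the entry must appear in every set. */
    #include <stdint.h>
    #include <stdio.h>

    #define SETS 16

    int main(void) {
        uint64_t super_va = 0x40000000;                  /* 2 MB-aligned superpage   */
        int used[SETS] = { 0 };
        for (uint64_t off = 0; off < (2ULL << 20); off += 4096) {
            uint64_t small_vpn = (super_va + off) >> 12; /* index as if 4 KB pages   */
            used[small_vpn % SETS] = 1;
        }
        int mirrors = 0;
        for (int i = 0; i < SETS; i++) mirrors += used[i];
        printf("superpage mirrored into %d of %d sets\n", mirrors, SETS);  /* 16 of 16 */
        return 0;
    }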

Mirroring reduces effective capacity. MIX TLBs counteract this by coalescing contiguous superpages: if adjacent superpages are also contiguous in physical memory, the TLB can coalesce and mirror them efficiently. Performance depends on superpage contiguity; with 16–128 sets, having roughly that many contiguous superpages can offset the mirroring cost.