Intel Core i9 12900 (Alder Lake) Memory Hierarchy Design

Architecture Overview

  • Microarchitecture: Alder Lake (a hybrid design combining performance and efficiency cores).
  • Core Configuration: 8 Performance cores (P-cores) and 8 Efficiency cores (E-cores). P-cores support multithreading and reach peak clock rates of 5.1 GHz in Turbo Boost mode.
  • Bandwidth Demands: Capable of generating up to four 128-bit data memory references per core per clock cycle (64 bytes per core per clock). At 3 GHz, the 8 P-cores can demand a peak data bandwidth of 1536 GB/s.
  • Main Memory Interface: Dual memory channels supporting DDR4 or DDR5. Maximum bandwidth with DDR5-4800 is 77 GB/s.
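The gap between core-side demand and DRAM supply can be sanity-checked directly. A minimal sketch that recomputes both from the parameters stated above (4 × 128-bit references per clock, 3 GHz, 8 P-cores, dual-channel DDR5-4800); the resulting ratio is why the cache hierarchy must absorb almost all references:

```python
# Peak core-side bandwidth demand vs. DRAM supply, recomputed from the
# parameters stated above (4 x 128-bit data refs/clock, 3 GHz, 8 P-cores).

refs_per_clock = 4            # 128-bit data memory references per clock
bytes_per_ref = 128 // 8      # 16 bytes each
clock_hz = 3e9                # assumed sustained 3 GHz clock
p_cores = 8

demand_gb_s = refs_per_clock * bytes_per_ref * clock_hz * p_cores / 1e9
print(f"peak demand : {demand_gb_s:.0f} GB/s")

# Supply: dual-channel DDR5-4800, 8 bytes per transfer per channel.
supply_gb_s = 2 * 4800e6 * 8 / 1e9
print(f"DRAM supply : {supply_gb_s:.1f} GB/s")
print(f"gap         : {demand_gb_s / supply_gb_s:.0f}x")
```

Even with only the P-cores counted, demand exceeds what main memory can deliver by a factor of 20, so the hierarchy must filter the vast majority of references.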

Address Space and Translation Lookaside Buffer (TLB)

  • Address Space: 48-bit virtual address space mapped to a 36-bit physical address space.
  • Translation Hierarchy: Utilizes a two-level TLB structure.
    • Instruction TLB (L1 I-TLB):
      • 256 entries for 4 KiB pages; 32 entries for 2/4 MiB pages.
      • 8-way set associative; Pseudo-LRU replacement.
      • 1-cycle access latency.
    • Data TLB (L1 D-TLB):
      • 96 entries for 4 KiB pages; 32 entries for 2/4 MiB pages; 8 entries for 1 GiB pages.
      • 6-way, 4-way, and 8-way set associative (varies by page size); Pseudo-LRU replacement.
      • 1-cycle access latency. Includes separate structures for loads and stores.
    • Second-Level TLB (STLB):
      • 2048 entries supporting 4 KiB, 2 MiB, and 1 GiB pages.
      • 16-way set associative; Pseudo-LRU replacement.
      • 8-cycle access latency (9-cycle total miss penalty from L1 TLB).
  • Page Table Walker: Hardware-based page table walker handles STLB misses, supporting up to four parallel page table walks.
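On an STLB miss, the walker traverses the page tables in hardware. A minimal sketch of the address carving involved, assuming standard x86-64 4-level paging (9 index bits per table level, 12-bit offset for 4 KiB pages) and the STLB geometry listed above (2048 entries / 16 ways = 128 sets); the helper names are illustrative, not hardware interfaces:

```python
def split_va(va: int):
    """Split a 48-bit virtual address into 4 page-table indices + offset."""
    offset = va & 0xFFF                    # bits 0-11: page offset (4 KiB)
    levels = [(va >> s) & 0x1FF            # 9 index bits per table level
              for s in (39, 30, 21, 12)]   # PML4, PDPT, PD, PT
    return levels, offset

def stlb_set(va: int) -> int:
    """Set index in a 2048-entry, 16-way STLB (128 sets, 7 index bits)."""
    vpn = va >> 12                         # virtual page number
    return vpn & 0x7F                      # low 7 VPN bits select the set

# Build an address with known fields so the split is visible.
va = (3 << 39) | (5 << 30) | (7 << 21) | (9 << 12) | 0xABC
print(split_va(va))    # ([3, 5, 7, 9], 2748)
print(stlb_set(va))    # 9
```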

Cache Hierarchy Specifications

All cache levels utilize a 64-byte block size and employ write-back policies (where applicable).

  • L1 Instruction Cache (L1I):
    • Size: 32 KiB per core.
    • Associativity: 8-way set associative.
    • Latency: 4 clock cycles.
    • Addressing: Virtually indexed, physically tagged.
  • L1 Data Cache (L1D):
    • Size: 48 KiB per core.
    • Associativity: 6-way set associative (dual-ported).
    • Latency: 5 clock cycles.
    • Addressing: Virtually indexed, physically tagged.
  • L2 Cache:
    • Size: 1.25 MiB per P-core.
    • Associativity: 10-way set associative.
    • Latency: 15 clock cycles.
    • Addressing: Physically indexed, physically tagged.
    • Replacement: Weighted n-bit LRU.
  • L3 Cache (Last Level Cache - LLC):
    • Size: 30 MiB shared across all cores.
    • Structure: Distributed into 8 banks. A hash function maps addresses to specific banks.
    • Associativity: 15-way set associative.
    • Latency: 50 clock cycles.
    • Inclusion Policy: Non-inclusive. Primarily holds blocks evicted from the L2 cache.
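The set counts implied by these geometries follow directly from sets = capacity / (block size × associativity). A quick check, with sizes and associativities taken from the list above:

```python
KiB = 1024
BLOCK = 64  # bytes per cache block, all levels

# name: (capacity in bytes, associativity)
caches = {
    "L1I":     (32 * KiB, 8),
    "L1D":     (48 * KiB, 6),
    "L2":      (1280 * KiB, 10),   # 1.25 MiB per P-core
    "L3 bank": (3840 * KiB, 15),   # 30 MiB / 8 banks = 3.75 MiB per bank
}

for name, (size, ways) in caches.items():
    sets = size // (BLOCK * ways)
    index_bits = sets.bit_length() - 1   # sets is a power of two here
    print(f"{name:8s}: {sets:5d} sets, {index_bits} index bits")
```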

Memory Access Flow and Address Calculations

The 48-bit virtual address resolves to a 36-bit physical address. Cache access relies on splitting the physical address into Tag, Index, and Block Offset fields.

  • Block Offset: All caches use 64-byte blocks, requiring a 6-bit offset (2^6 = 64).
  • L1 Instruction Cache Access:
    • Index: 32 KiB / (64 B × 8 ways) = 64 sets, requiring 6 index bits.
    • Tag: 36 − 6 − 6 = 24 bits.
    • Fetch Width: The fetch unit retrieves 32 bytes per cycle. It uses 1 additional bit from the 6-bit block offset to select the correct 32-byte chunk.

  • L2 Cache Access:
    • Index: 1.25 MiB / (64 B × 10 ways) = 2048 sets, requiring 11 index bits.
    • Tag: 36 − 11 − 6 = 19 bits.

  • L3 Cache Access (Per 3.75 MiB Bank):
    • Index: 3.75 MiB / (64 B × 15 ways) = 4096 sets, requiring 12 index bits.
    • Tag: 36 − 12 − 6 = 18 bits.

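Putting the splits together, a small helper can carve a 36-bit physical address into tag / index / offset for any of the geometries above. The sample address is arbitrary; the call uses the L2 case (2048 sets, so 11 index bits):

```python
OFFSET_BITS = 6   # 64-byte blocks at every cache level

def split_paddr(pa: int, index_bits: int):
    """Return (tag, index, offset) fields of a physical address."""
    offset = pa & ((1 << OFFSET_BITS) - 1)
    index = (pa >> OFFSET_BITS) & ((1 << index_bits) - 1)
    tag = pa >> (OFFSET_BITS + index_bits)
    return tag, index, offset

pa = 0x3ABCDE123                           # an arbitrary 36-bit address
tag, index, offset = split_paddr(pa, 11)   # L2: 11 index bits
print(hex(tag), hex(index), hex(offset))

# The three fields always reassemble into the original address:
assert (tag << 17) | (index << 6) | offset == pa
```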
Cache Miss and Write Management

  • L1 Data Cache Write Strategy:
    • Utilizes a no-write-allocate policy on write misses.
    • Store misses bypass cache allocation and are placed directly into a merging write buffer.
  • Merging Write Buffers:
    • Captures dirty cache lines and unallocated write misses.
    • Writes data back to the next memory level when that level is not actively serving a read request.
    • Checked concurrently during a cache miss; if the requested line resides in the write buffer, the miss is filled directly from the buffer.
  • L3 and Main Memory Interaction:
    • A miss at the L3 cache initiates a main memory access.
    • Miss Penalty: ~50 cycles to detect the L3 miss + ~160 cycles DRAM latency (DDR5-4800) ≈ 210 cycles for the first 16 bytes. The remaining 48 bytes transfer at 32 bytes/cycle, taking about 2 additional cycles.
    • Due to the non-inclusive policy, data fetched from main memory is written directly into L1 and L2 caches, bypassing L3 insertion.
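The miss-penalty arithmetic can be recomputed mechanically from the cycle counts and fill rate given in this section (all of these numbers are the document's assumptions, not measurements):

```python
import math

l3_detect = 50          # cycles to determine that the access misses in L3
dram_latency = 160      # DDR5-4800 access latency, in core cycles (assumed)
first_bytes = 16        # critical chunk delivered first
block = 64              # full cache block
fill_rate = 32          # bytes per cycle for the remainder, as stated

first_chunk = l3_detect + dram_latency
tail = math.ceil((block - first_bytes) / fill_rate)
print(f"first {first_bytes} B after ~{first_chunk} cycles, "
      f"full block after ~{first_chunk + tail} cycles")
```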

Prefetching and Performance Metrics

  • Hardware Prefetching: Supported at both L1 and L2 levels, predicting and fetching data from the next level in the hierarchy to hide latency.
  • Miss Rates:
    • Instruction cache miss rates remain very low (<1% for most integer workloads).
    • Memory-intensive workloads generate significant L1D misses (>20 MPKI) and L2 misses (>10 MPKI), placing heavy reliance on the L3 cache and hardware prefetchers.
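These per-level latencies and miss rates combine into an average memory access time (AMAT). A sketch of the standard calculation using the hit latencies from this document; the miss rates below are hypothetical placeholders chosen only to demonstrate the formula, not measured Alder Lake numbers:

```python
# AMAT = L1_hit + m1 * (L2_hit + m2 * (L3_hit + m3 * mem_penalty))
l1_hit, l2_hit, l3_hit = 5, 15, 50   # cycles, from the cache specs above
mem_penalty = 210                    # approx. L3 miss penalty (first chunk)

m1, m2, m3 = 0.05, 0.30, 0.20        # hypothetical local miss rates

amat = l1_hit + m1 * (l2_hit + m2 * (l3_hit + m3 * mem_penalty))
print(f"AMAT ~= {amat:.2f} cycles")
```

With these placeholder rates the average access costs only a few cycles more than an L1 hit, illustrating how effectively the hierarchy and prefetchers hide the ~210-cycle DRAM penalty.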