Intel Core i9 12900 (Alder Lake) Memory Hierarchy Design

Architecture Overview

  • Microarchitecture: Alder Lake (a hybrid design combining performance and efficiency cores).
  • Core Configuration: 8 Performance cores (P-cores) and 8 Efficiency cores (E-cores). P-cores support multithreading and reach peak clock rates of 5.1 GHz in Turbo Boost mode.
  • Bandwidth Demands: Capable of generating up to four 128-bit data memory references per core per clock cycle (64 bytes per core per clock). At 3 GHz, the 8 P-cores can demand a peak data bandwidth of 1536 GB/s.
  • Main Memory Interface: Dual memory channels supporting DDR4 or DDR5. Maximum bandwidth with DDR5-4800 is 77 GB/s.
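The gap between core-side demand and DRAM supply can be sanity-checked directly. A minimal sketch that recomputes both from the parameters stated above (4 × 128-bit references per clock, 3 GHz, 8 P-cores, dual-channel DDR5-4800); the resulting ratio is why the cache hierarchy must absorb almost all references:

```python
# Peak core-side bandwidth demand vs. DRAM supply, recomputed from the
# parameters stated above (4 x 128-bit data refs/clock, 3 GHz, 8 P-cores).

refs_per_clock = 4            # 128-bit data memory references per clock
bytes_per_ref = 128 // 8      # 16 bytes each
clock_hz = 3e9                # assumed sustained 3 GHz clock
p_cores = 8

demand_gb_s = refs_per_clock * bytes_per_ref * clock_hz * p_cores / 1e9
print(f"peak demand : {demand_gb_s:.0f} GB/s")

# Supply: dual-channel DDR5-4800, 8 bytes per transfer per channel.
supply_gb_s = 2 * 4800e6 * 8 / 1e9
print(f"DRAM supply : {supply_gb_s:.1f} GB/s")
print(f"gap         : {demand_gb_s / supply_gb_s:.0f}x")
```

Even with only the P-cores counted, demand exceeds what main memory can deliver by a factor of 20, so the hierarchy must filter the vast majority of references.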

Address Space and Translation Lookaside Buffer (TLB)

  • Address Space: 48-bit virtual address space mapped to a 36-bit physical address space.
  • Translation Hierarchy: Utilizes a two-level TLB structure.
    • Instruction TLB (L1 I-TLB):
      • 256 entries for 4 KiB pages; 32 entries for 2/4 MiB pages.
      • 8-way set associative; Pseudo-LRU replacement.
      • 1-cycle access latency.
    • Data TLB (L1 D-TLB):
      • 96 entries for 4 KiB pages; 32 entries for 2/4 MiB pages; 8 entries for 1 GiB pages.
      • 6-way, 4-way, and 8-way set associative (varies by page size); Pseudo-LRU replacement.
      • 1-cycle access latency. Includes separate structures for loads and stores.
    • Second-Level TLB (STLB):
      • 2048 entries supporting 4 KiB, 2 MiB, and 1 GiB pages.
      • 16-way set associative; Pseudo-LRU replacement.
      • 8-cycle access latency (9-cycle total miss penalty from L1 TLB).
  • Page Table Walker: Hardware-based page table walker handles STLB misses, supporting up to four parallel page table walks.
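On an STLB miss, the walker traverses the page tables in hardware. A minimal sketch of the address carving involved, assuming standard x86-64 4-level paging (9 index bits per table level, 12-bit offset for 4 KiB pages) and the STLB geometry listed above (2048 entries / 16 ways = 128 sets); the helper names are illustrative, not hardware interfaces:

```python
def split_va(va: int):
    """Split a 48-bit virtual address into 4 page-table indices + offset."""
    offset = va & 0xFFF                    # bits 0-11: page offset (4 KiB)
    levels = [(va >> s) & 0x1FF            # 9 index bits per table level
              for s in (39, 30, 21, 12)]   # PML4, PDPT, PD, PT
    return levels, offset

def stlb_set(va: int) -> int:
    """Set index in a 2048-entry, 16-way STLB (128 sets, 7 index bits)."""
    vpn = va >> 12                         # virtual page number
    return vpn & 0x7F                      # low 7 VPN bits select the set

# Build an address with known fields so the split is visible.
va = (3 << 39) | (5 << 30) | (7 << 21) | (9 << 12) | 0xABC
print(split_va(va))    # ([3, 5, 7, 9], 2748)
print(stlb_set(va))    # 9
```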

Cache Hierarchy Specifications

All cache levels utilize a 64-byte block size and employ write-back policies (where applicable).

  • L1 Instruction Cache (L1I):
    • Size: 32 KiB per core.
    • Associativity: 8-way set associative.
    • Latency: 4 clock cycles.
    • Addressing: Virtually indexed, physically tagged.
  • L1 Data Cache (L1D):
    • Size: 48 KiB per core.
    • Associativity: 6-way set associative (dual-ported).
    • Latency: 5 clock cycles.
    • Addressing: Virtually indexed, physically tagged.
  • L2 Cache:
    • Size: 1.25 MiB per P-core.
    • Associativity: 10-way set associative.
    • Latency: 15 clock cycles.
    • Addressing: Physically indexed, physically tagged.
    • Replacement: Weighted n-bit LRU.
  • L3 Cache (Last Level Cache - LLC):
    • Size: 30 MiB shared across all cores.
    • Structure: Distributed into 8 banks. A hash function maps addresses to specific banks.
    • Associativity: 15-way set associative.
    • Latency: 50 clock cycles.
    • Inclusion Policy: Non-inclusive. Primarily holds blocks evicted from the L2 cache.
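The set counts implied by these geometries follow directly from sets = capacity / (block size × associativity). A quick check, with sizes and associativities taken from the list above:

```python
KiB = 1024
BLOCK = 64  # bytes per cache block, all levels

# name: (capacity in bytes, associativity)
caches = {
    "L1I":     (32 * KiB, 8),
    "L1D":     (48 * KiB, 6),
    "L2":      (1280 * KiB, 10),   # 1.25 MiB per P-core
    "L3 bank": (3840 * KiB, 15),   # 30 MiB / 8 banks = 3.75 MiB per bank
}

for name, (size, ways) in caches.items():
    sets = size // (BLOCK * ways)
    index_bits = sets.bit_length() - 1   # sets is a power of two here
    print(f"{name:8s}: {sets:5d} sets, {index_bits} index bits")
```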

Memory Access Flow and Address Calculations

The 48-bit virtual address resolves to a 36-bit physical address. Cache access relies on splitting the physical address into Tag, Index, and Block Offset fields.

  • Block Offset: All caches use 64-byte blocks, requiring a 6-bit offset (2^6 = 64).
  • L1 Instruction Cache Access:
    • Index: 32 KiB / (64 B × 8 ways) = 64 sets, requiring 6 index bits.
    • Tag: 36 − 6 − 6 = 24 bits.
    • Fetch Width: The fetch unit retrieves 32 bytes per cycle. It uses 1 additional bit from the 6-bit block offset to select the correct 32-byte chunk.

  • L2 Cache Access:
    • Index: 1.25 MiB / (64 B × 10 ways) = 2048 sets, requiring 11 index bits.
    • Tag: 36 − 11 − 6 = 19 bits.

  • L3 Cache Access (Per 3.75 MiB Bank):
    • Index: 3.75 MiB / (64 B × 15 ways) = 4096 sets, requiring 12 index bits.
    • Tag: 36 − 12 − 6 = 18 bits.

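Putting the splits together, a small helper can carve a 36-bit physical address into tag / index / offset for any of the geometries above. The sample address is arbitrary; the call uses the L2 case (2048 sets, so 11 index bits):

```python
OFFSET_BITS = 6   # 64-byte blocks at every cache level

def split_paddr(pa: int, index_bits: int):
    """Return (tag, index, offset) fields of a physical address."""
    offset = pa & ((1 << OFFSET_BITS) - 1)
    index = (pa >> OFFSET_BITS) & ((1 << index_bits) - 1)
    tag = pa >> (OFFSET_BITS + index_bits)
    return tag, index, offset

pa = 0x3ABCDE123                           # an arbitrary 36-bit address
tag, index, offset = split_paddr(pa, 11)   # L2: 11 index bits
print(hex(tag), hex(index), hex(offset))

# The three fields always reassemble into the original address:
assert (tag << 17) | (index << 6) | offset == pa
```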
Cache Miss and Write Management

  • L1 Data Cache Write Strategy:
    • Utilizes a no-write-allocate policy on write misses.
    • Store misses bypass cache allocation and are placed directly into a merging write buffer.
  • Merging Write Buffers:
    • Captures dirty cache lines and unallocated write misses.
    • Writes data back to the next memory level when that level is not actively serving a read request.
    • Checked concurrently during a cache miss; if the requested line resides in the write buffer, the miss is filled directly from the buffer.
  • L3 and Main Memory Interaction:
    • A miss at the L3 cache initiates a main memory access.
    • Miss Penalty: ~50 cycles to detect the L3 miss + ~160 cycles DRAM latency (DDR5-4800) ≈ 210 cycles for the first 16 bytes. The remaining 48 bytes transfer at 32 bytes/cycle, taking about 2 additional cycles.
    • Due to the non-inclusive policy, data fetched from main memory is written directly into L1 and L2 caches, bypassing L3 insertion.
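The miss-penalty arithmetic can be recomputed mechanically from the cycle counts and fill rate given in this section (all of these numbers are the document's assumptions, not measurements):

```python
import math

l3_detect = 50          # cycles to determine that the access misses in L3
dram_latency = 160      # DDR5-4800 access latency, in core cycles (assumed)
first_bytes = 16        # critical chunk delivered first
block = 64              # full cache block
fill_rate = 32          # bytes per cycle for the remainder, as stated

first_chunk = l3_detect + dram_latency
tail = math.ceil((block - first_bytes) / fill_rate)
print(f"first {first_bytes} B after ~{first_chunk} cycles, "
      f"full block after ~{first_chunk + tail} cycles")
```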

Prefetching and Performance Metrics

  • Hardware Prefetching: Supported at both L1 and L2 levels, predicting and fetching data from the next level in the hierarchy to hide latency.
  • Miss Rates:
    • Instruction cache miss rates remain very low (<1% for most integer workloads).
    • Memory-intensive workloads generate significant L1D misses (>20 MPKI) and L2 misses (>10 MPKI), placing heavy reliance on the L3 cache and hardware prefetchers.
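These per-level latencies and miss rates combine into an average memory access time (AMAT). A sketch of the standard calculation using the hit latencies from this document; the miss rates below are hypothetical placeholders chosen only to demonstrate the formula, not measured Alder Lake numbers:

```python
# AMAT = L1_hit + m1 * (L2_hit + m2 * (L3_hit + m3 * mem_penalty))
l1_hit, l2_hit, l3_hit = 5, 15, 50   # cycles, from the cache specs above
mem_penalty = 210                    # approx. L3 miss penalty (first chunk)

m1, m2, m3 = 0.05, 0.30, 0.20        # hypothetical local miss rates

amat = l1_hit + m1 * (l2_hit + m2 * (l3_hit + m3 * mem_penalty))
print(f"AMAT ~= {amat:.2f} cycles")
```

With these placeholder rates the average access costs only a few cycles more than an L1 hit, illustrating how effectively the hierarchy and prefetchers hide the ~210-cycle DRAM penalty.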