Intel Core i9-12900 (Alder Lake) Memory Hierarchy Design
Architecture Overview
- Microarchitecture: Alder Lake (hybrid big/little design).
- Core Configuration: 8 Performance cores (P-cores) and 8 Efficiency cores (E-cores). P-cores support simultaneous multithreading (Hyper-Threading) and reach peak clock rates of 5.1 GHz in Turbo Boost mode.
- Bandwidth Demands: Each core can generate up to four 128-bit (16-byte) data memory references per clock cycle, plus roughly 16 bytes of instruction fetch. Across all 16 cores at 3 GHz, this amounts to a peak demand of about 3840 GB/s.
- Main Memory Interface: Dual memory channels supporting DDR4 or DDR5. Maximum bandwidth with DDR5-4800 is 77 GB/s.
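A quick sanity check on these figures. The reconstruction below assumes the 3840 GB/s demand counts all 16 cores, each issuing four 16-byte data references plus a 16-byte instruction fetch per clock; that assumption reproduces the stated number and shows how far demand outstrips what the DRAM interface can supply:

```python
# Peak bandwidth demand vs. dual-channel DDR5-4800 supply.
# Assumption (reconstruction, not from Intel documentation): the demand
# figure counts all 16 cores, each generating four 128-bit (16-byte)
# data references plus a 16-byte instruction fetch per clock.
CORES = 16
DATA_BYTES_PER_CLOCK = 4 * 16      # four 128-bit data references
INSTR_BYTES_PER_CLOCK = 16         # one 16-byte instruction fetch
CLOCK_HZ = 3e9

demand = CORES * (DATA_BYTES_PER_CLOCK + INSTR_BYTES_PER_CLOCK) * CLOCK_HZ
supply = 2 * 8 * 4800e6            # 2 channels x 8 bytes x 4800 MT/s

print(demand / 1e9)                # 3840.0 GB/s
print(supply / 1e9)                # 76.8 GB/s -> the ~77 GB/s above
print(demand / supply)             # demand exceeds supply 50x
```

The 50x gap between peak demand and DRAM bandwidth is precisely why the multi-level cache hierarchy described below exists.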
Address Space and Translation Lookaside Buffer (TLB)
- Address Space: 48-bit virtual address space mapped to a 36-bit physical address space.
- Translation Hierarchy: Utilizes a two-level TLB structure.
- Instruction TLB (L1 I-TLB):
- 256 entries for 4 KiB pages; 32 entries for 2/4 MiB pages.
- 8-way set associative; Pseudo-LRU replacement.
- 1-cycle access latency.
- Data TLB (L1 D-TLB):
- 96 entries for 4 KiB pages; 32 entries for 2/4 MiB pages; 8 entries for 1 GiB pages.
- 6-way (4 KiB pages), 4-way (2/4 MiB pages), and 8-way (1 GiB pages) set associative; Pseudo-LRU replacement.
- 1-cycle access latency. Includes separate structures for loads and stores.
- Second-Level TLB (STLB):
- 2048 entries supporting 4 KiB, 2 MiB, and 1 GiB pages.
- 16-way set associative; Pseudo-LRU replacement.
- 8-cycle access latency (9-cycle total miss penalty from L1 TLB).
- Page Table Walker: Hardware-based page table walker handles STLB misses, supporting up to four parallel page table walks.
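The entry counts above determine the TLB "reach," i.e. how much memory can be touched without a TLB miss; a minimal sketch using the figures from this section:

```python
# TLB reach: memory covered without taking a TLB miss, computed from
# the entry counts above (4 KiB pages unless noted).
PAGE_4K = 4 * 1024

l1_dtlb_reach = 96 * PAGE_4K        # L1 D-TLB, 4 KiB pages
stlb_reach = 2048 * PAGE_4K         # STLB filled with 4 KiB pages
huge_reach = 8 * (1 << 30)          # eight 1 GiB entries in the D-TLB

print(l1_dtlb_reach // 1024)        # 384 KiB via the L1 D-TLB
print(stlb_reach // (1 << 20))      # 8 MiB via the STLB
print(huge_reach // (1 << 30))      # 8 GiB via 1 GiB huge pages
```

Note the 8 MiB STLB reach with 4 KiB pages is smaller than the 30 MiB L3, which is one motivation for using larger page sizes on big working sets.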
Cache Hierarchy Specifications
All cache levels utilize a 64-byte block size and employ write-back policies (where applicable).
- L1 Instruction Cache (L1I):
- Size: 32 KiB per core.
- Associativity: 8-way set associative.
- Latency: 4 clock cycles.
- Addressing: Virtually indexed, physically tagged.
- L1 Data Cache (L1D):
- Size: 48 KiB per core.
- Associativity: 6-way set associative (dual-ported).
- Latency: 5 clock cycles.
- Addressing: Virtually indexed, physically tagged.
- L2 Cache:
- Size: 1.25 MiB per P-core.
- Associativity: 10-way set associative.
- Latency: 15 clock cycles.
- Addressing: Physically indexed, physically tagged.
- Replacement: Weighted n-bit LRU.
- L3 Cache (Last Level Cache - LLC):
- Size: 30 MiB shared across all cores.
- Structure: Distributed into 8 banks. A hash function maps addresses to specific banks.
- Associativity: 15-way set associative.
- Latency: 50 clock cycles.
- Inclusion Policy: Non-inclusive. Primarily holds blocks evicted from the L2 cache.
Memory Access Flow and Address Calculations
A 64-bit virtual address (of which 48 bits are significant) resolves to a 36-bit physical address. Cache access relies on splitting the physical address into Tag, Index, and Block Offset.
- Block Offset: All caches use 64-byte blocks, requiring a 6-bit offset (2^6 = 64).
- L1 Instruction Cache Access:
- Index: 32 KiB / (64 B/block × 8 ways) = 64 sets → 6 index bits.
- Tag: 36 − 6 (index) − 6 (offset) = 24 tag bits.
- Fetch Width: The fetch unit retrieves 32 bytes per cycle. It uses 1 additional bit from the 6-bit block offset to select the correct 32-byte chunk.
- L2 Cache Access:
- Index: 1.25 MiB / (64 B/block × 10 ways) = 2048 sets → 11 index bits.
- Tag: 36 − 11 (index) − 6 (offset) = 19 tag bits.
- L3 Cache Access (Per 3.75 MiB Bank):
- Index: 3.75 MiB / (64 B/block × 15 ways) = 4096 sets → 12 index bits.
- Tag: 36 − 12 (index) − 6 (offset) = 18 tag bits.
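These breakdowns follow mechanically from capacity, associativity, block size, and the 36-bit physical address. A short sketch that recomputes them (the L1D row is derived the same way, though the text does not list it explicitly):

```python
# Recompute each cache's set count and index/tag widths from its size,
# associativity, 64-byte blocks, and the 36-bit physical address.
from math import log2

PHYS_BITS, BLOCK, OFFSET_BITS = 36, 64, 6

def breakdown(size_bytes, ways):
    sets = size_bytes // (BLOCK * ways)
    index_bits = int(log2(sets))
    tag_bits = PHYS_BITS - index_bits - OFFSET_BITS
    return sets, index_bits, tag_bits

print(breakdown(32 * 1024, 8))               # L1I: (64, 6, 24)
print(breakdown(48 * 1024, 6))               # L1D: (128, 7, 23)
print(breakdown(1280 * 1024, 10))            # L2:  (2048, 11, 19)
print(breakdown(30 * 1024 * 1024 // 8, 15))  # L3 bank: (4096, 12, 18)
```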
Cache Miss and Write Management
- L1 Data Cache Write Strategy:
- Utilizes a no-write-allocate policy on write misses.
- Store misses bypass cache allocation and are placed directly into a merging write buffer.
- Merging Write Buffers:
- Captures dirty cache lines and unallocated write misses.
- Writes data back to the next memory level when that level is not actively serving a read request.
- Checked concurrently during a cache miss; if the requested line resides in the write buffer, the miss is filled directly from the buffer.
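The merging behavior can be pictured as a small map keyed by block address. This toy model (all names are illustrative, not Intel's) shows the two operations described above, merging a store into an existing entry and probing the buffer on a read miss:

```python
# Toy merging write buffer keyed by 64-byte block address.
# A store to a block already present merges into the existing entry
# instead of allocating a new one; a read miss probes the buffer
# before going to the next memory level.
BLOCK = 64

class WriteBuffer:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = {}                  # block base address -> {offset: byte}

    def store(self, addr, data: bytes):
        base, off = addr - addr % BLOCK, addr % BLOCK
        entry = self.entries.setdefault(base, {})   # merge if block present
        for i, b in enumerate(data):
            entry[off + i] = b
        return len(self.entries) <= self.capacity   # False -> buffer must drain

    def probe(self, addr):
        """On a read miss, return buffered data for the block, if any."""
        return self.entries.get(addr - addr % BLOCK)

wb = WriteBuffer()
wb.store(0x1008, b"\xaa")
wb.store(0x1010, b"\xbb")            # same 64-byte block: merged, not a new entry
print(len(wb.entries))               # 1
print(wb.probe(0x1000) is not None)  # True: miss can be filled from the buffer
```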
- L3 and Main Memory Interaction:
- A miss at the L3 cache initiates a main memory access.
- Miss Penalty: ~50 cycles to detect the L3 miss + ~160 cycles of DRAM latency (DDR5-4800) ≈ 210 cycles before the first 16 bytes arrive. The remaining 48 bytes transfer at 32 bytes/cycle, taking about 2 additional cycles.
- Due to the non-inclusive policy, data fetched from main memory is written directly into L1 and L2 caches, bypassing L3 insertion.
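The miss-penalty arithmetic can be tallied directly (critical 16 bytes first, then the rest of the block at the stated transfer rate):

```python
# L3 miss penalty for one 64-byte block, using the figures above.
DETECT_CYCLES = 50     # cycles to detect the miss at L3
DRAM_CYCLES = 160      # DDR5-4800 latency for the critical first 16 bytes
BYTES_PER_CYCLE = 32   # transfer rate for the rest of the block

critical = DETECT_CYCLES + DRAM_CYCLES       # cycles to first 16 bytes
remainder = (64 - 16) / BYTES_PER_CYCLE      # remaining 48 bytes

print(critical)              # 210 cycles before the first 16 bytes arrive
print(critical + remainder)  # 211.5 cycles for the full block
```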
Prefetching and Performance Metrics
- Hardware Prefetching: Supported at both L1 and L2 levels, predicting and fetching data from the next level in the hierarchy to hide latency.
- Miss Rates (MPKI):
- Instruction cache miss rates are very low (under 1% for most integer workloads).
- Memory-intensive workloads generate significant L1D misses (>20 MPKI) and L2 misses (>10 MPKI), placing heavy reliance on the L3 cache and hardware prefetchers.
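The latencies above can be folded into a back-of-envelope average memory access time (AMAT). The miss rates here are illustrative placeholders for a memory-intensive workload, not measured values:

```python
# Back-of-envelope AMAT from the latencies in this document.
# The local miss rates are hypothetical assumptions, not measurements.
L1_HIT, L2_HIT, L3_HIT, MEM_PENALTY = 5, 15, 50, 210

l1_miss, l2_miss, l3_miss = 0.05, 0.40, 0.30   # hypothetical local miss rates

# AMAT = hit time + miss rate x miss penalty, applied level by level.
amat = L1_HIT + l1_miss * (L2_HIT + l2_miss * (L3_HIT + l3_miss * MEM_PENALTY))
print(round(amat, 2))   # 8.01 cycles
```

Even modest miss rates at each level keep the average close to the 5-cycle L1 hit time, which is the whole point of the hierarchy; with prefetching hiding part of the L3 and DRAM latency, the effective figure would be lower still.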