Intel Golden Cove Microarchitecture
Architecture Overview
- Golden Cove operates as an aggressive out-of-order (OOO), speculative microarchitecture.
- The primary design objective is maximizing instruction-level parallelism (ILP) for single-threaded execution.
- The hardware supports Simultaneous Multithreading (SMT) to execute up to two threads per core concurrently.
- The microarchitecture is deployed in both client processors (Alder Lake P-cores) and server processors (Sapphire Rapids).
To sustain high ILP in the OOO core, the architecture relies on an advanced front-end capable of high-bandwidth instruction delivery.
Instruction Fetch and Front-End Delivery
- Instruction fetch bandwidth reaches 32 bytes per cycle from the instruction cache.
- Six parallel decoders translate variable-length x86 instructions (ranging from 1 to 15 bytes) into RISC-like micro-ops.
- Decoded micro-ops are buffered in a 144-entry micro-op queue.
- A microcode engine handles complex x86 instructions, generating micro-op sequences at a rate of up to four micro-ops per cycle.
- Micro-op Cache:
- Stores 4K decoded micro-ops.
- Delivers up to 8 micro-ops per cycle directly to the micro-op queue, entirely bypassing the instruction cache and decoders.
- Loop Stream Detector (LSD):
- Tracks iterative execution streams that fit entirely within the micro-op queue.
- Disables upstream front-end components during single-threaded loop execution to conserve power.
- Reallocates the idle front-end resources to the alternate thread when operating in SMT mode.
- Instruction Translation Lookaside Buffer (ITLB):
- Maintains 256 entries for 4 KB pages and 32 entries for 2 MB or 4 MB pages to support large code footprints.
- Branch Prediction:
  - The second-level Branch Target Buffer (BTB) holds up to 12K entries.
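The delivery figures above can be folded into a rough front-end bandwidth estimate. The sketch below is a deliberate simplification: only the 8-wide micro-op-cache path and the 6-wide legacy-decode path are modeled, the LSD and microcode engine are ignored, and the hit rate is an assumed input rather than a figure from this document.

```python
def frontend_uops_per_cycle(uop_cache_hit_rate: float) -> float:
    """Blend the two delivery paths described above:
    the micro-op cache supplies up to 8 micro-ops/cycle,
    the 6-wide legacy decoders supply up to 6 micro-ops/cycle."""
    UOP_CACHE_WIDTH = 8   # micro-ops/cycle from the micro-op cache
    DECODE_WIDTH = 6      # micro-ops/cycle from the legacy decode path
    miss_rate = 1.0 - uop_cache_hit_rate
    return uop_cache_hit_rate * UOP_CACHE_WIDTH + miss_rate * DECODE_WIDTH

# A 90% micro-op-cache hit rate blends to 0.9*8 + 0.1*6 = 7.8 micro-ops/cycle.
print(frontend_uops_per_cycle(0.9))
```

Because rename downstream is only 6-wide, the extra micro-op-cache bandwidth mainly helps refill the micro-op queue after flushes rather than raise sustained throughput.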
Once micro-ops are successfully fetched and decoded, they flow into the execution engine where hardware dynamically schedules them to extract maximum parallelism.
Out-of-Order Execution Engine
- The Reorder Buffer (ROB) provides 512 entries to support extremely deep OOO execution windows.
- Rename and Issue:
- Renames and issues up to 6 micro-ops per cycle to the underlying schedulers.
- Implements zero and move idiom elimination, allowing the hardware to resolve register clearing and data movement without consuming functional unit scheduling resources.
- Execution Schedulers:
- The arithmetic scheduler contains 97 entries shared across integer and floating-point operations.
- Issues up to 5 micro-ops per cycle to integer functional units.
- Issues up to 3 micro-ops per cycle to floating-point and vector functional units.
- Advanced ISA Extensions:
- Supports AVX-512 SIMD instructions along with Bfloat16 (BF16) and 8-bit integer (INT8) datatypes for machine-learning workloads.
  - Advanced Matrix Extensions (AMX) define a 2D register file of 8 tile registers, each structured as an array of up to 16 rows of 64 bytes (1 KB per tile).
- AMX matrix multiplication executes at a maximum computational rate of 2K INT8 operations per cycle.
- Misprediction Recovery:
- A 128-entry Branch Order Buffer captures snapshots of the register renaming state at conditional and indirect branch instructions.
  - Upon detecting a misprediction, hardware immediately flushes younger instructions and restores the pre-branch snapshot.
  - Older instructions continue executing uninterrupted during recovery, eliminating the need to wait for the mispredicted branch to reach the head of the ROB.
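The AMX figures above translate directly into per-core peak arithmetic rates. A small sketch, using the standard AMX tile geometry (8 registers of 16 rows x 64 bytes) and an illustrative 2.0 GHz clock that is an assumption, not a figure from this document:

```python
def amx_peak_int8_tops(ops_per_cycle: int, freq_ghz: float) -> float:
    """Peak INT8 tera-operations/second for one core.
    ops_per_cycle: the 2K (2048) INT8 ops/cycle rate quoted above.
    freq_ghz: an assumed clock; real AMX frequencies vary with load."""
    # ops/cycle * Gcycles/s = Gops/s; divide by 1000 for Tops/s.
    return ops_per_cycle * freq_ghz / 1000.0

# Tile storage: 8 tile registers x 16 rows x 64 bytes = 8 KB total.
tile_file_bytes = 8 * 16 * 64
print(tile_file_bytes)                # 8192
print(amx_peak_int8_tops(2048, 2.0))  # 4.096 INT8 TOPS per core (assumed clock)
```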
The massive computational throughput generated by the OOO engine dictates a highly parallel, high-bandwidth memory subsystem to supply data and prevent execution starvation.
Memory Subsystem and Disambiguation
- Load and Store Execution:
  - Execution throughput peaks at either three 32-byte loads or two 64-byte loads per cycle, plus two 64-byte stores per cycle.
- The load buffer holds 192 entries, while the store buffer holds 114 entries.
- A 70-entry load scheduler feeds 3 address-generation units (AGUs) per cycle.
- A 38-entry store scheduler feeds 2 AGUs and 2 store data units per cycle.
- Memory Disambiguation and Forwarding:
- Speculative memory disambiguation exposes hidden parallelism between concurrent load and store operations.
- Store forwarding hardware supports complex forwarding, including reading partial bytes directly from a pending store while fetching the remainder from the data cache.
- Cache Hierarchy:
- L1 Data Cache: 48 KB capacity, 12-way associative, featuring a 5-cycle load-to-use latency for cache hits. A fill buffer tracks up to 16 outstanding cache misses.
- L2 Cache: 1.25 MB (client) or 2 MB (server) capacity, delivering 64 bytes per cycle with a 15-cycle latency. It supports up to 48 outstanding misses or prefetch requests.
- L2 Prefetcher: Uses a pattern-based multipath prefetcher modulated by feedback-based adaptive throttling.
- Data Translation Lookaside Buffers (TLBs):
- Data TLB: 96 entries for 4 KB pages (6-way associative), 32 entries for 2/4 MB pages (4-way associative), and 8 entries for 1 GB pages.
- Store TLB: 16 entries applicable across all page sizes.
- Second-Level TLB (STLB): 2K entries backed by a Page Miss Handler (PMH) capable of resolving STLB misses via 4 parallel page-table walks.
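The latencies listed above (5-cycle L1 load-to-use, 15-cycle L2) can be folded into a simple average-memory-access-time estimate. This is a sketch only: the 60-cycle penalty for misses beyond L2 is an assumed placeholder, and the hit rates are hypothetical inputs, not measurements from this document.

```python
def avg_mem_access_cycles(l1_hit_rate: float,
                          l2_hit_rate: float,
                          beyond_l2_cycles: int = 60) -> float:
    """Expected load cost in cycles, blending the per-level latencies.
    L1 (5 cycles) and L2 (15 cycles) come from the hierarchy above;
    beyond_l2_cycles is an assumed L3/DRAM placeholder."""
    L1_CYCLES, L2_CYCLES = 5, 15
    l1_miss = 1.0 - l1_hit_rate
    return (l1_hit_rate * L1_CYCLES
            + l1_miss * (l2_hit_rate * L2_CYCLES
                         + (1.0 - l2_hit_rate) * beyond_l2_cycles))

# 95% L1 hits, 80% L2 hits: 0.95*5 + 0.05*(0.8*15 + 0.2*60) = 5.95 cycles.
print(avg_mem_access_cycles(0.95, 0.80))
```

This also illustrates why the deep buffering above matters: the 192-entry load buffer and 48 outstanding L2 misses exist precisely to overlap many of these multi-cycle accesses.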
The tight integration of the wide front-end, deep OOO execution engine, and aggressive memory subsystem directly shapes the microarchitecture’s performance profile across varied thread counts.
Performance and Throughput Characteristics
- Single-Thread Performance:
- Clock frequencies scale up to 5.5 GHz for client architectures and 4.2 GHz for server architectures.
  - Cycles Per Instruction (CPI) ranges from 0.32 to 0.89, corresponding to an instruction throughput of roughly 1.1 to 3.1 instructions per cycle.
- Single-thread throughput relies heavily on minimizing branch misprediction, cache miss, and TLB miss rates.
- Simultaneous Multithreading (SMT) Efficiency:
  - Dual-threaded SMT execution yields 5% to 46% higher throughput than single-threaded execution.
  - SMT gains are largest on workloads with unpredictable branches or random memory accesses (such as pointer-chasing algorithms), which inherently underutilize the wide OOO core.
- Multi-Core Contention:
  - Enabling SMT on all cores (e.g., up to 224 threads on a 112-core, two-socket system) sharply increases L1, L2, and L3 cache miss rates due to contention for shared resources.
  - Despite the higher per-thread CPI caused by shared cache contention, the parallel execution of the second thread yields a net throughput gain at the core level.
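The CPI and SMT figures above combine by simple arithmetic. A sketch (the CPI endpoints and the 5-46% gain range come from the text; which workload gets which gain is illustrative, not measured):

```python
def ipc_from_cpi(cpi: float) -> float:
    """Instructions per cycle is the reciprocal of cycles per instruction."""
    return 1.0 / cpi

def smt_core_ipc(single_thread_ipc: float, smt_gain: float) -> float:
    """Combined two-thread throughput of one core, given single-thread
    IPC and a fractional SMT gain (the 0.05-0.46 range cited above)."""
    return single_thread_ipc * (1.0 + smt_gain)

print(ipc_from_cpi(0.32))        # 3.125 -- best-case single-thread IPC
print(ipc_from_cpi(0.89))        # ~1.12 -- worst-case single-thread IPC
print(smt_core_ipc(1.12, 0.46))  # low-IPC workloads gain the most from SMT
```

Note the asymmetry this exposes: a high-IPC thread already saturates the 6-wide pipeline and gains little, while a stall-bound thread leaves issue slots free for its sibling.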
Ultimately, the observed multi-core throughput gains validate Golden Cove’s core strategy: deploying a massively wide OOO pipeline to maximize single-thread ILP, while leveraging SMT to absorb the structural latency introduced by complex memory hierarchies.