Key ILP Concepts in Modern Superscalar Processors
Superscalar Processors and Functional Equivalency
- Superscalar processors employ hardware mechanisms to discover and exploit instruction-level parallelism (ILP), achieving an ideal Cycles Per Instruction (CPI) of less than 1 (more than one instruction completed per cycle).
- Out-of-order (OOO) execution allows instructions to execute as their data dependences permit, rather than in strict program order.
- Hardware must maintain functional equivalency to a purely sequential processor, guaranteeing that final register values, memory states, and the visibility order of exceptions exactly match the original program semantics.
To achieve sub-1 CPI while preserving strict sequential semantics, hardware scales the pipeline in two dimensions: depth (more, shorter stages) and width (more instructions issued per cycle).
Performance Scaling: Deeper Pipelining and Multiple Issue
- Processor performance relies on optimizing the execution-time equation: Execution Time = Instruction Count × CPI × Clock Cycle Time.
- Deeper Pipelining: Modern pipelines extend to 10–20 stages, enabling shorter clock cycle times and frequencies of 1–5 GHz.
- Simple operations (e.g., integer addition) require a single stage, while complex operations (e.g., multiplication, cache access) span 2–4 stages.
- Approximately half of all pipeline stages are dedicated to complex control logic for dynamic ILP scheduling.
- Multiple Issue: Hardware initiates 2–16 instructions per clock cycle.
- A 4-wide, 15-stage processor can overlap the execution of up to 60 instructions concurrently.
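The arithmetic above can be sketched directly; the numbers here are the illustrative figures from the text, not measurements of any real processor:

```python
# The classic execution-time ("iron law") equation:
#   Execution Time = Instruction Count x CPI x Clock Cycle Time

def execution_time_s(instruction_count, cpi, clock_ghz):
    """Estimate CPU time in seconds from the three iron-law factors."""
    cycle_time_s = 1.0 / (clock_ghz * 1e9)
    return instruction_count * cpi * cycle_time_s

# Hypothetical example: 1 billion instructions on a machine that
# sustains a CPI of 0.5 (2 instructions per cycle) at 4 GHz.
t = execution_time_s(1_000_000_000, 0.5, 4.0)

# In-flight window of a 4-wide, 15-stage pipeline: up to 4 * 15
# instructions overlap concurrently.
in_flight = 4 * 15
```

Halving CPI or doubling the clock each halves execution time, which is why superscalar designs attack both factors at once.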
Processing dozens of overlapping instructions introduces severe control hazards that must be mitigated to prevent continuous pipeline stalls.
Control Dependences: Branch Prediction and Speculative Execution
- Programs typically execute a branch every 4–7 instructions, severely restricting the block sizes available for extracting ILP.
- Dynamic Branch Prediction: Hardware structures track execution history to forecast both branch direction (taken/not-taken) and the target address in parallel with instruction fetch operations.
- Speculative Execution: Instructions fetched along the predicted path execute speculatively before the branch condition is definitively known.
- Branch outcomes are validated during execution.
- Mispredictions trigger a recovery mechanism that cancels incorrectly fetched instructions, flushes the pipeline, and redirects the program counter (PC).
- Deep, wide pipelines require extreme prediction accuracy; a single misprediction on a 4-wide, 10-cycle delay pipeline wastes 40 potential instruction executions.
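A minimal sketch of history-based direction prediction, using the textbook 2-bit saturating counter scheme (a simplification; real predictors combine several such structures, and the table size and indexing here are arbitrary choices):

```python
# 2-bit saturating counters: states 0-1 predict not-taken, 2-3 predict
# taken. Each branch updates its counter toward the observed outcome, so
# one anomalous outcome does not flip a well-established prediction.

class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [2] * entries  # initialize weakly taken

    def _index(self, pc):
        return (pc >> 2) % self.entries  # index by low PC bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # True = taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = TwoBitPredictor()
pc = 0x400123
for _ in range(4):
    bp.update(pc, True)   # train on a repeatedly taken loop branch
taken = bp.predict(pc)    # now confidently predicts taken
```

Each misprediction on the 4-wide, 10-cycle-penalty pipeline described above discards up to 4 × 10 = 40 instruction slots, which is why even 95% accuracy can be inadequate at these widths.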
Once branch prediction supplies a continuous stream of speculative instructions, the hardware must untangle artificial operand constraints to allow parallel processing.
Name Dependences: Register Renaming
- Aggressively overlapping loop iterations or sequential instructions generates Write-After-Write (WAW) and Write-After-Read (WAR) name dependences due to the limited number of architectural registers defined by the Instruction Set Architecture (ISA).
- Register Renaming: Hardware dynamically maps the small set of architectural registers to a substantially larger internal pool of physical registers.
- Eliminates false name dependences, enabling hardware to effectively unroll loops and overlap independent executions that share identical architectural register names.
- Renaming logic tracks physical register state, updating tables to indicate which physical register holds an instruction’s input and allocating new physical registers for outputs.
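The map-table mechanics can be sketched as follows; the register counts and the free-list policy are illustrative assumptions, not any real ISA's parameters:

```python
# Toy register renamer: a map table from architectural to physical
# registers plus a free list of unallocated physical registers.

class Renamer:
    def __init__(self, arch_regs=8, phys_regs=32):
        self.map = {r: r for r in range(arch_regs)}    # arch -> phys
        self.free = list(range(arch_regs, phys_regs))  # free physical regs

    def rename(self, dst, srcs):
        """Rename one instruction: look up sources in the map, then
        allocate a fresh physical register for the destination,
        eliminating WAW/WAR hazards on the architectural name."""
        phys_srcs = [self.map[s] for s in srcs]
        old_dst = self.map[dst]       # freed later, when this inst commits
        new_dst = self.free.pop(0)
        self.map[dst] = new_dst
        return new_dst, phys_srcs, old_dst

r = Renamer()
# Two successive writes to architectural r1 receive distinct physical
# registers, so the WAW dependence between them disappears.
d1, _, _ = r.rename(1, [2, 3])
d2, _, _ = r.rename(1, [4, 5])
```

Because `d1 != d2`, both writes (and their dependent readers) can proceed in parallel, which is exactly how hardware "unrolls" loops that reuse the same architectural names each iteration.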
With false name dependences removed, the pipeline isolates the true data flow, permitting instructions to execute precisely when their required operands manifest.
Data Dependences: Dynamic Scheduling and OOO Execution
- Dynamic Scheduling: Hardware executes fetched instructions in data-flow order, avoiding the stalls that strict program order would impose.
- Instruction Wakeup: Hardware continuously monitors the readiness of mapped physical registers. An instruction awakens when all its input operands are generated by preceding instructions.
- Instruction Dispatch: Ready instructions are dispatched to available functional units based on heuristics like instruction age and criticality.
- This OOO execution allows the processor to continue processing independent instructions while bypassing high-latency operations, such as L1 cache misses.
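The wakeup/dispatch loop can be sketched with an oldest-first selection heuristic (one of the heuristics the text mentions; real selectors also weigh criticality, and this toy dispatches only one instruction per step rather than one per functional unit):

```python
# Toy dynamic scheduler: instructions wait in a window until every
# source physical register is ready (wakeup), then the oldest ready
# instruction dispatches and its destination becomes ready in turn.

def schedule(window, ready_regs):
    """window: list of (age, dest_reg, src_regs) in program order.
    Returns the ages in the order the instructions actually execute."""
    order = []
    pending = list(window)
    while pending:
        ready = [i for i in pending if all(s in ready_regs for s in i[2])]
        if not ready:
            break                       # nothing can wake up yet
        inst = min(ready, key=lambda i: i[0])  # oldest-first dispatch
        order.append(inst[0])
        ready_regs.add(inst[1])         # wakeup consumers of the result
        pending.remove(inst)
    return order

# i0 waits on r13, produced last; i1 and i2 are independent and execute
# out of order, ahead of the stalled i0.
window = [(0, 11, {13}), (1, 12, {1}), (2, 13, {2})]
order = schedule(window, ready_regs={1, 2})
```

Here `order` comes out as `[1, 2, 0]`: the younger independent instructions bypass the oldest one, just as independent work bypasses an L1 miss in real hardware.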
While register data dependences are easily determined via renaming, data flowing through memory introduces dynamic address uncertainty that hardware must aggressively manage.
Memory Dependences: Speculative Memory Disambiguation
- Data dependences through memory locations are complex because effective memory addresses are unknown until calculated by an address generation unit.
- Stalling every load until the addresses of all older uncommitted stores are resolved severely bottlenecks ILP.
- Speculative Memory Disambiguation: Hardware tracks historical load/store patterns to predict potential address collisions.
- Loads predicted to be independent bypass older uncalculated stores and execute speculatively.
- Once all older store addresses resolve, the hardware verifies the speculation. If a collision occurred, the load and all subsequent dependent instructions are canceled and re-executed.
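The verification step can be sketched with a toy store queue; the interface and age-based bookkeeping are illustrative assumptions, and the prediction itself (whether to let the load bypass) is omitted:

```python
# Toy store queue for memory disambiguation: stores enter with unknown
# addresses, resolve later, and a speculatively executed load is checked
# against all older stores once their addresses are known.

class StoreQueue:
    def __init__(self):
        self.entries = []  # [age, addr]; addr is None until resolved

    def add_store(self, age, addr=None):
        self.entries.append([age, addr])

    def resolve(self, age, addr):
        for e in self.entries:
            if e[0] == age:
                e[1] = addr

    def load_must_replay(self, load_age, load_addr):
        """After older stores resolve: did the bypassing load collide?"""
        return any(a < load_age and addr == load_addr
                   for a, addr in self.entries)

sq = StoreQueue()
sq.add_store(0)              # older store, address unknown at load time
# A load (age 1, addr 0x100) predicted independent executes speculatively.
sq.resolve(0, 0x200)         # store address arrives: no overlap
safe = not sq.load_must_replay(1, 0x100)
```

If the store had instead resolved to `0x100`, the check would flag a collision and the load plus its dependents would be squashed and re-executed.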
Widespread speculation across branches, data flow, and memory addresses necessitates a centralized mechanism to safely commit valid state and cleanly revert mispredictions.
Speculation Recovery: The Reorder Buffer (ROB) and Precise Exceptions
- Reorder Buffer (ROB): A FIFO circular hardware structure that isolates speculative, OOO execution from the permanent architectural state.
- Lifecycle of an Instruction in the ROB:
- Allocation: Instructions receive ROB entries during the decode/issue phase in strict program order.
- Execution: Instructions execute OOO, buffering computed results and any generated exception flags into their designated ROB entries.
- In-order Commit (Retirement): Hardware constantly evaluates the oldest instruction at the head of the ROB. If it is complete and non-exceptional, its results permanently update the architectural registers and memory, and the physical register holding the previous mapping of its destination is freed.
- Mispeculation Recovery: If the ROB head contains a mispredicted branch or an invalid memory load, the ROB frees the entry, flushes all subsequent speculative instructions, discards allocated physical registers, and resumes fetching at the correct PC.
- Precise Exceptions: If the ROB head contains an instruction marked with an exception (e.g., a page fault), the hardware cancels the instruction and all that follow, transferring control to the OS handler without exposing any speculative or OOO artifacts.
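The retirement rules above can be sketched as a loop over the ROB head; the entry fields are illustrative names, and the OS handoff is reduced to a flush:

```python
# Toy in-order retirement from a reorder buffer. Only the head may
# commit; an incomplete head stalls retirement, and an excepting head
# squashes itself and everything younger before vectoring to the OS.

from collections import deque

def retire(rob):
    """rob: deque of entries (oldest first) with 'done'/'exception' flags.
    Returns the ids committed before retirement stops or flushes."""
    committed = []
    while rob:
        head = rob[0]
        if not head["done"]:
            break                 # oldest still executing: wait in order
        if head["exception"]:
            rob.clear()           # flush head and all younger entries,
            break                 # then transfer control to the handler
        committed.append(rob.popleft()["id"])
    return committed

rob = deque([
    {"id": 0, "done": True,  "exception": False},
    {"id": 1, "done": True,  "exception": False},
    {"id": 2, "done": False, "exception": False},  # still executing
    {"id": 3, "done": True,  "exception": False},  # finished OOO, must wait
])
committed = retire(rob)  # commits 0 and 1, then stalls at entry 2
```

Entry 3 finished out of order but cannot retire past the incomplete entry 2; this in-order commit discipline is precisely what makes exceptions appear sequential to software.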
The ROB forms the backbone of the superscalar architecture, seamlessly unifying the chaotic out-of-order execution engine with a stable, in-order architectural interface.
Pipeline Architecture Overview
- Modern superscalar processors synthesize these concepts into three primary macro-stages that process multiple instructions concurrently:
- Front-end (In-order, Speculative): Fetches and decodes instruction bundles. Predicts branches, renames registers, and allocates entries in the ROB and scheduling queues. Halts fetch if physical registers or ROB entries are exhausted.
- Execution Engine (Out-of-order, Speculative): Tracks operand readiness across the instruction window. Dispatches ready instructions to available functional units. Computes results and buffers outcomes in the ROB.
- Back-end (In-order, Non-speculative): Governs the ROB. Retires instructions chronologically, updates the permanent architectural state, and coordinates precise recovery from exceptions and mispeculations.