Superscalar Processors

Superscalar processors discover and exploit ILP using hardware techniques, achieving an ideal CPI below 1. Dominant across server, desktop, laptop, and cellphone markets, they have grown so complex to design that vendors now combine them with SIMD (DLP) and multicore (TLP).

  • Functional Equivalency: Must produce the same register/memory values, exceptions, and visibility order as a simple in-order processor, even when executing OOO.
  • OOO Execution: Most modern superscalars execute in data-dependence order. OOO superscalar and superscalar are largely synonymous today.
  • Binary Compatibility: Delivers performance gains for existing binaries without recompilation, a key driver of commercial success.
  • Compiler Reliance: Hardware applies optimizations dynamically within a window of tens to hundreds of instructions, reducing reliance on advanced compiler techniques.

Performance targets via:

  • Multiple Issue (↓ CPI): 2–16 instructions per cycle, best-case CPI = 1/N for an N-wide processor. A 4-wide, 15-stage design overlaps up to 60 instructions concurrently.
  • Deeper Pipelining (↓ CCT): 10–20 stages, 1–5 GHz. Simple ops take 1 stage, complex ops (multiply, FP, cache) take 2–4 stages. About half of all stages handle ILP scheduling logic.
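The overlap arithmetic above can be sketched directly (a trivial model: up to N instructions enter per cycle and each occupies roughly one of S stages):

```python
# Best-case overlap for an N-wide, S-stage superscalar, using the
# figures quoted in the text above.
def max_in_flight(width, stages):
    # Up to `width` instructions enter per cycle; each stays ~`stages` cycles.
    return width * stages

def best_case_cpi(width):
    return 1.0 / width

print(max_in_flight(4, 15))   # 4-wide, 15-stage -> 60 instructions in flight
print(best_case_cpi(4))       # ideal CPI = 0.25
```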

Dependences

Dependences are a property of programs. Whether they result in hazards and stalls is a property of the pipeline. Two independent instructions can execute simultaneously without stalls, while dependent instructions must execute in order. Dependences can be overcome by scheduling code to avoid the hazard or by transforming code to eliminate the dependence. A superscalar processor attacks each class of dependence with a dedicated hardware mechanism.

Data Dependences

Instruction j is data dependent on instruction i if i produces a result that j may use, or if j depends on an instruction k which in turn depends on i (a dependence chain). A data dependence conveys:

  • Hazard Potential: The possibility of a pipeline hazard.
  • Execution Order: The order in which results must be calculated.
  • Parallelism Limit: An upper bound on exploitable parallelism.

Data flow through registers is easy to detect since names are fixed in instruction encodings. Data flow through memory is harder since two different-looking addresses may refer to the same location, and effective addresses may change across executions.

Solution: Dynamic Scheduling. Hardware executes instructions out of program order as soon as data dependences are resolved. Two structures manage this:

Instruction Scheduler: A table of in-flight instruction entries. Each row holds identity fields (Busy, Op, DstReg, ROB) and two operand groups (Input Reg j and Input Reg k), each tracking whether the operand is needed, which physical register holds it, and whether its value is ready. A Ready flag is derived when both operands are satisfied. The Allocator fills entries in-order at issue. Completion notifications from FUs broadcast into the operand fields, updating readiness. The Scheduler scans the Ready column and, given FU availability, dispatches instructions out-of-order.
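The scheduler table above can be sketched as follows. This is a minimal illustrative model, not a real design: entries hold a destination physical register and two source operands, completion broadcasts wake up waiting operands, and select dispatches any ready entry out of order (functional-unit availability is ignored).

```python
# Illustrative sketch of a scheduler (instruction-window) table.
class SchedEntry:
    def __init__(self, op, dst, srcs):
        self.busy = True
        self.op = op
        self.dst = dst                       # destination physical register
        self.srcs = {p: False for p in srcs} # source preg -> ready flag

    @property
    def ready(self):
        # Derived Ready flag: all source operands satisfied.
        return self.busy and all(self.srcs.values())

class Scheduler:
    def __init__(self):
        self.entries = []

    def issue(self, op, dst, srcs, ready_pregs):
        # Allocator fills entries in order; operands already produced
        # are marked ready at issue time.
        e = SchedEntry(op, dst, srcs)
        for p in e.srcs:
            if p in ready_pregs:
                e.srcs[p] = True
        self.entries.append(e)

    def wakeup(self, completed_preg):
        # Completion broadcast from an FU updates matching operand fields.
        for e in self.entries:
            if completed_preg in e.srcs:
                e.srcs[completed_preg] = True

    def select(self):
        # Scan the Ready column and dispatch out of order.
        for e in self.entries:
            if e.ready:
                e.busy = False
                return e
        return None
```

A dependent instruction only becomes dispatchable after the producer's completion broadcast arrives via `wakeup`.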

Reorder Buffer (ROB): A FIFO circular buffer that enforces in-order commit, guaranteeing precise exceptions. Each entry tracks Ready (execution complete), Dst V (destination valid), Dst AReg (architectural register), Dst PReg (physical register), PC, and Exception/Misprediction Info. Completion signals from FUs update Ready, while exception and misprediction signals arrive separately. The ROB head retires the oldest complete, non-exceptional instruction in order. A misprediction or exception at the head flushes all subsequent instructions, discards allocated physical registers, and resumes at the correct PC.
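A minimal sketch of the ROB's in-order commit discipline, with the entry fields from the text but a much-simplified recovery path (illustrative only):

```python
from collections import deque

class ROBEntry:
    def __init__(self, pc, dst_areg=None, dst_preg=None):
        self.pc = pc
        self.ready = False          # execution complete
        self.exception = False      # exception/misprediction info (simplified)
        self.dst_areg = dst_areg    # architectural destination register
        self.dst_preg = dst_preg    # physical destination register

class ROB:
    def __init__(self):
        self.fifo = deque()         # head = oldest in-flight instruction

    def allocate(self, entry):
        self.fifo.append(entry)     # in-order allocation at issue

    def complete(self, pc, exception=False):
        for e in self.fifo:         # completion signal from an FU
            if e.pc == pc:
                e.ready, e.exception = True, exception

    def commit(self):
        # Retire oldest ready, non-exceptional instructions in order;
        # an exception at the head flushes all younger instructions.
        retired = []
        while self.fifo and self.fifo[0].ready:
            if self.fifo[0].exception:
                self.fifo.clear()
                break
            retired.append(self.fifo.popleft())
        return retired
```

Note how an instruction that completes out of order (here, the one at PC 4) still cannot retire before its older neighbor.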

Name Dependences

A name dependence occurs when two instructions use the same register or memory location but no data flows between them:

  • Antidependence (WAR): Instruction j writes a location that instruction i reads. Original order must be preserved so i reads the correct value.
  • Output Dependence (WAW): Instructions i and j both write the same location. Ordering must ensure the final value is j’s.

Because no value is transmitted, name dependences can be eliminated by register renaming, done statically by the compiler or dynamically by hardware. Each dependence type maps to a pipeline hazard: RAW from data dependences, WAW from output dependences, WAR from antidependences.

Solution: Register Renaming. A superscalar processor uses a large pool of physical registers to eliminate the WAR and WAW hazards arising from name dependences. Three hardware structures manage this:

  • Unified Register File (URF): A large set of physical registers, many more than the 32 architecturally visible ones, holding values for both the architectural registers and temporary values for uncommitted instructions. Each entry carries three metadata bits (Free, Architectural, Ready).
  • Renaming Map: Maps each architectural register to the physical register currently holding its latest value. Updated speculatively as instructions are issued in the front-end; mappings may be cancelled later due to misprediction or exception.
  • Architectural Map: Maps each architectural register to its physical register based only on committed instructions. Updated in the back-end and used for exception and misprediction recovery.

The renaming map tracks the latest speculative mapping, while the architectural map tracks only committed state. The URF holds values for both, with the F/A/R bits indicating each register’s current role.

Physical registers transition through four states encoded by (F, A, R) bits:

  • Free (1, 0, 0): Not in use, available for allocation.
  • Used and Not Ready (0, 0, 0): Allocated to a pending instruction that has not yet completed.
  • Used and Ready (0, 0, 1): Producing instruction has completed. Value available but not yet committed.
  • Architectural (0, 1, 1): Result of a committed instruction, pointed to by the architectural map. Always ready.

At reset, the first 32 physical registers are in the architectural state; the rest are free. Registers transition through these states as instructions advance:

  • Issue (in-order): Looks up input operands in the renaming map. Allocates a new physical register for the output architectural register, updates the renaming map to point that architectural register at it, and sets the new physical register to Used-Not-Ready.
  • Dispatch (OOO): When all input physical registers reach the architectural or used-and-ready state, the instruction dispatches. The destination register remains Used-Not-Ready.
  • Completion (OOO): The functional unit stores the result into the destination physical register and transitions it to Used-Ready, unblocking dependent instructions.
  • Commit (in-order): The destination physical register transitions to architectural. The architectural map is updated to point to it, and the previously mapped physical register transitions to free.
  • Recovery (in-order): All used physical registers are marked free. The speculative renaming map is overwritten by a bulk copy from the architectural map.
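The renaming lifecycle above can be sketched as a small state machine. This is an illustrative model, assuming 4 architectural and 8 physical registers for brevity (real designs use far more), with the (F, A, R) states encoded as string labels:

```python
# (F,A,R) states: Free, Used-Not-Ready, Used-Ready, Architectural.
FREE, NOT_READY, READY, ARCH = "F", "UNR", "UR", "A"

class Renamer:
    def __init__(self, n_arch=4, n_phys=8):
        # At reset: the first n_arch physical registers are architectural,
        # the rest are free.
        self.state = [ARCH] * n_arch + [FREE] * (n_phys - n_arch)
        self.rename_map = list(range(n_arch))   # speculative (front-end) map
        self.arch_map = list(range(n_arch))     # committed (back-end) map

    def issue(self, dst_areg):
        new = self.state.index(FREE)            # allocate a free physical reg
        self.state[new] = NOT_READY
        self.rename_map[dst_areg] = new
        return new

    def complete(self, preg):
        self.state[preg] = READY                # value produced, uncommitted

    def commit(self, dst_areg, preg):
        old = self.arch_map[dst_areg]
        self.state[preg] = ARCH
        self.arch_map[dst_areg] = preg
        self.state[old] = FREE                  # prior mapping now dead

    def recover(self):
        # Free all speculative registers; bulk-copy the committed map
        # over the speculative one.
        for p, s in enumerate(self.state):
            if s in (NOT_READY, READY):
                self.state[p] = FREE
        self.rename_map = list(self.arch_map)
```

Deallocation happens only at commit of the overwriting instruction, matching the conservative policy discussed below.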

Renaming logic is complex because superscalar processors rename all instructions within a bundle in a single clock cycle:

  • Multiported Structures: For an N-wide machine, the renaming map requires up to 2N read ports (two source operands per instruction) and N write ports per cycle. The architectural map requires N write ports for N committing instructions.
  • Intra-bundle Dependences: If an instruction depends on an earlier instruction in the same bundle, it must use the physical register allocated by that earlier instruction directly, bypassing the renaming map lookup.
  • Physical Register Deallocation: A physical register previously mapped to an architectural register can only be freed when the overwriting instruction commits, ensuring no pending exception requires restoring the old value. This trades longer occupancy for simpler hardware.

Control Dependences

A control dependence determines the ordering of an instruction with respect to a branch. Control dependence itself is not the critical property. What must be preserved are:

  • Exception Behavior: Reordering must not cause exceptions that would not have occurred in original program order.
  • Data Flow: The actual flow of values from producers to consumers must be maintained. Branches make data flow dynamic by allowing a given instruction to receive data from multiple predecessors. Program order determines which predecessor delivers the correct value.

Solution: Branch Prediction. Programs execute a branch every 4–7 instructions, severely restricting block sizes for extracting ILP. Hardware predicts both branch direction and target address before the branch resolves:

  • Direction Predictors: Use saturating counters indexed by branch history. Correlating predictors (e.g., gshare) XOR-hash the PC with global history. Tournament predictors combine a global and a local predictor with a 2-bit saturating counter selector. TAGE uses multiple tables with increasing history lengths and tags to avoid aliasing.
  • Branch Target Buffer (BTB): Caches predicted target addresses. If the PC matches a BTB entry and the branch is predicted taken, fetch redirects immediately with zero penalty cycles.
  • Return Address Stack (RAS): Handles procedure returns by pushing the return address on call and popping on return, yielding near-perfect prediction.

Instructions along the predicted path execute speculatively. Mispredictions flush the pipeline and redirect the PC.
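The gshare scheme named above can be sketched as follows: the PC XOR-hashed with a global history register indexes a table of 2-bit saturating counters. Table size and initialization here are illustrative choices, not from any particular processor.

```python
# Illustrative gshare direction predictor.
class Gshare:
    def __init__(self, bits=10):
        self.bits = bits
        self.table = [1] * (1 << bits)  # 2-bit counters, init weakly not-taken
        self.ghr = 0                    # global history register

    def _index(self, pc):
        # XOR-hash the PC with global branch history.
        return (pc ^ self.ghr) & ((1 << self.bits) - 1)

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # taken if counter in {2,3}

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the outcome into the global history register.
        self.ghr = ((self.ghr << 1) | (1 if taken else 0)) & ((1 << self.bits) - 1)
```

After enough always-taken outcomes the history register saturates and the indexed counter trains to strongly taken, so the branch predicts correctly thereafter.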

Memory Dependences

Memory dependences are harder to detect than register dependences. Register names are encoded in instructions and checks happen early. Memory addresses must be computed by an AGU first, and that computation may itself stall due to register dependences. Two different-looking addresses may also alias to the same location. The naive solution is to execute loads and stores in program order, but this blocks ILP across loop iterations since stores cannot overwrite cache data before committing.

Load Bypassing. A load can bypass older stores if all older store addresses are known and none match. Store buffer comparators check the load address against all pending stores, letting later-iteration loads access the cache early without waiting for older stores to commit.

Store Forwarding. If a load is data dependent on an older store, it retrieves data directly from that store’s physical register rather than the cache. This is critical for x86, where frequent register spills and reloads would otherwise serialize on memory.
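Load bypassing and store forwarding can be combined in one sketch of a store buffer lookup. The structure is illustrative: the conservative policy shown here stalls on any unknown older store address, i.e. it does not yet speculate (that is the subject of the next paragraph).

```python
# Illustrative store buffer supporting load bypassing and store forwarding.
class StoreBuffer:
    def __init__(self):
        self.stores = []   # [addr-or-None, value], oldest first; None = unknown

    def add(self, addr, value):
        self.stores.append([addr, value])

    def load(self, addr, cache):
        """Return (value, source). Scan younger-to-older: forward from the
        youngest matching store; bypass to the cache only if every older
        store address is known and none matches."""
        for st_addr, st_val in reversed(self.stores):
            if st_addr is None:
                return None, "stall"        # unknown address: cannot bypass
            if st_addr == addr:
                return st_val, "forwarded"  # store-to-load forwarding
        return cache.get(addr, 0), "cache"  # load bypassing
```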

Speculative Memory Disambiguation. Loads can bypass stores whose addresses are not yet known, speculating no dependence exists. When those addresses resolve, checks are performed. A wrong speculation flushes and re-executes the load and all dependents. Naively speculating can be worse than not speculating, so processors implement selective speculation, learning which loads are likely dependent and stalling only those.

Store Sets. Two structures track which load-store pairs have caused past mispeculations:

  • SSIT (Store Set ID Table): Indexed by instruction PC, maps each load/store to a Store Set ID.
  • LFST (Last Fetched Store Table): Indexed by Store Set ID, holds the last fetched store in each set.

The instruction PC indexes into the SSIT (1024 entries), which returns a valid Store Set ID. That ID then indexes into the LFST (64 entries), which holds the PC of the last fetched store in that set. A load with the same SSIT entry stalls if that LFST entry is valid and the store has not yet computed its address.

On mispeculation, a Store Set ID is assigned to both the load and store (sets are merged if they already differ). On store dispatch, the store records itself in the LFST, which clears once its address is known. On load dispatch, if the LFST holds a pending store without a known address, the load stalls. Both tables reset periodically to shed over-conservative entries caused by aliasing and set growth.
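The SSIT/LFST interaction above can be sketched as follows. Table sizes follow the text; the interface is simplified (set merging and periodic reset are omitted) and purely illustrative.

```python
# Illustrative store-set dependence predictor (SSIT + LFST).
class StoreSets:
    def __init__(self, ssit_size=1024, lfst_size=64):
        self.ssit = [None] * ssit_size   # PC index -> Store Set ID
        self.lfst = [None] * lfst_size   # Store Set ID -> last fetched store PC
        self.next_id = 0

    def train(self, load_pc, store_pc):
        # On mispeculation: place load and store in the same store set.
        sid = self.ssit[store_pc % len(self.ssit)]
        if sid is None:
            sid = self.next_id
            self.next_id = (self.next_id + 1) % len(self.lfst)
        self.ssit[load_pc % len(self.ssit)] = sid
        self.ssit[store_pc % len(self.ssit)] = sid

    def store_dispatch(self, store_pc):
        sid = self.ssit[store_pc % len(self.ssit)]
        if sid is not None:
            self.lfst[sid] = store_pc       # record pending store in its set

    def store_addr_known(self, store_pc):
        sid = self.ssit[store_pc % len(self.ssit)]
        if sid is not None and self.lfst[sid] == store_pc:
            self.lfst[sid] = None           # clear once the address resolves

    def load_must_stall(self, load_pc):
        # Stall only if the load's set holds a store with an unknown address.
        sid = self.ssit[load_pc % len(self.ssit)]
        return sid is not None and self.lfst[sid] is not None
```

A load with no mispeculation history maps to no store set and speculates freely; only previously offending load-store pairs are serialized.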

Multiprocessor systems add further ordering constraints even after addresses are known; the hardware learns when reordering causes consistency violations and restricts speculation accordingly.

Pipeline Macro-Stages

Instructions flow through four stages: Issue (in-order, allocate ROB and scheduler entry), Dispatch (OOO, when all operands ready), Completion (OOO, compute and broadcast result), and Commit or Recovery (in-order at ROB head). Modern superscalar processors synthesize these into three macro-stages:

  • Front-end (In-order, Speculative): Fetches and decodes instruction bundles. Predicts branches, renames registers, and allocates ROB and scheduler entries. Halts fetch if physical registers or ROB entries are exhausted.
  • Execution Engine (Out-of-order, Speculative): Tracks operand readiness across the instruction window. Dispatches ready instructions to functional units and buffers outcomes in the ROB.
  • Back-end (In-order, Non-speculative): Governs the ROB. Retires instructions chronologically, updates permanent architectural state, and coordinates precise recovery.

Advanced Issues

Multiple Issue Complexity

To correctly issue instructions in a single clock cycle, the hardware must process the entire bundle concurrently:

  • Allocate and update ROB entries, scheduler/load-store queue entries, and up to N physical registers for an N-wide bundle.
  • Rename all input operands while simultaneously resolving data and name dependences between the instructions in the same bundle.
  • Hardware complexity scales as O(N²) in the issue width N, capping practical issue widths at 4–8 instructions per cycle.
  • Dispatching is vastly simpler since all dependences are fully resolved by that stage.

Speculation and Energy

  • Speculation executes instructions past an unresolved branch, decoupling long-latency operations from control dependences.
  • Speculation degrades performance if it triggers cache or TLB misses that would not have occurred during non-speculative execution.
  • Wide-issue processors frequently have multiple unresolved branches in flight. Branch predictors must update history registers and target structures before previous branches resolve, and speculative updates must be tracked and rolled back on misprediction.
  • Incorrect speculation wastes dynamic power on unneeded instructions and incurs additional cost reverting speculative state.
  • The speculation infrastructure (branch predictors, renaming tables, instruction windows, ROB) constantly consumes static and dynamic power.
  • Integer workloads average 30% mispeculation rates, making speculation highly energy-inefficient, worsened by the end of Dennard scaling.

Limits of Superscalar Processors

  • Pipeline Complexity and Power: Wide-issue logic forces architects toward SIMD (DLP) and multicore (TLP), which require less complex hardware.
  • Branch Prediction Accuracy: Deep pipelines amplify misprediction penalties. Minor drops in accuracy render wide pipelines ineffective.
  • The Memory Wall: OOO execution hides L1 miss latency but cannot hide L3 or main memory misses taking 50–100 cycles. A single long-latency miss at the ROB head blocks all younger instructions from committing. Once the ROB fills, the front-end stalls, neutralizing all remaining ILP.

Even with perfect branch prediction, unlimited registers, and perfect memory disambiguation, available ILP in real-world programs is inherently limited. The industry shifted toward multicore and heterogeneous architectures as a result.