Instruction-level parallelism (ILP) is the potential overlap in the execution of instructions within a program. Maximizing ILP is the primary method for improving uniprocessor performance. There are two largely separable approaches to discovering and exploiting ILP:

  • Dynamic (Hardware-based): Relies on hardware mechanisms to discover and exploit parallelism at runtime. This approach dominates desktop, server, and mobile processor designs.
  • Static (Software-based): Relies on compiler technology to find parallelism statically at compile time. This approach is primarily successful in domain-specific architectures or well-structured scientific applications.

Pipelining

Pipelining is an implementation technique that exploits parallelism among instructions in a sequential instruction stream by overlapping their execution.

  • Pipe Stages (Segments): The discrete, sequential steps that comprise a pipeline, where each stage completes a fraction of an instruction’s execution.

  • Processor Cycle: The time required to move an instruction one step down the pipeline, usually one clock cycle. Its duration is dictated by the slowest pipeline stage.

  • Ideal Speedup: If stages are perfectly balanced, the theoretical speedup equals the number of pipe stages. The idealized time per instruction is defined as:

      Time per instruction (pipelined) = Time per instruction (unpipelined) / Number of pipe stages

  • Throughput vs. Latency: Pipelining increases processor instruction throughput (number of instructions completed per unit of time) but does not reduce the execution time of an individual instruction.
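The ideal pipelining model above can be sketched in a few lines. This is a toy calculation, not a real performance model; the 10 ns figure is an illustrative assumption.

```python
# Ideal pipelining model: with perfectly balanced stages, time per
# instruction drops by the number of stages, so throughput rises by
# the same factor while individual-instruction latency is unchanged.

def pipelined_time_per_instruction(unpipelined_time_ns: float, stages: int) -> float:
    """Idealized time per instruction on the pipelined machine."""
    return unpipelined_time_ns / stages

def ideal_speedup(stages: int) -> int:
    """With perfectly balanced stages, speedup equals pipeline depth."""
    return stages

# A 5-stage pipeline on a machine taking 10 ns per instruction (made-up number):
print(pipelined_time_per_instruction(10.0, 5))  # 2.0
print(ideal_speedup(5))                         # 5
```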

Five-Stage RISC Pipeline

The fundamental pipeline structure decomposes instruction execution into five distinct stages:

  • Instruction Fetch (IF): Send the program counter (PC) to memory, fetch the instruction, and increment the PC by 4.
  • Instruction Decode/Register Fetch (ID): Decode the instruction, read source registers, test for branch equality, sign-extend offset fields, and compute potential branch targets. Decoding occurs in parallel with register reading via fixed-field decoding.
  • Execution/Effective Address (EX): The ALU performs operations based on instruction type:
    • Memory reference: Adds base register and offset for an effective address.
    • Register-Register/Immediate: Performs ALU operation on register and register/immediate values.
  • Memory Access (MEM): Read from data memory for load instructions or write to data memory for store instructions.
  • Write-Back (WB): Write the loaded data or ALU result into the destination register.
  Clock cycle         1    2    3    4    5    6    7    8    9
  Instruction i       IF   ID   EX   MEM  WB
  Instruction i+1          IF   ID   EX   MEM  WB
  Instruction i+2               IF   ID   EX   MEM  WB
  Instruction i+3                    IF   ID   EX   MEM  WB
  Instruction i+4                         IF   ID   EX   MEM  WB

Registers are read in the second half of the clock cycle and written in the first half to prevent internal hazards within the register file. Pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB) sit between successive stages to isolate instructions and carry intermediate data/control signals forward.

Pipeline Performance

The performance of a pipelined processor is governed by the cycles per instruction (CPI), which is the sum of the base theoretical CPI and all contributions from pipeline stalls:

    Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

  • Ideal pipeline CPI: A measure of the maximum performance attainable by the hardware implementation.
  • Structural stalls: Delays caused by hardware resource limitations preventing concurrent instruction execution.
  • Data hazard stalls: Delays required to maintain correct execution order when instructions depend on the results of previous instructions.
  • Control stalls: Delays caused by branch instructions determining the flow of the program.
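The CPI decomposition above is just a sum of per-hazard stall contributions. A minimal sketch, with made-up stall rates for illustration:

```python
# Pipeline CPI model: actual CPI is the ideal CPI plus the average
# stall cycles per instruction contributed by each hazard class.

def pipeline_cpi(ideal_cpi: float, structural: float,
                 data: float, control: float) -> float:
    return ideal_cpi + structural + data + control

# Illustrative numbers: 0.3 data-hazard and 0.2 control-hazard stalls
# per instruction on top of an ideal CPI of 1.
print(pipeline_cpi(ideal_cpi=1.0, structural=0.0, data=0.3, control=0.2))  # 1.5
```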

Pipeline Hazards

Hazards prevent the next instruction in the stream from executing during its designated clock cycle, degrading performance below the ideal speedup. Assuming an ideal CPI of 1, the actual speedup incorporates these stall cycles:

    Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
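The standard speedup relation (assuming an ideal CPI of 1) can be checked with a quick sketch; the stall rates below are illustrative:

```python
# Speedup over the unpipelined machine, assuming an ideal CPI of 1:
# pipeline depth divided by (1 + stall cycles per instruction).

def pipeline_speedup(depth: int, stalls_per_instruction: float) -> float:
    return depth / (1.0 + stalls_per_instruction)

print(pipeline_speedup(5, 0.0))  # 5.0 — no stalls: the ideal speedup
print(pipeline_speedup(5, 0.5))  # ~3.33 — half a stall per instruction erodes the gain
```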

1. Structural Hazards

Structural hazards arise from resource conflicts when hardware cannot support all overlapped instruction combinations simultaneously.

  • Occur primarily when functional units are not fully pipelined or when multiple instructions compete for a single resource (e.g., one memory port shared by instruction fetch and data access).
  • Addressed by duplicating resources or stalling the pipeline (inserting a pipeline bubble) until the resource is free.

2. Data Hazards

Data hazards occur when the pipeline changes the order of read/write accesses to operands relative to sequential execution. For instructions i and j, with i before j, both using register x:

  • RAW (Read After Write): j reads x before i writes it — j gets a stale value. The most common hazard; present in the simple 5-stage pipeline.
  • WAR (Write After Read): j writes x before i reads it — i gets the wrong (future) value. Impossible in a simple in-order pipeline; only arises with reordering.
  • WAW (Write After Write): j writes x before i does — x is left with i's value instead of j's. Also impossible in a simple in-order pipeline; arises with reordering or variable latencies.

For now only RAW matters. Consider add x1,x2,x3 followed by sub x5,x1,x4 — add writes x1 at the end of WB (cycle 5) but sub reads x1 during ID (cycle 3), so sub gets a stale value. The behavior is not even deterministic: an interrupt between the two instructions could cause WB to complete first, changing what sub reads.

Forwarding

The key insight is that the result doesn’t need to be in the register file — it just needs to exist somewhere in the pipeline. Forwarding routes the ALU result directly from the EX/MEM or MEM/WB pipeline register back to the ALU input of the dependent instruction, bypassing the register file entirely. Both ALU inputs can be forwarded simultaneously from different pipeline registers.
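The forwarding decision described above is a priority check per ALU input. A minimal sketch, assuming pipeline registers expose a destination field (rd) and a register-write flag — field names are illustrative, not a real ISA specification:

```python
# Forwarding-source selection for one ALU input. EX/MEM has priority
# over MEM/WB because it holds the most recent result; register x0 is
# hardwired to zero and never forwarded.

def forward_select(src_reg, ex_mem, mem_wb):
    """Return where the ALU input for register src_reg should come from."""
    if src_reg != 0 and ex_mem["reg_write"] and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    if src_reg != 0 and mem_wb["reg_write"] and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    return "REGFILE"

# Two in-flight writes to x1: take the newer one, from EX/MEM.
ex_mem = {"reg_write": True, "rd": 1}
mem_wb = {"reg_write": True, "rd": 1}
print(forward_select(1, ex_mem, mem_wb))  # EX/MEM
```

Both ALU inputs run this check independently, which is how two operands can be forwarded simultaneously from different pipeline registers.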

Pipeline Interlocks

Forwarding cannot solve every RAW hazard. A load followed immediately by a dependent instruction is the canonical case: the load doesn’t have its data until the end of MEM (cycle 4), but the dependent instruction needs it at the start of that same cycle — forwarding would require going backward in time. A pipeline interlock detects this and stalls the dependent instruction for one cycle, inserting a bubble. After the stall, forwarding from MEM/WB covers the rest.

3. Control Hazards

Control hazards arise from branches that modify the PC. If an instruction alters the PC, the pipeline must fetch from the new target, rendering previously fetched instructions invalid.

  • Static Prediction Schemes:
    • Flush Pipeline: Freeze or delete all instructions after the branch until the target is known. Simple but imposes a fixed, unavoidable penalty.
    • Predict-Not-Taken: Continue fetching sequentially as if the branch weren’t there. If the branch is taken, turn the fetched instruction into a no-op and restart at the target. Processor state must not be modified until the outcome is known.
    • Predict-Taken: Fetch from the target as soon as it is computed (end of ID), one cycle before the condition is resolved. Saves a cycle when the branch is taken. In both predict schemes, the compiler can improve performance by arranging code so the most frequent path matches the hardware’s prediction.
    • Delayed Branch: The instruction immediately following the branch occupies a delay slot and always executes regardless of outcome. The compiler fills this slot with a useful instruction to avoid wasting the cycle. In practice, processors limit this to a single delay slot; RISC-V omitted delayed branches entirely because they complicate dynamic branch prediction.
  • Dynamic Branch Prediction:
    • Branch-Prediction Buffer (Branch History Table): A small memory indexed by the lower bits of the branch PC, each entry holding a bit indicating whether the branch was recently taken. No tags — may alias with other branches but is assumed correct until proven otherwise. On misprediction the bit is flipped.
    • 2-Bit Predictors: A prediction must be wrong twice before it changes direction, encoded as a 4-state saturating counter. This fixes the 1-bit scheme's weakness, where a single atypical outcome (such as a loop exit) flips the prediction and causes two mispredictions per loop iteration pattern. A 4K-entry buffer achieves roughly 82–99% accuracy; increasing size beyond that yields negligible gains — the bottleneck is predictor structure, not capacity.
    • Correlating (Two-Level) Predictors: Use the behavior of the most recent branches to select among multiple 2-bit predictors, capturing patterned branch behavior.
    • Tournament Predictors: Adaptively combine local history predictors and global history predictors, tracking the accuracy of each to choose the best prediction dynamically.
    • Branch-Target Buffers (BTB): Caches the predicted target address, allowing fetch to begin immediately if the PC matches a BTB entry.
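The 2-bit scheme above can be sketched as a saturating counter table indexed by low PC bits. Table size and initial state are arbitrary choices for illustration:

```python
# 2-bit saturating-counter branch predictor with an untagged table
# indexed by the low bits of the branch PC (so distinct branches may
# alias). States 0-1 predict not-taken, 2-3 predict taken.

class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [1] * entries  # start weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop byte offset, keep low bits

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

p = TwoBitPredictor()
pc = 0x400
for _ in range(4):          # a loop branch taken repeatedly saturates the counter
    p.update(pc, taken=True)
p.update(pc, taken=False)   # one loop exit...
print(p.predict(pc))        # True — a single not-taken outcome doesn't flip it
```

This is exactly the "must miss twice" property: after the loop exit the counter drops from 3 to 2, still predicting taken on re-entry.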

Pipeline Implementation

Pipelining the data path necessitates multiplexers and control logic driven by the instruction currently moving through each stage. Control signals are generated in the ID stage and carried along the pipeline registers to dictate EX, MEM, and WB actions.

  • Load Interlocks: Handled by comparators in the ID stage. The logic compares the destination register of an active load in the EX stage (ID/EX.IR[rd]) against the source registers of the newly decoded instruction (IF/ID.IR[rs1] and IF/ID.IR[rs2]). A match triggers a pipeline stall.
  • Forwarding Logic: Evaluated dynamically at the start of the EX stage. Comparators check the destination registers of instructions in the MEM and WB stages against the source registers of the instruction in EX. Matches trigger multiplexers to route data from the EX/MEM or MEM/WB latches directly to the ALU inputs.
  • Branch Optimization: Branch condition testing and target calculation are moved to earlier stages (e.g., ID) to minimize the branch penalty. This reduces the stall duration but requires additional forwarding paths to the branch evaluation logic.
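The ID-stage load-interlock comparison described above reduces to a couple of register-number comparisons. A minimal sketch with illustrative field names:

```python
# Load-interlock detection in ID: stall one cycle when a load in EX
# targets a source register of the instruction being decoded. x0 is
# hardwired to zero, so a load targeting it never forces a stall.

def load_interlock_stall(id_ex, if_id):
    """True if the newly decoded instruction must stall one cycle."""
    return (id_ex["is_load"]
            and id_ex["rd"] != 0
            and id_ex["rd"] in (if_id["rs1"], if_id["rs2"]))

# ld x1, 0(x2) in EX; add x3, x1, x4 being decoded -> stall required.
print(load_interlock_stall({"is_load": True, "rd": 1},
                           {"rs1": 1, "rs2": 4}))  # True
```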

The 5-stage pipeline datapath with forwarding. Pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB) sit between each stage, carrying values and control signals forward. The PC acts as a pipeline register before IF and is written in exactly one stage to avoid branch conflicts. Most data flows left to right; the two right-to-left paths — register write-back and branch PC — are the primary sources of pipeline complexity.

Exception Handling

Pipelining complicates exceptions because multiple instructions are in-flight simultaneously — it becomes unclear which instruction caused the exception and what state has already been modified. Exception sources include: I/O requests, OS service calls, breakpoints, integer/FP overflow, page faults, misaligned access, protection violations, undefined instructions, hardware faults, and power failure.

Exceptions vary across five axes that determine how the hardware must respond:

  • Synchronous vs. Asynchronous: Synchronous exceptions occur at the same point every run (e.g., page fault, overflow). Asynchronous exceptions come from external devices and can be deferred until the current instruction completes — making them easier to handle.
  • User Requested vs. Coerced: User-requested exceptions (e.g., a syscall) are predictable and always handled after the instruction completes. Coerced exceptions (e.g., page fault) are triggered by hardware events outside program control and are harder to implement.
  • User Maskable vs. Nonmaskable: Whether the user program can suppress the hardware’s response.
  • Within vs. Between Instructions: Exceptions occurring mid-instruction (e.g., during EX or MEM) require the instruction to be stopped and restarted — the hardest case. Those occurring between instructions are straightforward.
  • Resume vs. Terminate: Resumable exceptions require the processor to save state, handle the event, and cleanly restart the faulting instruction. Terminating exceptions simply stop execution.

The hardest case is a synchronous, coerced, within-instruction, resumable exception — e.g., a page fault during MEM. This requires the pipeline to be restartable: state is saved, the exception handled, and the instruction re-executed as if nothing happened.

Precise Exceptions

A pipeline has precise exceptions if, when an exception is taken, all instructions before the faulting instruction have completed and all instructions after it (including the faulting one) have had no effect on processor state.

Multiple exceptions can fire in the same cycle — e.g., a load in MEM causes a page fault while an add in EX causes an overflow. Exceptions may also arrive out of program order (a later instruction can fault earlier in the pipeline). To handle this correctly:

  1. Each instruction carries an exception status vector through the pipeline registers alongside it.
  2. Once an exception is posted, all register and memory writes for that instruction and all younger instructions are disabled — no state changes occur.
  3. When an instruction reaches WB, its status vector is checked. Exceptions are handled in program order — the earliest instruction’s exception first.
  4. The OS saves the faulting PC, resolves the exception, and restarts from that instruction.

This guarantees precise, in-order exception delivery regardless of how instructions overlap in the pipeline.
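The status-vector mechanism in the steps above can be sketched as a scan over in-flight instructions in program order; the data structures are illustrative, not a hardware description:

```python
# Exception status vectors: each in-flight instruction carries any
# posted exceptions through the pipeline. At WB the earliest
# instruction's exception is handled first, regardless of which
# exception fired first in time.

def first_exception(pipeline):
    """pipeline: in-flight instructions in program order, each carrying
    a list of posted exceptions. Returns (index, exception) for the
    exception to handle first, or None if nothing is pending."""
    for i, instr in enumerate(pipeline):
        if instr["exceptions"]:
            return i, instr["exceptions"][0]
    return None

# A later add overflows in EX before an earlier load page-faults in MEM,
# but the load is earlier in program order, so its fault is taken first.
pipeline = [
    {"op": "ld",  "exceptions": ["page_fault"]},
    {"op": "add", "exceptions": ["overflow"]},
]
print(first_exception(pipeline))  # (0, 'page_fault')
```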

Multicycle and FP Pipelines

Floating-point (FP) operations require execution times exceeding a single clock cycle, necessitating structural changes to the pipeline.

FP Pipeline Structure

The EX stage is divided into multiple independent functional units, varying in pipelining and duration:

  • Integer Unit: 1-cycle latency, fully pipelined.
  • FP Adder: Multi-cycle latency (e.g., 3 cycles), fully pipelined.
  • FP/Integer Multiplier: Multi-cycle latency (e.g., 6 cycles), fully pipelined.
  • FP/Integer Divider: High latency (e.g., 24 cycles), unpipelined (requires initiation interval matching latency).
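The unit parameters above can be captured as latency plus initiation interval (the cycles between independent issues to the same unit; 1 means fully pipelined). A sketch using the example numbers from the list — these are illustrative values, not a specific processor's timings:

```python
# Functional-unit timing table. The divider is unpipelined, so its
# initiation interval matches its latency; the others accept a new
# operation every cycle.

UNITS = {
    "integer": {"latency": 1,  "initiation_interval": 1},
    "fp_add":  {"latency": 3,  "initiation_interval": 1},
    "fp_mul":  {"latency": 6,  "initiation_interval": 1},
    "fp_div":  {"latency": 24, "initiation_interval": 24},
}

def structural_stall(unit, cycles_since_last_issue):
    """Stall cycles before the unit can accept another operation."""
    gap = UNITS[unit]["initiation_interval"] - cycles_since_last_issue
    return max(0, gap)

print(structural_stall("fp_div", 1))  # 23 — divider still busy
print(structural_stall("fp_add", 1))  # 0  — fully pipelined, issue every cycle
```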

Multicycle Hazards

Because instructions have varying execution times, they no longer reach WB in order. This changes the hazard landscape relative to the integer pipeline:

  • Structural hazards: The divider is not fully pipelined — issue must stall when it’s busy. Separately, varying latencies mean multiple instructions can converge on the single register file write port in the same cycle. Detected in ID using a shift register tracking future write port usage; conflicting instructions stall before issuing. One option is to increase write ports, but that is expensive since the maximum steady-state need is one — it’s better to detect and enforce it as a structural hazard.
  • Increased RAW hazards: Longer latencies widen the producer-consumer gap, causing more frequent stalls than in the integer pipeline. The increase is fundamentally the same kind of hazard — just more of it.
  • WAW hazards: A later-issued short-latency instruction can write a register before an earlier long-latency one, leaving the wrong value. Note this only occurs when a result is overwritten before any instruction uses it — if there were an intervening read, a RAW hazard would have already stalled the pipeline. Detected in ID — if any in-flight instruction targets the same destination register, stall the new instruction or squash the earlier write.
  • WAR hazards: Not possible — register reads always occur in ID, before any write can overtake them.
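The shift-register write-port tracker described in the structural-hazards bullet can be sketched directly; the window size is an arbitrary assumption:

```python
# Register-file write-port conflict detection in ID: a shift register
# with one bit per future cycle, set when some in-flight instruction
# will use the single write port in that cycle.

class WritePortTracker:
    def __init__(self, horizon=32):
        self.busy = [False] * horizon  # busy[k]: port taken k cycles from now

    def try_reserve(self, cycles_until_writeback):
        """Reserve the write port; return False (i.e., stall) if taken."""
        if self.busy[cycles_until_writeback]:
            return False
        self.busy[cycles_until_writeback] = True
        return True

    def tick(self):
        """Advance one clock cycle: shift the reservation window."""
        self.busy = self.busy[1:] + [False]

t = WritePortTracker()
print(t.try_reserve(6))  # True  — an FP multiply claims WB six cycles out
t.tick()                 # next cycle: that slot is now five cycles away
print(t.try_reserve(5))  # False — a new instruction would collide; stall it
```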

Imprecise Exceptions

When instructions complete out of order, an earlier instruction can fault after a later one has already committed its result — producing an imprecise exception. Four approaches to handle this:

  1. Ignore it — accept imprecise exceptions. Unacceptable with virtual memory or IEEE FP.
  2. Buffer results — hold results until all older instructions commit. Expensive. Variants: history file (saves old register values for rollback) and future file (buffers new values; main file stays precise).
  3. Software reconstruction — record all in-flight PCs, take the exception, let software simulate the incomplete instructions, then resume. Tractable only if FP overlap is limited. Used in some SPARC implementations.
  4. Early exception detection — FP units signal potential exceptions within the first few EX cycles, stalling younger instructions before they commit. Precise exceptions with no buffering cost. Used in MIPS R2000/3000, R4000, Intel Pentium.

Superpipelining

To achieve higher clock rates, designers divide standard pipeline stages into multiple, shorter sub-stages, a technique known as superpipelining.

MIPS R4000 Pipeline

The standard 5-stage pipeline expands to 8 stages, primarily by decomposing memory accesses:

  • IF: First half of instruction fetch.
  • IS: Second half of instruction fetch.
  • RF: Instruction decode, register fetch, hazard checking.
  • EX: Execution, effective address calculation, branch condition evaluation.
  • DF: First half of data fetch.
  • DS: Second half of data fetch.
  • TC: Tag check (cache hit detection).
  • WB: Write-back to register file.

Impacts

  • Increased Latencies: Load-use delays expand (e.g., requiring 2 stall cycles instead of 1). Branch delays lengthen (e.g., 3 cycles), making static prediction less effective and elevating the necessity of dynamic branch prediction.
  • Forwarding Complexity: Data forwarding logic grows substantially, requiring bypass networks across multiple intermediate stages (e.g., EX/DF, DF/DS, DS/TC, and TC/WB).

Dynamic Scheduling

In-order pipelines stall everything when one instruction stalls — even independent later instructions are blocked. Dynamic scheduling lets the hardware reorder execution so independent instructions bypass stalled ones. All instructions still issue in order, but can read operands and execute out of order.

The ID stage splits into two parts:

  1. Issue — decode and check structural hazards (in-order)
  2. Read Operands — wait for data hazards to clear, then execute (out-of-order)

Out-of-order execution introduces WAR and WAW hazards that don't exist in simple in-order pipelines.

Scoreboarding

The scoreboard centralizes all hazard detection. Every instruction passes through four stages:

  • Issue: Checks for structural and WAW hazards. If the functional unit is free and no in-flight instruction targets the same destination register, the instruction issues. Otherwise stalls — no later instructions issue until cleared.
  • Read Operands: Waits until all source registers are not pending writes from earlier instructions (RAW check). Once clear, operands are read from the register file and execution begins. Instructions can enter this stage out of order.
  • Execute: Functional unit runs the operation and notifies the scoreboard on completion.
  • Write Result: Checks for WAR hazards — stalls if any earlier-issued instruction hasn’t yet read a register this instruction is about to overwrite. Once clear, writes result to the register file.

The scoreboard does not use forwarding — operands are only read from the register file when both are ready. The penalty is smaller than it seems because instructions write results as soon as they complete (not at a fixed pipeline slot), reducing effective latency. A structural hazard also exists on the register file bus: the scoreboard limits how many units can read operands or write results simultaneously to match available bus bandwidth.
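The scoreboard's issue-stage check reduces to two tests: a structural hazard on the functional unit and a WAW hazard on the destination register. A minimal sketch — the state layout is illustrative, not the actual scoreboard tables:

```python
# Scoreboard issue check: an instruction issues only if its functional
# unit is free (no structural hazard) and no in-flight instruction
# already targets its destination register (no WAW hazard).

def can_issue(unit_busy, pending_writes, functional_unit, dest_reg):
    if unit_busy[functional_unit]:
        return False  # structural hazard: unit occupied
    if dest_reg in pending_writes:
        return False  # WAW hazard: destination already claimed
    return True

unit_busy = {"fp_add": False, "fp_mul": True}
pending_writes = {"f2"}  # an in-flight multiply targets f2

print(can_issue(unit_busy, pending_writes, "fp_add", "f2"))  # False — WAW
print(can_issue(unit_busy, pending_writes, "fp_add", "f4"))  # True
print(can_issue(unit_busy, pending_writes, "fp_mul", "f4"))  # False — unit busy
```

On a stall here, no later instruction may issue either — issue remains strictly in order, which is what keeps WAW detection this simple.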

The RISC-V processor with a scoreboard. The scoreboard controls instruction execution via vertical control lines. All data flows between the register file and functional units over buses. Two FP multipliers, one FP divider, one FP adder, and one integer unit share one set of buses (two inputs, one output) per group.