Pipelining: Basic and Intermediate Concepts

Pipelining is an implementation technique that exploits parallelism among instructions in a sequential instruction stream by overlapping their execution.

  • Pipe Stages (Segments): The discrete, sequential steps that comprise a pipeline, where each stage completes a fraction of an instruction’s execution.
  • Processor Cycle: The time required to move an instruction one step down the pipeline, normally one clock cycle. Its duration is dictated by the slowest pipeline stage, since every stage advances in lockstep.
  • Ideal Speedup: If stages are perfectly balanced, the theoretical speedup equals the number of pipe stages. The idealized time per instruction is: Time per instruction on the unpipelined machine ÷ Number of pipe stages.
  • Throughput vs. Latency: Pipelining increases processor instruction throughput (number of instructions completed per unit of time) but does not reduce the execution time of an individual instruction.
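As a concrete illustration of the ideal relationship (the numbers here are assumed, not from the text), splitting a 5 ns unpipelined instruction time across five balanced stages multiplies throughput by five while leaving each instruction's latency unchanged:

```python
def pipelined_time_per_instruction(unpipelined_time_ns, num_stages):
    """Idealized time per instruction with perfectly balanced stages."""
    return unpipelined_time_ns / num_stages

# Hypothetical numbers: 5 ns unpipelined, split into 5 balanced stages
# gives 1 ns per instruction -- an ideal speedup equal to the stage count.
t = pipelined_time_per_instruction(5.0, 5)
ideal_speedup = 5.0 / t
```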

The Classic Five-Stage RISC Pipeline

The fundamental pipeline structure decomposes instruction execution into five distinct stages:

  • Instruction Fetch (IF): Send the program counter (PC) to memory, fetch the instruction, and increment the PC by 4.
  • Instruction Decode/Register Fetch (ID): Decode the instruction, read source registers, test for branch equality, sign-extend offset fields, and compute potential branch targets. Decoding occurs in parallel with register reading via fixed-field decoding.
  • Execution/Effective Address (EX): The ALU performs operations based on instruction type:
    • Memory reference: Adds base register and offset for an effective address.
    • Register-Register/Immediate: Performs ALU operation on register and register/immediate values.
  • Memory Access (MEM): Read from data memory for load instructions or write to data memory for store instructions.
  • Write-Back (WB): Write the loaded data or ALU result into the destination register.

Registers are read in the second half of the clock cycle and written in the first half to prevent internal hazards within the register file. Pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB) sit between successive stages to isolate instructions and carry intermediate data/control signals forward.
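The overlap described above can be sketched as a cycle-by-cycle occupancy table; this is a simplified model that assumes independent instructions and no stalls:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions):
    """Return, for each instruction, the cycle in which it occupies each stage.
    With no hazards, instruction i enters IF at cycle i + 1 and moves one
    stage per cycle."""
    return [{stage: i + s + 1 for s, stage in enumerate(STAGES)}
            for i in range(n_instructions)]

diagram = pipeline_diagram(3)
```

Instruction 0 writes back in cycle 5 and every subsequent instruction completes one cycle later, which is exactly where the throughput gain comes from.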

While idealized pipelining assumes independent instructions, overlapping execution exposes inter-instruction dependencies that disrupt the pipeline flow.


Pipeline Hazards

Hazards prevent the next instruction in the stream from executing during its designated clock cycle, degrading performance from the ideal speedup. The actual speedup incorporates these stall cycles:

    Speedup from pipelining = Pipeline depth / (1 + Pipeline stall cycles per instruction)
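The stall-adjusted speedup can be computed directly; this sketch assumes the common simplification of an ideal CPI of 1 and equal clock cycle times on the pipelined and unpipelined machines:

```python
def pipeline_speedup(depth, stalls_per_instruction):
    """Speedup over an unpipelined machine, assuming an ideal CPI of 1
    plus the average stall cycles per instruction."""
    return depth / (1 + stalls_per_instruction)

# Hypothetical: a 5-stage pipeline averaging 0.25 stall cycles per
# instruction achieves a speedup of 4.0 instead of the ideal 5.
```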

Structural Hazards

Structural hazards arise from resource conflicts when hardware cannot support all overlapped instruction combinations simultaneously.

  • Occur primarily when functional units are not fully pipelined or when multiple instructions compete for a single resource (e.g., a single memory port serving both instruction fetch and data access).
  • Addressed by duplicating resources or stalling the pipeline (inserting a pipeline bubble) until the resource is free.

Data Hazards

Data hazards emerge when pipeline overlap changes the order of read/write accesses to operands compared to sequential execution.

  • Read After Write (RAW): A later instruction reads an operand before an earlier instruction writes it, obtaining the stale old value.
  • Write After Read (WAR): A later instruction writes an operand before an earlier instruction reads it, so the read incorrectly obtains the new value.
  • Write After Write (WAW): Two writing instructions execute out of order, leaving the destination register with an incorrect final value.
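All three cases can be detected mechanically from the read/write register sets of an instruction pair; the dictionary encoding below is a simplified illustration, not a real ISA model:

```python
def classify_hazard(first, second):
    """Classify the data hazards from `first` to a later instruction `second`.
    Each instruction is a dict with 'reads' and 'writes' register sets."""
    hazards = []
    if first["writes"] & second["reads"]:
        hazards.append("RAW")   # second reads what first writes
    if first["reads"] & second["writes"]:
        hazards.append("WAR")   # second overwrites what first reads
    if first["writes"] & second["writes"]:
        hazards.append("WAW")   # both write the same destination
    return hazards

# Example: add writes r3, sub reads r3 -> a RAW dependence.
add = {"reads": {"r1", "r2"}, "writes": {"r3"}}
sub = {"reads": {"r3", "r4"}, "writes": {"r5"}}
```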

Overcoming Data Hazards

  • Forwarding (Bypassing): Hardware paths route ALU results directly from EX/MEM and MEM/WB pipeline registers back to the ALU inputs, bypassing the write-back stage and eliminating RAW hazard stalls for consecutive arithmetic operations.
  • Pipeline Interlocks: Hardware detects hazards that forwarding cannot resolve (e.g., a load-use hazard where data is unavailable until the end of the MEM stage). The interlock stalls the dependent instruction in the ID stage until the data is ready.

Control Hazards (Branch Hazards)

Control hazards arise from branches that modify the PC. If an instruction alters the PC, the pipeline must fetch from the new target, rendering previously fetched instructions invalid.

  • Static Prediction Schemes:
    • Flush Pipeline: Hold or delete instructions following the branch until the destination is known.
    • Predict-Not-Taken: Fetch sequential instructions; turn them into no-ops if the branch is taken.
    • Predict-Taken: Assume branch is taken and fetch from target immediately upon calculation.
    • Delayed Branch: Execute the instruction immediately following the branch (in the delay slot) regardless of branch outcome.
  • Dynamic Branch Prediction:
    • Branch-Prediction Buffer (Branch History Table): A small memory indexed by the lower bits of the branch address containing state bits indicating recent branch behavior.
    • 2-Bit Predictors: A finite-state machine requiring two consecutive mispredictions before changing the prediction direction.
    • Correlating (Two-Level) Predictors: Use the behavior of the most recent branches to select among multiple 2-bit predictors, capturing patterned branch behavior.
    • Tournament Predictors: Adaptively combine local history predictors and global history predictors, tracking the accuracy of each to choose the best prediction dynamically.
    • Branch-Target Buffers (BTB): Caches the predicted target address, allowing fetch to begin immediately if the PC matches a BTB entry.

Detecting these hazards and controlling the pipeline data path requires dedicated logic distributed across the pipeline stages.


Pipeline Implementation and Control

Pipelining the data path necessitates multiplexers and logic controlled by the instruction currently migrating through the stages. Control signals are generated in the ID stage and carried along the pipeline registers to dictate EX, MEM, and WB actions.

Hazard Detection Logic

  • Load Interlocks: Handled by comparators in the ID stage. The logic compares the destination register of an active load in the EX stage (ID/EX.IR[rd]) against the source registers of the newly decoded instruction (IF/ID.IR[rs1] and IF/ID.IR[rs2]). A match triggers a pipeline stall.
  • Forwarding Logic: Evaluated dynamically at the start of the EX stage. Comparators check the destination registers of instructions in the MEM and WB stages against the source registers of the instruction in EX. Matches trigger multiplexers to route data from the EX/MEM or MEM/WB latches directly to the ALU inputs.
  • Branch Optimization: Branch condition testing and target calculation are moved to earlier stages (e.g., ID) to minimize the branch penalty. This reduces the stall duration but requires additional forwarding paths to the branch evaluation logic.
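The comparator conditions above can be expressed as simple predicates; the field names used here (`op`, `rd`, `rs1`, `rs2`) are an assumed simplified encoding of the pipeline-register contents:

```python
def load_interlock_stall(ex_instr, id_instr):
    """Stall condition checked in ID: the instruction currently in EX is a
    load whose destination matches a source of the newly decoded instruction.
    `ex_instr` is None when no instruction occupies EX."""
    return (ex_instr is not None
            and ex_instr["op"] == "load"
            and ex_instr["rd"] in (id_instr["rs1"], id_instr["rs2"]))

def forward_source(ex_mem, mem_wb, src_reg):
    """Choose the forwarding source for one ALU input: the younger result
    in EX/MEM takes priority over MEM/WB; otherwise read the register file."""
    if ex_mem is not None and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    if mem_wb is not None and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    return "regfile"
```

Giving EX/MEM priority matters when two in-flight instructions write the same register: the ALU must receive the most recent value in program order.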

Although basic hazard detection maintains correct execution order, unpredictable runtime events can abruptly alter the instruction stream, requiring complex recovery mechanisms.


Exception Handling

Exceptions abruptly alter normal instruction execution flow. Handling exceptions within a pipelined architecture is complex due to multiple instructions occupying the pipeline simultaneously.

Classification of Exceptions

  • Synchronous vs. Asynchronous: Triggered by the program itself (synchronous) or by external devices (asynchronous).
  • User Requested vs. Coerced: Intentionally invoked by the user or forced by hardware events.
  • User Maskable vs. Nonmaskable: Can be ignored/disabled by software, or strictly enforced.
  • Within vs. Between Instructions: Occurring mid-execution (requires halting and restarting) or cleanly between instruction boundaries.
  • Resume vs. Terminate: Program execution continues after handling, or execution is aborted.

Precise Exceptions

A pipeline supports precise exceptions if the hardware can stop execution such that all instructions prior to the faulting instruction complete, while the faulting instruction and all subsequent instructions are completely nullified.

  • Implementation: Exceptions are not handled immediately. Instead, an exception status vector is attached to the instruction and flows with it through the pipeline registers.
  • Enforcement: When the instruction reaches the WB stage, the status vector is checked. If an exception is flagged, register and memory writes are disabled. The OS saves the faulting PC, resolves the exception, and restarts execution.
  • Instruction Set Complications: Instructions that modify state mid-execution (e.g., autoincrement addressing) or execute over many cycles (e.g., string copies) complicate precise exception handling, often requiring hardware to roll back intermediate state changes.
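The status-vector mechanism can be sketched as a check performed when an instruction reaches WB; the field names below are assumptions of this simplified model:

```python
def commit_at_wb(instr):
    """Precise-exception check at WB: an instruction's exception vector
    accumulates per-stage flags as it flows down the pipeline. If any flag
    is set at WB, state updates are suppressed and the exception from the
    earliest pipeline stage is reported."""
    stage_order = ["IF", "ID", "EX", "MEM", "WB"]
    if any(instr["exc_vector"].values()):
        stage = next(s for s in stage_order if instr["exc_vector"].get(s))
        return {"commit": False, "fault_stage": stage, "pc": instr["pc"]}
    return {"commit": True, "fault_stage": None, "pc": instr["pc"]}
```

Checking only at WB, rather than when the fault occurs, is what keeps exception handling in program order even though flags may be raised in any stage.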

Complex instructions and varying latencies complicate exception handling, a problem highly amplified when extending the pipeline for multi-cycle floating-point operations.


Multicycle Operations and Floating-Point Pipelines

Floating-point (FP) operations require execution times exceeding a single clock cycle, necessitating structural changes to the pipeline.

FP Pipeline Structure

The EX stage is divided into multiple independent functional units, varying in pipelining and duration:

  • Integer Unit: 1-cycle latency, fully pipelined.
  • FP Adder: Multi-cycle latency (e.g., 3 cycles), fully pipelined.
  • FP/Integer Multiplier: Multi-cycle latency (e.g., 6 cycles), fully pipelined.
  • FP/Integer Divider: High latency (e.g., 24 cycles), unpipelined; its initiation interval matches its latency, so a new divide cannot start until the previous one completes.
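The issue restriction imposed by an unpipelined unit follows directly from its initiation interval; the table below uses the example latencies given above, with an assumed initiation interval of 1 for the fully pipelined units:

```python
# A unit accepts a new operation only when its previous operation is at
# least `initiation_interval` cycles old. Values follow the example above.
FUNCTIONAL_UNITS = {
    "integer": {"latency": 1,  "initiation_interval": 1},
    "fp_add":  {"latency": 3,  "initiation_interval": 1},
    "fp_mul":  {"latency": 6,  "initiation_interval": 1},
    "fp_div":  {"latency": 24, "initiation_interval": 24},  # unpipelined
}

def can_issue(unit, cycle, last_issue_cycle):
    """True if `unit` can accept a new operation at `cycle`, given the cycle
    it last accepted one (None if the unit has been idle)."""
    if last_issue_cycle is None:
        return True
    interval = FUNCTIONAL_UNITS[unit]["initiation_interval"]
    return cycle - last_issue_cycle >= interval
```

A fully pipelined adder accepts back-to-back operations, while a second divide issued right behind a first one stalls for the divider's full latency.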

New Hazards in Multicycle Pipelines

  • Structural Hazards: Slower, unpipelined units (like the divider) restrict continuous instruction issue. Additionally, varying latencies allow multiple instructions to attempt writing to the register file in the same WB cycle.
  • WAW Hazards: Out-of-order completion is introduced because a short-latency instruction can complete before a long-latency instruction issued earlier. The pipeline must stall issue if a WAW hazard is detected.
  • Increased RAW Hazards: Longer latencies increase the separation required between producers and consumers, generating more frequent RAW stalls.

Out-of-Order Completion and Exceptions

Out-of-order completion violates strict precise exception models. A long-latency instruction (e.g., divide) might fault after a subsequent short-latency instruction (e.g., add) has already updated the processor state.

  • Solutions:
    • Maintain history files or future files to restore the unmodified register state.
    • Limit the overlap of FP instructions, enforcing a slower, precise execution mode.
    • Buffer results until all preceding instructions complete successfully.

The challenges of multi-cycle execution are compounded when performance demands push architectures toward even deeper pipelining.


Deep Pipelining: Superpipelining

To achieve higher clock rates, designers divide standard pipeline stages into multiple shorter sub-stages, a technique known as superpipelining.

Structure of a Deeper Pipeline (e.g., MIPS R4000)

The standard 5-stage pipeline expands to 8 stages, primarily by decomposing memory accesses:

  • IF: First half of instruction fetch.
  • IS: Second half of instruction fetch.
  • RF: Instruction decode, register fetch, hazard checking.
  • EX: Execution, effective address calculation, branch condition evaluation.
  • DF: First half of data fetch.
  • DS: Second half of data fetch.
  • TC: Tag check (cache hit detection).
  • WB: Write-back to register file.

Impacts of Superpipelining

  • Increased Latencies: Load-use delays expand (e.g., requiring 2 stall cycles instead of 1). Branch delays lengthen (e.g., 3 cycles), making static prediction less effective and elevating the necessity of dynamic branch prediction.
  • Forwarding Complexity: The number of bypass paths grows with pipeline depth, since results must now be forwarded from each of several later stages (e.g., EX/DF, DF/DS, DS/TC, and TC/WB) back to the ALU inputs.

As static pipelines deepen and structural constraints mount, rigid in-order execution limits throughput, necessitating dynamic approaches to instruction scheduling.


Dynamic Scheduling and Scoreboarding

In a statically scheduled pipeline, an in-order issue policy forces the entire pipeline to stall upon encountering a hazard, even if subsequent instructions are fully independent. Dynamic scheduling allows the hardware to rearrange execution order to circumvent stalls.

Scoreboarding

Scoreboarding centralizes hazard detection and resolution, decoupling instruction issue from execution. The traditional ID stage is split into two dynamically monitored phases:

  • Issue: Decodes instructions and checks for structural and WAW hazards. If the required functional unit is free and no active instruction shares the destination register, the instruction issues. Otherwise, it stalls, enforcing in-order issue but preparing for out-of-order execution.
  • Read Operands: The scoreboard monitors RAW hazards. The instruction waits until all source operands are generated by prior active instructions. Once available, operands are read and execution begins.

Out-of-Order Execution and Completion

  • Execute: The functional unit executes the operation over its required multicycle latency and notifies the scoreboard upon completion.
  • Write Result: Checks for WAR hazards. If a previously issued instruction has not yet read the old value of the destination register, the completing instruction stalls its write-back until the operand is safely consumed.
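The issue and read-operands checks can be sketched as predicates over scoreboard state; `pending_dest` (the set of destination registers of active instructions) is a simplification of the scoreboard's register-result status table:

```python
def scoreboard_can_issue(instr, unit_busy, pending_dest):
    """Issue-stage check: stall on a structural hazard (functional unit
    busy) or a WAW hazard (an active instruction already targets the same
    destination register). Returns (ok, reason)."""
    if unit_busy.get(instr["unit"], False):
        return False, "structural"
    if instr["rd"] in pending_dest:
        return False, "WAW"
    return True, None

def scoreboard_can_read(instr, pending_dest):
    """Read-operands check: proceed only when no active instruction will
    still write either source register (the RAW hazard condition)."""
    return (instr["rs1"] not in pending_dest
            and instr["rs2"] not in pending_dest)
```

Because issue remains in order, a stalled instruction also blocks those behind it; execution and completion, however, may proceed out of order once operands are read.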

Dynamic scheduling mitigates latency and hazard penalties, closing the gap between raw hardware limits and effective application throughput.


Fallacies and Pitfalls

  • Pitfall: Unexpected execution sequences may cause unexpected hazards. Highly uncommon code sequences, such as an exception trap routine writing to a register that is the destination of a still-pending multi-cycle divide, can trigger rare WAW hazards that hardware or software must meticulously guard against.
  • Pitfall: Extensive pipelining can negatively impact overall cost-performance. Deepening the pipeline increases the prevalence of clock skew and latch overhead. If these overheads consume a large proportion of the shrunken clock cycle, further pipelining yields diminishing returns and degrades hardware efficiency (e.g., the complex VAX 8600 versus the simpler, faster VAX 8700).
  • Pitfall: Evaluating scheduling strategies based on unoptimized code. Optimizing compilers eliminate redundant loads and operations, tightening the code and making it harder to schedule. Evaluating dynamic or static scheduling on loose, unoptimized code falsely inflates the perceived performance benefit of the scheduling hardware.