The Processor

Logic Design and Clocking Methodology

Processor hardware consists of combinational elements that operate on data and sequential state elements that store data.

  • Combinational elements: Logic blocks (e.g., ALUs, adders, multiplexors) where the output depends exclusively on current inputs. These blocks contain no internal storage.
  • State elements: Memory components (e.g., registers, data memory) that store values. Outputs depend on both inputs and internal state.
  • Clocking methodology: Defines when data signals are valid and can be written to state elements.
  • Edge-triggered clocking: All state updates occur strictly on a clock edge (a quick transition from low to high). This permits reading a state element, passing its value through combinational logic, and writing the result back to the same state element within a single clock cycle without race conditions.
  • Clock timing constraints: The clock period must accommodate the longest propagation delay through the logic.

  • Clock skew: The difference in the time at which two separate state elements see the same clock edge. The clock period must be lengthened by the worst-case skew so that data cannot race ahead through an element that sees the edge early; both timing constraints are sketched below.
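
To make the two timing constraints concrete, here is a minimal Python sketch; the delay, setup, and skew figures are invented for illustration, not taken from any real design.

    # Minimum clock period under edge-triggered clocking. All numbers are
    # illustrative assumptions, in picoseconds.
    LOGIC_PATHS_PS = {
        "reg -> ALU -> reg": 250,            # worst-case combinational delays
        "reg -> ALU -> data mem -> reg": 400,
    }
    SETUP_TIME_PS = 30        # input must be stable this long before the edge
    MAX_CLOCK_SKEW_PS = 20    # worst-case edge-arrival difference

    # The period must cover the slowest path plus setup, lengthened by skew.
    min_period_ps = max(LOGIC_PATHS_PS.values()) + SETUP_TIME_PS + MAX_CLOCK_SKEW_PS
    print(f"minimum clock period: {min_period_ps} ps "
          f"(max frequency ~{1000 / min_period_ps:.2f} GHz)")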

The reliable synchronization of state elements via edge-triggered clocking forms the foundation for constructing a processor datapath capable of sequential instruction execution.

Datapath Construction and Single-Cycle Execution

The datapath merges required functional components to execute instructions. In a single-cycle implementation, every instruction executes entirely within one clock cycle.

  • Instruction Fetch: The Program Counter (PC) specifies the address to read from Instruction Memory. An adder permanently wired to add 4 increments the PC for the next sequential instruction.
  • Registers: The Register File contains 32 general-purpose 64-bit registers. It features two independent read ports and one write port. It can be read and written in the same clock cycle: writes occur in the first half of the cycle and reads in the second half, so a read returns the value written in that same cycle.
  • Execution: The Arithmetic Logic Unit (ALU) performs arithmetic (addition, subtraction) and logical (AND, OR) operations. It evaluates branch conditions by subtracting the operands and asserting its Zero output when the result is zero, i.e., when the operands are equal.
  • Memory Access: The Data Memory unit reads or writes data. It requires explicit read and write control signals because invalid reads can trigger exceptions.
  • Immediate Generation: The ImmGen unit extracts the 12-bit constant field from I-type, S-type, and SB-type instructions and sign-extends it to a 64-bit value for ALU consumption or branch address calculation (see the sketch after this list).
  • Multiplexors: Used to steer data paths when multiple sources converge on a single input.
    • ALU input: Selects between a register read value and a sign-extended immediate.
    • Register write data: Selects between the ALU result and the Data Memory read output.
    • PC next address: Selects between the sequentially incremented PC and the calculated branch target address.
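
As a concrete illustration of three of these components, here is a small Python sketch of I-type immediate generation, a two-way multiplexor, and an ALU with a Zero output. The bit positions follow the RISC-V I-type format; the function names and the toy ALU are illustrative, not an actual implementation.

    def imm_gen_itype(instr: int) -> int:
        """Extract bits [31:20] and sign-extend the 12-bit immediate to 64 bits."""
        imm12 = (instr >> 20) & 0xFFF
        if imm12 & 0x800:                  # sign bit set -> extend with ones
            imm12 -= 0x1000
        return imm12 & 0xFFFFFFFFFFFFFFFF  # 64-bit two's-complement view

    def mux2(sel: int, in0: int, in1: int) -> int:
        """Steer one of two converging sources onto a single input."""
        return in1 if sel else in0

    def alu(op: str, a: int, b: int):
        """Toy ALU: returns (result, zero); subtract-and-test-Zero drives beq."""
        ops = {"add": a + b, "sub": a - b, "and": a & b, "or": a | b}
        result = ops[op] & 0xFFFFFFFFFFFFFFFF
        return result, int(result == 0)

    # addi-style instruction with immediate -5 encoded in bits [31:20]
    instr = (0xFFB << 20) | 0b0010011      # I-type ALU opcode
    imm = imm_gen_itype(instr)
    result, zero = alu("add", 7, mux2(1, 0, imm))  # ALUSrc = 1 picks the immediate
    print(hex(imm), result, zero)          # 7 + (-5) = 2, Zero not asserted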

Orchestrating these distinct datapath components requires a centralized control mechanism to decode instructions and generate precise hardware steering signals.

Control Unit Implementation

The control unit translates the binary instruction fields into active hardware signals.

  • Instruction Decoding: The 7-bit opcode field determines the active datapath elements.
  • Main Control Signals:
    • RegWrite: Asserts to write data into the destination register.
    • ALUSrc: Selects the second ALU operand (0 = register, 1 = sign-extended immediate).
    • PCSrc: Selects the PC input (0 = PC + 4, 1 = branch target address).
    • MemRead / MemWrite: Asserts to read or write Data Memory.
    • MemtoReg: Selects the value routed to the Register File write port (0 = ALU result, 1 = Data Memory).
  • Two-Level ALU Control: To reduce main control complexity, ALU control is decoupled. The main control generates a 2-bit ALUOp signal (00 for loads/stores, 01 for branches, 10 for R-type). A secondary ALU Control block evaluates ALUOp alongside the instruction’s funct7 and funct3 fields to output a definitive 4-bit signal directing the ALU hardware; both decode levels are sketched after this list.
  • Truth Tables: Combinational control logic is explicitly defined via truth tables linking opcode inputs to output signals. Unused conditions are mapped as “don’t-care” terms to optimize hardware synthesis.
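
The decode logic above can be sketched as a pair of lookup tables. The opcodes are the real RISC-V encodings for R-type, ld, sd, and beq, and the ALU control encodings follow the common textbook convention (0010 add, 0110 subtract, 0000 AND, 0001 OR); the Python packaging itself is illustrative.

    # Main control: opcode -> asserted signal values. PCSrc itself is derived
    # later as (Branch AND ALU Zero). None marks a don't-care output.
    MAIN_CONTROL = {
        # opcode:  (ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp)
        0b0110011: (0,      0,        1,        0,       0,        0,      0b10),  # R-type
        0b0000011: (1,      1,        1,        1,       0,        0,      0b00),  # ld
        0b0100011: (1,      None,     0,        0,       1,        0,      0b00),  # sd
        0b1100011: (0,      None,     0,        0,       0,        1,      0b01),  # beq
    }

    def alu_control(alu_op: int, funct7: int, funct3: int) -> int:
        """Second level: ALUOp plus funct fields -> 4-bit ALU control."""
        if alu_op == 0b00:                 # loads/stores add the address offset
            return 0b0010
        if alu_op == 0b01:                 # branches subtract to test equality
            return 0b0110
        r_type = {(0b0000000, 0b000): 0b0010,   # add
                  (0b0100000, 0b000): 0b0110,   # sub
                  (0b0000000, 0b111): 0b0000,   # and
                  (0b0000000, 0b110): 0b0001}   # or
        return r_type[(funct7, funct3)]

    print(MAIN_CONTROL[0b0000011])                   # signals asserted for ld
    print(bin(alu_control(0b10, 0b0100000, 0b000)))  # R-type sub -> 0b110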

While a single-cycle implementation is functionally correct, its clock cycle is dictated by the longest instruction path, typically the load instruction, as the arithmetic below illustrates; scalable performance therefore demands overlapped execution models.
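
A quick back-of-the-envelope calculation shows why. The per-unit latencies below are illustrative assumptions in picoseconds, not measured values.

    # Illustrative per-unit latencies in picoseconds (assumed, not measured).
    IMEM, REG_READ, ALU, DMEM, REG_WRITE = 200, 100, 200, 200, 100

    paths_ps = {
        "R-type": IMEM + REG_READ + ALU + REG_WRITE,          # 600 ps
        "ld":     IMEM + REG_READ + ALU + DMEM + REG_WRITE,   # 800 ps (longest)
        "sd":     IMEM + REG_READ + ALU + DMEM,               # 700 ps
        "beq":    IMEM + REG_READ + ALU,                      # 500 ps
    }
    for name, t in paths_ps.items():
        print(f"{name:6s} {t} ps")
    print("single-cycle clock period must cover", max(paths_ps.values()), "ps")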

Pipelining Fundamentals

Pipelining accelerates execution by overlapping multiple instructions in hardware.

  • Pipeline Stages: RISC-V divides execution into five distinct stages:
    1. IF (Instruction Fetch): Fetch instruction from memory and increment PC.
    2. ID (Instruction Decode): Read registers and compute control signals.
    3. EX (Execution): Calculate memory addresses, compute ALU results, or evaluate branch conditions.
    4. MEM (Memory Access): Read or write Data Memory.
    5. WB (Write-Back): Write ALU or memory results back into the Register File.
  • Performance Metrics: Pipelining improves total instruction throughput rather than individual instruction latency. A five-stage pipeline can approach a fivefold speedup over a non-pipelined equivalent, but only if the stages are balanced: the clock period is set by the slowest physical stage (see the sketch after this list).
  • Pipeline Registers: To pass data between stages across clock boundaries, state elements (IF/ID, ID/EX, EX/MEM, MEM/WB) are inserted between all stages. They hold instructions, data, and propagating control signals so that each stage sees stable inputs for exactly one clock cycle.
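
A small sketch of that throughput argument, reusing the illustrative latencies from the single-cycle calculation: because the clock is set by the slowest stage, unbalanced stages cap the speedup below the stage count.

    STAGES_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

    def nonpipelined_time(n: int) -> int:
        # Each instruction traverses every unit before the next one starts.
        return n * sum(STAGES_PS.values())

    def pipelined_time(n: int) -> int:
        cycle = max(STAGES_PS.values())        # clock set by the slowest stage
        fill = (len(STAGES_PS) - 1) * cycle    # cycles to fill the pipeline
        return fill + n * cycle                # then one instruction per cycle

    for n in (5, 1_000_000):
        print(f"{n:>9} instructions: speedup "
              f"{nonpipelined_time(n) / pipelined_time(n):.2f}x")
    # Speedup approaches sum/max = 800/200 = 4x, not 5x, because the
    # stages are unbalanced.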

Overlapping instruction execution vastly improves throughput but introduces conflicts when consecutive instructions depend on one another.

Pipeline Hazards and Forwarding

Hazards are execution scenarios where the next planned instruction cannot execute in the subsequent clock cycle.

  • Structural Hazards: The hardware lacks the resources to support simultaneous instructions (e.g., utilizing a single unified memory for both instructions and data requires stalling the fetch stage during a data load).
  • Data Hazards: An instruction depends on data produced by an earlier instruction still traversing the pipeline.
  • Forwarding (Bypassing): Resolves data hazards by routing results directly from the EX/MEM or MEM/WB pipeline registers back to the ALU inputs, rather than waiting for the write-back to the Register File. Both the forwarding conditions and the stall condition below are sketched in code after this list.
  • Hazard Detection Conditions:
    • EX Hazard: Forward from EX/MEM if EX/MEM.RegWrite is asserted and EX/MEM.RegisterRd matches ID/EX.RegisterRs1 or Rs2.
    • MEM Hazard: Forward from MEM/WB if MEM/WB.RegWrite is asserted, MEM/WB.RegisterRd matches source registers, and an EX hazard does not independently govern the same register.
    • Zero register check: Hardware suppresses forwarding when the destination register is x0, preserving its hard-wired zero value.
  • Load-Use Data Hazard: Occurs when a load instruction is followed immediately by an instruction requiring its data. Forwarding cannot resolve this because the data is not fetched until the MEM stage.
  • Pipeline Stalls (Bubbles): To resolve load-use hazards, a Hazard Detection Unit situated in the ID stage stalls the pipeline.
    • Condition: if (ID/EX.MemRead and ((ID/EX.RegisterRd = IF/ID.RegisterRs1) or (ID/EX.RegisterRd = IF/ID.RegisterRs2)))
    • Action: The unit preserves the PC and IF/ID registers and forces the EX, MEM, and WB control signals to 0 (a nop instruction), delaying the dependent instruction by one cycle until forwarding can safely supply the data.
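
The forwarding and stall conditions above translate almost directly into code. The field names mirror the pipeline-register notation in the text (EX/MEM.RegisterRd and so on); the dataclass packaging and function names are illustrative.

    from dataclasses import dataclass

    @dataclass
    class PipeReg:
        reg_write: bool    # RegWrite control bit carried in the register
        rd: int            # destination register number

    def forward_select(rs: int, ex_mem: PipeReg, mem_wb: PipeReg) -> str:
        """Pick the ALU input source for one source register rs."""
        # EX hazard first: the newer result takes precedence, and x0 is
        # never forwarded (it must stay hard-wired to zero).
        if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == rs:
            return "EX/MEM"
        # MEM hazard: reached only when no EX hazard governs this register.
        if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == rs:
            return "MEM/WB"
        return "register file"

    def load_use_stall(id_ex_mem_read: bool, id_ex_rd: int,
                       if_id_rs1: int, if_id_rs2: int) -> bool:
        """Hazard Detection Unit: stall one cycle on a load-use pair."""
        return id_ex_mem_read and id_ex_rd in (if_id_rs1, if_id_rs2)

    # add x5,... then sub x6,x5,x7: forward the ALU result from EX/MEM.
    print(forward_select(5, PipeReg(True, 5), PipeReg(True, 5)))  # EX/MEM
    # ld x5,... then sub x6,x5,x7: forwarding cannot help; stall instead.
    print(load_use_stall(True, 5, 5, 7))                          # True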

While data hazards stall sequential calculations, branching instructions inject uncertainty into the program counter, requiring specialized control hardware to gracefully correct invalidated pipeline states.

Control Hazards and Exceptions

Control hazards occur because the pipeline must fetch the next instruction before the branch outcome, and therefore the next PC, is known.

  • Assume Branch Not Taken: The pipeline defaults to fetching the sequentially incremented PC + 4. If the branch is taken, the hardware must discard the incorrectly fetched instructions.
  • Flushing: Clearing instructions requires changing control values in the pipeline registers to 0. An IF.Flush signal zeroes the instruction field of the IF/ID register, transforming the erroneously fetched instruction into a nop.
  • Branch Optimization: To minimize flush penalties, branch execution (target address calculation and equality testing) is moved earlier, from the EX stage to the ID stage. This reduces the misprediction penalty from multiple cycles down to a single cycle but requires dedicated ID-stage forwarding logic and comparator hardware.
  • Dynamic Branch Prediction: Hardware predictors record the recent behavior of each branch (taken or not taken) to forecast future outcomes; a misprediction stalls the pipeline while the wrongly fetched path is flushed (one standard predictor is sketched after this list).
  • Exceptions: Hardware malfunctions or undefined instructions necessitate immediate pipeline interruption.
    • The address of the faulting instruction is saved to the Supervisor Exception Program Counter (SEPC), and the cause is logged in the SCAUSE register.
    • The pipeline is flushed (using IF.Flush, ID.Flush, and EX.Flush signals) so that the faulting instruction and those after it cannot modify architectural state, and the PC is forced to the operating system’s exception handler address.
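
The text leaves the predictor mechanism open, so as one standard realization, here is a sketch of the classic 2-bit saturating-counter predictor; the initial state and the sample outcome sequence are assumptions for illustration.

    class TwoBitPredictor:
        """States 0-1 predict not taken, 2-3 predict taken."""
        def __init__(self):
            self.state = 1                 # start weakly not-taken (assumed)

        def predict(self) -> bool:
            return self.state >= 2

        def update(self, taken: bool) -> None:
            # Step toward the observed outcome, saturating at 0 and 3, so a
            # single anomaly (e.g., a loop exit) cannot flip the prediction.
            self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

    predictor = TwoBitPredictor()
    outcomes = [True] * 8 + [False] + [True] * 8   # a loop branch with one exit
    correct = 0
    for taken in outcomes:
        correct += predictor.predict() == taken
        predictor.update(taken)
    print(f"{correct}/{len(outcomes)} correct")    # 15/17 on this sequence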

Once fundamental pipeline hazards and exceptions are mitigated, architectural performance boundaries are pushed further by extracting simultaneous parallel execution natively from the instruction stream.

Instruction-Level Parallelism (ILP)

Instruction-Level Parallelism identifies and overlaps independent instructions within a sequential program.

  • Pipeline Depth: Deeper pipelines (e.g., 14 stages) shorten the clock cycle time, allowing more instructions to be in flight simultaneously.
  • Multiple Issue: Duplicating internal datapath components (ALUs, memory ports) allows the processor to launch multiple instructions per clock cycle, driving the Clock Cycles Per Instruction (CPI) below 1.0.
  • Static Multiple Issue: The compiler statically groups instructions into issue packets (e.g., Very Long Instruction Word, VLIW), guaranteeing no dependences within a packet. If no independent instruction is available for a slot, the compiler inserts a nop (a toy two-wide packer is sketched after this list).
  • Code Scheduling and Unrolling: Compilers reorder execution and unroll loops to separate dependent instructions and eliminate branch overhead.
  • Register Renaming: Unrolling uses additional registers to eliminate antidependences (name dependences), in which an ordering between instructions is forced purely by reuse of a register name rather than by actual data flow.
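
As a toy model of static issue packing, the sketch below pairs instructions two at a time, falling back to a nop when the second slot would read the first slot's result. Real VLIW packing also honors structural slot constraints (e.g., one ALU and one memory operation per packet), which is omitted here; the tuple format is invented for illustration.

    def dest(instr):
        return instr[1]                    # (op, rd, rs1, rs2)

    def srcs(instr):
        return instr[2:]

    def pack_dual_issue(stream):
        packets, i = [], 0
        while i < len(stream):
            first = stream[i]
            nxt = stream[i + 1] if i + 1 < len(stream) else None
            # Pair only if the second slot reads nothing the first writes.
            if nxt and dest(first) not in srcs(nxt):
                packets.append((first, nxt))
                i += 2
            else:
                packets.append((first, ("nop", 0, 0, 0)))
                i += 1
        return packets

    program = [("add", 5, 1, 2),   # x5 = x1 + x2
               ("sub", 6, 5, 3),   # reads x5 -> cannot share a packet with add
               ("or",  7, 1, 2)]   # independent -> pairs with sub
    for packet in pack_dual_issue(program):
        print(packet)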

Compilers can only statically deduce so much parallelism; unpredictable runtime events like cache misses require hardware architectures that dynamically reorganize execution flows.

Dynamic Pipeline Scheduling and Superscalar Execution

Dynamic multiple-issue processors (superscalars) utilize hardware to analyze data flow and reorder execution at runtime.

  • Out-of-Order Execution: Instructions are fetched and decoded sequentially but execute dynamically based on data availability, preventing unpredictable stalls (like memory latency) from halting the entire processor pipeline.
  • Reservation Stations: Buffers located at the inputs of functional units that hold pending operations and their operands until all data dependencies are resolved.
  • Reorder Buffer and Commit Unit: Because instructions finish out of sequence, results are temporarily housed in a reorder buffer. The commit unit guarantees in-order commit, writing results to architectural registers or memory strictly in the program’s original sequence (sketched after this list). This provides a precise exception model: if a fault occurs, the processor discards uncommitted speculative results and reverts to the exact architectural state at the faulting instruction.
  • Hardware Speculation: Advanced superscalars combine out-of-order execution with dynamic branch prediction to speculatively execute deep into unverified execution paths, discarding results smoothly via the reorder buffer if the speculation proves incorrect.
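
To make in-order commit concrete, here is a minimal sketch: entries complete out of program order but retire strictly from the head of the buffer. The entry layout and the completion order are illustrative assumptions.

    from collections import deque

    # Reorder buffer entries sit in program order; "done" marks completion.
    rob = deque({"instr": name, "done": False, "value": None}
                for name in ("ld x5", "add x6", "mul x7"))

    def complete(name: str, value: int) -> None:
        """A functional unit finishes: mark the matching entry as done."""
        for entry in rob:
            if entry["instr"] == name:
                entry["done"], entry["value"] = True, value

    def commit() -> None:
        """Retire from the head only, and only while the head is finished."""
        while rob and rob[0]["done"]:
            entry = rob.popleft()
            print("commit", entry["instr"], "=", entry["value"])

    complete("add x6", 42)    # the short-latency add finishes first...
    commit()                  # ...but cannot retire past the unfinished ld
    complete("ld x5", 7)      # the long-latency load finally finishes
    commit()                  # now ld, then add, retire in program order
    complete("mul x7", 99)
    commit()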