RV Datapath
Logic Design
Processor hardware consists of two kinds of elements.
- Combinational elements: Logic blocks (ALUs, adders, multiplexors) whose output depends only on current inputs. No internal storage — same inputs always produce the same output.
- State elements: Memory components (PC, register file, instruction memory, data memory) that store values across clock cycles. Restoring all state elements restores the full machine state.
Clocking methodology: All state element writes occur strictly on the rising clock edge. Combinational logic operates freely between edges. This allows a state element to be read, its value passed through combinational logic, and the result written back to the same element — all within one cycle, without race conditions.
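The read-compute-write pattern described above can be sketched in a few lines of Python. This is an illustrative model only: the two-register state (`a`, `b`) and the function names `combinational` and `rising_edge` are hypothetical and do not come from the text.

```python
def combinational(state):
    """Pure function of the current state: same inputs, same outputs."""
    # Both expressions read the OLD values of a and b.
    return {"a": state["b"], "b": state["a"] + state["b"]}

def rising_edge(state):
    """Model one clock edge: every state element is written together."""
    next_values = combinational(state)  # settles between edges
    return next_values                  # committed only at the edge

state = {"a": 0, "b": 1}
for _ in range(3):
    state = rising_edge(state)
```

Because `combinational` never sees a partially updated state, `b` can be read and rewritten in the same cycle without a race.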
Clock skew: The difference in arrival time of the clock edge at two state elements. The clock period must be padded to account for maximum skew.
Control vs data signals: Data signals carry values being processed (register contents, ALU results, memory data, PC). Control signals tell hardware what to do (RegWrite, MemRead, ALUSrc, etc.) — they are asserted (1) or deasserted (0) based on the current instruction.
Datapath
The datapath is built by asking what hardware each instruction class needs, then combining the pieces with multiplexors to share hardware across instruction types.
Instruction fetch (all instructions):
- PC addresses instruction memory; a hardwired adder computes PC + 4 for the next sequential instruction.
- RISC-V instructions are 32 bits (4 bytes), so sequential instructions are always 4 bytes apart.
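As a rough sketch of the fetch step, the snippet below models instruction memory as a Python dict keyed by byte address. The dict representation, the function name `fetch`, and the two-instruction program are assumptions made for illustration.

```python
INSTRUCTION_BYTES = 4  # every instruction is 32 bits wide

def fetch(pc, instruction_memory):
    """Return the instruction at pc and the next sequential PC."""
    instruction = instruction_memory[pc]
    next_pc = pc + INSTRUCTION_BYTES  # the hardwired PC + 4 adder
    return instruction, next_pc

# Hypothetical program image: add x3, x1, x2 followed by a nop.
instruction_memory = {0: 0x002081B3, 4: 0x00000013}

instruction, next_pc = fetch(0, instruction_memory)
```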
R-type (add, sub, and, or):
- Register file reads two source registers; ALU performs the operation; result writes back to the destination register.
- Two separate memories are required: instruction memory and data memory. A load must fetch its instruction and read its data in the same cycle, so a single-ported unified memory would cause a structural conflict.
Load / Store (ld, sd):
- ImmGen sign-extends the 12-bit offset to 64 bits.
- ALU computes `base register + offset` as the memory address.
- Load reads data memory and writes the result to a register; store reads a second register and writes it to data memory.
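The load/store steps above can be sketched as follows. This is a simplified model under stated assumptions: data memory is a dict holding one whole doubleword per address, and the helper names (`sign_extend`, `effective_address`, `load`, `store`) are hypothetical.

```python
MASK64 = (1 << 64) - 1  # keep addresses within 64 bits

def sign_extend(value, bits=12):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def effective_address(base, imm12):
    """ALU add: base register + sign-extended offset."""
    return (base + sign_extend(imm12)) & MASK64

def load(regs, data_memory, rd, rs1, imm12):
    regs[rd] = data_memory[effective_address(regs[rs1], imm12)]

def store(regs, data_memory, rs2, rs1, imm12):
    data_memory[effective_address(regs[rs1], imm12)] = regs[rs2]

regs = [0] * 32
regs[1], regs[2] = 0x1000, 42
data_memory = {}
store(regs, data_memory, rs2=2, rs1=1, imm12=0xFF8)  # offset -8
load(regs, data_memory, rd=3, rs1=1, imm12=0xFF8)
```

The offset `0xFF8` has its top bit set, so it sign-extends to -8 and both accesses hit address `0xFF8`.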
Branch (beq):
- ALU subtracts the two source registers and asserts a `Zero` signal if they are equal.
- A dedicated adder computes the branch target: `PC + (sign-extended offset << 1)`. The offset is shifted left by 1 because the 12-bit branch immediate counts half-words (2-byte units), which doubles the reach to ±4 KiB compared with a byte offset of the same width.
- A multiplexor selects the next PC: `PCSrc = Branch AND Zero`. If both are asserted, the branch is taken.
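A minimal sketch of the branch logic, assuming the 12-bit immediate has already been reassembled from the instruction fields. The function name `branch_next_pc` and its parameter names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def sign_extend(value, bits=12):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def branch_next_pc(pc, rs1_value, rs2_value, imm12, branch):
    """Return the next PC for a beq-style datapath."""
    zero = ((rs1_value - rs2_value) & MASK64) == 0   # ALU subtract + Zero flag
    target = (pc + (sign_extend(imm12) << 1)) & MASK64  # dedicated adder
    pc_src = branch and zero                          # PCSrc = Branch AND Zero
    return target if pc_src else (pc + 4) & MASK64

taken = branch_next_pc(0x100, 5, 5, 0x004, branch=True)      # forward 8 bytes
not_taken = branch_next_pc(0x100, 5, 6, 0x004, branch=True)  # falls through
backward = branch_next_pc(0x100, 7, 7, 0xFFE, branch=True)   # back 4 bytes
```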
Multiplexors that unify the datapath:
| Mux | Input 0 | Input 1 | Control |
|---|---|---|---|
| ALU second input | Register value | Sign-extended immediate | ALUSrc |
| Register write data | ALU result | Data memory output | MemtoReg |
| Next PC | PC + 4 | Branch target | PCSrc |
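The three multiplexors in the table can be modelled with one two-input selector. The function names `mux2` and `select_paths` and the dictionary keys are illustrative choices, not part of the original design description.

```python
def mux2(input0, input1, select):
    """Two-input multiplexor: select = 0 picks input0, select = 1 picks input1."""
    return input1 if select else input0

def select_paths(ctrl, reg_value, immediate, alu_result, mem_data,
                 pc_plus_4, branch_target):
    """Apply the three datapath muxes for one instruction."""
    return {
        "alu_second_input": mux2(reg_value, immediate, ctrl["ALUSrc"]),
        "register_write_data": mux2(alu_result, mem_data, ctrl["MemtoReg"]),
        "next_pc": mux2(pc_plus_4, branch_target, ctrl["PCSrc"]),
    }

# Control values for a load: use the immediate, write back memory data.
load_ctrl = {"ALUSrc": 1, "MemtoReg": 1, "PCSrc": 0}
paths = select_paths(load_ctrl, reg_value=11, immediate=8, alu_result=0x1008,
                     mem_data=99, pc_plus_4=0x104, branch_target=0x200)
```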
Figure: the complete assembled single-cycle datapath with control signals.

Control Unit
The control unit maps the instruction opcode to the signals that steer the datapath. Control is split into two levels, main control and ALU control, to keep the main control simple.
Main control decodes the 7-bit opcode into a coarse ALUOp plus all non-ALU signals:
| Instruction | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp |
|---|---|---|---|---|---|---|---|
| R-type | 0 | 0 | 1 | 0 | 0 | 0 | 10 |
| ld | 1 | 1 | 1 | 1 | 0 | 0 | 00 |
| sd | 1 | X | 0 | 0 | 1 | 0 | 00 |
| beq | 0 | X | 0 | 0 | 0 | 1 | 01 |
X = don’t care (signal value is irrelevant because the affected unit is not used).
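One way to picture the main control is as a lookup table keyed by opcode. In the sketch below, the signal values are copied from the table above; the 7-bit opcode values are the standard base-ISA encodings, which the text does not state, and the name `main_control` and the use of `None` for don't-care are assumptions.

```python
X = None  # don't care

def signals(alu_src, mem_to_reg, reg_write, mem_read, mem_write, branch, alu_op):
    return {"ALUSrc": alu_src, "MemtoReg": mem_to_reg, "RegWrite": reg_write,
            "MemRead": mem_read, "MemWrite": mem_write, "Branch": branch,
            "ALUOp": alu_op}

CONTROL_TABLE = {
    0b0110011: signals(0, 0, 1, 0, 0, 0, 0b10),  # R-type
    0b0000011: signals(1, 1, 1, 1, 0, 0, 0b00),  # ld
    0b0100011: signals(1, X, 0, 0, 1, 0, 0b00),  # sd
    0b1100011: signals(0, X, 0, 0, 0, 1, 0b01),  # beq
}

def main_control(instruction):
    """Decode the low 7 bits (the opcode) into control signals."""
    return CONTROL_TABLE[instruction & 0x7F]

r_type = main_control(0x002081B3)  # add x3, x1, x2
```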
ALU control refines ALUOp using funct3 and funct7:
| ALUOp | Meaning | ALU action |
|---|---|---|
| 00 | Load / Store | Always add (address calculation) |
| 01 | Branch | Always subtract (equality test) |
| 10 | R-type / I-type | Determined by funct3 / funct7 |
The final 4-bit ALU operation signal:
| ALU control | Operation |
|---|---|
| 0000 | AND |
| 0001 | OR |
| 0010 | Add |
| 0110 | Subtract |
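The two-level decode can be sketched as a small function that combines the tables above. The funct7/funct3 values used for add, sub, and, and or are the standard base-ISA encodings, which the text does not list; the function name `alu_control` is hypothetical, and only the four R-type operations named earlier are covered.

```python
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB = 0b0000, 0b0001, 0b0010, 0b0110

# (funct7, funct3) -> ALU operation for the four R-type instructions covered.
R_TYPE_OPS = {
    (0b0000000, 0b000): ALU_ADD,  # add
    (0b0100000, 0b000): ALU_SUB,  # sub
    (0b0000000, 0b111): ALU_AND,  # and
    (0b0000000, 0b110): ALU_OR,   # or
}

def alu_control(alu_op, funct7=0, funct3=0):
    """Refine the 2-bit ALUOp into the 4-bit ALU operation signal."""
    if alu_op == 0b00:   # load / store: address calculation
        return ALU_ADD
    if alu_op == 0b01:   # branch: equality test
        return ALU_SUB
    if alu_op == 0b10:   # decided by the function fields
        return R_TYPE_OPS[(funct7, funct3)]
    raise ValueError(f"unsupported ALUOp: {alu_op:02b}")
```

For `ALUOp` values 00 and 01 the function fields are ignored, which is what keeps the main control independent of funct3 and funct7.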
Single-Cycle Execution
In a single-cycle implementation every instruction starts and finishes within one clock cycle. CPI is exactly 1, but the clock period is fixed by the slowest instruction — a load, which traverses:
Instruction memory → Register file → ALU → Data memory → Register file
Every other instruction, including a simple add, must wait for this same long cycle. This violates the design principle of making the common case fast: the fast, common instructions are penalised by the slow, less common one. Single-cycle implementations are therefore not used in practice; pipelining overlaps instruction execution to amortise this cost.
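To make the cost concrete, the sketch below adds up per-unit delays along each instruction's path. All latency figures are hypothetical round numbers chosen for illustration, not measurements, and the names `LATENCY_PS`, `PATHS`, and `path_delay` are assumptions.

```python
# Hypothetical unit latencies in picoseconds (illustrative only).
LATENCY_PS = {"imem": 200, "reg_read": 100, "alu": 200,
              "dmem": 200, "reg_write": 100}

# Functional units each instruction class passes through.
PATHS = {
    "R-type": ["imem", "reg_read", "alu", "reg_write"],
    "ld":     ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "sd":     ["imem", "reg_read", "alu", "dmem"],
    "beq":    ["imem", "reg_read", "alu"],
}

def path_delay(instruction_class):
    """Total delay along one instruction class's path."""
    return sum(LATENCY_PS[unit] for unit in PATHS[instruction_class])

# The single-cycle clock must fit the slowest path.
clock_period_ps = max(path_delay(name) for name in PATHS)
wasted_on_r_type_ps = clock_period_ps - path_delay("R-type")
```

Under these assumed numbers the load sets an 800 ps period, so every R-type instruction idles for 200 ps of its cycle.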