Why RISC-V?

RISC-V is an open, modular ISA designed to run across the full computing spectrum — from embedded microcontrollers to supercomputers — without being owned or discontinued by any single company.

Modular vs. incremental design

Conventional architectures grow incrementally: every new processor must implement all past extensions to preserve binary compatibility. By one count, x86 grew from 80 instructions in 1978 to about 1,300 by 2015, roughly three per month; counting distinct machine-code opcodes puts the 2015 total near 3,600.

RISC-V breaks this pattern with a strictly modular approach:

A frozen, minimal base ISA (RV32I) that can run a complete software stack.
Optional standard extensions for specific needs — hardware only includes what it requires.
If software invokes an omitted extension, the hardware traps and a software library handles it.

Design metrics

Seven measures govern RISC-V architectural decisions:

Cost: Chip cost scales non-linearly with die area ($\text{cost} \approx f(\text{die area}^{2})$), so smaller dies improve both cost and yield. A RISC-V core requires roughly half the die area of an equivalent ARM-32 core.
Simplicity: Complex instructions are often ignored by compilers anyway. Simpler ISAs reduce design and verification cost.
Performance: A simpler ISA may need more instructions per program but enables faster clocks and lower CPI.
Implementation isolation: ISA features optimized for one microarchitecture generation must not penalize future ones. Examples of past mistakes: delayed branches (helped 5-stage pipelines, hurt out-of-order cores) and load-multiple (good for single-issue, bad for multi-issue scheduling).
Room for growth: Opcode space must be reserved for future custom accelerators. Exhausting it forces workarounds like separate 16-bit ISAs toggled via address bits.
Program size: Smaller binaries reduce instruction cache misses and DRAM power. Combining 32-bit and 16-bit compressed instructions beats variable-length encodings burdened by legacy prefixes.
Ease of programming and compiling: 32 integer registers simplify register allocation versus 8 or 16. Native PC-relative addressing supports position-independent code and dynamic linking.

Standard extensions

Extension	Name	Purpose
M	Multiply/Divide	Integer multiply and divide; omitted on minimal embedded chips
F	Single-precision FP	IEEE 754 single-precision floating point
D	Double-precision FP	IEEE 754 double-precision floating point
A	Atomic	Load-reserved/store-conditional and AMOs for multiprocessor sync
C	Compressed	16-bit encodings of common 32-bit instructions; ~400 gates of decoder overhead
V	Vector	Dynamic vector length and type per register, replacing fixed SIMD
RV64	64-bit	Widens registers and adds doubleword variants; preserves RV32 structure
Privileged	System	Machine/Supervisor/User modes, hardware paging, OS execution

RV32G (or RV64G) denotes the base integer ISA plus the M, A, F, and D extensions (IMAFD), the standard general-purpose configuration.

Stability

The complete RISC-V specification is ~236 pages. Equivalent incremental architectures require 2,100–2,700 pages. A frozen base ISA paired with openly debated optional extensions keeps the compiler and OS targets stable indefinitely.

RV32I

RV32I is the frozen base integer ISA — 32-bit registers, 32-bit fixed-width instructions, enough to run a complete software stack.

Registers

32 general-purpose 32-bit registers (x0–x31).
x0 is hardwired to zero, so the ISA needs no dedicated move, negate, or compare-with-zero instructions. These are synthesized from standard instructions with x0 as an operand.
The PC is separate from the register file. Keeping it out of the general registers means ordinary arithmetic can never change control flow as a side effect, which keeps branch prediction simple.

Instruction formats

Six fixed 32-bit formats cover all instruction classes:

R-type: Register-to-register operations (two sources, one destination).
I-type: Short immediates and loads.
S-type: Stores.
B-type: Conditional branches — a rotated variant of S-type.
U-type: Long immediates (20-bit upper).
J-type: Unconditional jumps — a rotated variant of U-type.

Key encoding choices:

Fixed register specifiers: rs1, rs2, and rd sit in identical bit positions across all formats, so register reads begin before decoding completes.
Sign bit always at bit 31: Immediate sign extension runs in parallel with decode.
Rotated immediates: Immediate bits are scattered to minimize signal fanout and hardware multiplexing cost — the B and J scrambling looks odd but reduces wiring.
Trap patterns: All-zeros and all-ones are illegal instructions, catching out-of-bounds jumps and unprogrammed memory.
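
The fixed field positions can be illustrated with a small field slicer. This is a sketch in Python rather than a full decoder; the helper names (`sign_extend`, `fields`, `i_imm`) are made up for illustration, and the two sample words were encoded by hand.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def fields(insn):
    """Slice a 32-bit instruction word at the fixed field positions."""
    return {
        "opcode": insn & 0x7F,           # bits 6:0
        "rd": (insn >> 7) & 0x1F,        # bits 11:7
        "funct3": (insn >> 12) & 0x7,    # bits 14:12
        "rs1": (insn >> 15) & 0x1F,      # bits 19:15
        "rs2": (insn >> 20) & 0x1F,      # bits 24:20
        "funct7": (insn >> 25) & 0x7F,   # bits 31:25
    }


def i_imm(insn):
    """I-type immediate: bits 31:20, with the sign taken from bit 31."""
    return sign_extend(insn >> 20, 12)


ADD_X3_X1_X2 = 0x002081B3     # add x3, x1, x2   (R-type)
ADDI_X5_X6_NEG1 = 0xFFF30293  # addi x5, x6, -1  (I-type)
```

Because rd and rs1 are sliced the same way for both words, a register read can start before the format is known.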

Integer computation

Arithmetic: add, sub; addi (no subi — use negative immediate).
Logical: and, or, xor and their immediate forms.
Shifts: sll, srl, sra (logical/arithmetic) with register or immediate shift amount.
Comparison: slt, sltu, slti, sltiu — write 1 to rd if true, 0 otherwise.
Upper immediates: lui loads a 20-bit constant into the upper 20 bits; auipc adds it to the PC. Combining either with a 12-bit immediate instruction synthesizes any 32-bit constant or PC-relative address in two instructions.
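
The two-instruction constant synthesis has one subtlety: addi sign-extends its 12-bit immediate, so the upper part must be adjusted when bit 11 of the constant is set. A minimal Python sketch of the split, with hypothetical helper names `split_constant` and `rebuild`:

```python
def split_constant(value):
    """Split a 32-bit constant into (upper20, lower12) for lui + addi."""
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000          # addi sign-extends, so this half acts as negative
    hi = ((value - lo) >> 12) & 0xFFFFF
    return hi, lo


def rebuild(hi, lo):
    """Model `lui rd, hi` followed by `addi rd, rd, lo` on a 32-bit machine."""
    return ((hi << 12) + lo) & 0xFFFFFFFF
```

For 0xDEADBEEF the split is (0xDEADC, -273): the upper part is rounded up by one so that adding the negative lower part lands on the intended value.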

Multiply, divide, and overflow detection are excluded to keep the minimal hardware footprint small. Overflow is handled in software; multiply/divide live in the M extension.

Loads and stores

Single addressing mode: base register + sign-extended 12-bit immediate.

lw / sw: 32-bit word.
lh / sh: 16-bit halfword; loads sign-extend to 32 bits.
lb / sb: 8-bit byte; loads sign-extend to 32 bits.
lhu / lbu: zero-extending unsigned variants.

Memory is little-endian. Misaligned accesses are permitted, although an implementation may execute them slowly or emulate them through a trap. There are no push/pop instructions; stack operations are just sw/lw with the stack pointer register and displacement addressing.
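
The difference between the sign-extending and zero-extending loads is easiest to see on concrete bytes. The sketch below models a four-byte memory in Python; `MEMORY` and `load` are illustrative names, and the result is the bit pattern a 32-bit register would hold.

```python
MEMORY = bytes([0x80, 0xFF, 0x34, 0x12])   # bytes at addresses 0, 1, 2, 3


def load(addr, size, signed):
    """Model lb/lh/lw (signed=True) and lbu/lhu (signed=False)."""
    raw = MEMORY[addr:addr + size]
    value = int.from_bytes(raw, "little", signed=signed)
    return value & 0xFFFFFFFF              # contents of the 32-bit register


lb = load(0, 1, signed=True)     # 0xFFFFFF80: sign bit of 0x80 is copied upward
lbu = load(0, 1, signed=False)   # 0x00000080
lh = load(0, 2, signed=True)     # 0xFFFFFF80: halfword 0xFF80, little-endian
lhu = load(0, 2, signed=False)   # 0x0000FF80
lw = load(0, 4, signed=True)     # 0x1234FF80
```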

Conditional branches

Branches compare two registers directly — no condition codes. Condition codes create implicit dependencies that stall out-of-order pipelines.

beq, bne: equality.
blt, bge: signed less-than and greater-than-or-equal.
bltu, bgeu: unsigned less-than and greater-than-or-equal.

The remaining comparisons (bgt, ble, bgtu, bleu) are pseudoinstructions that swap the operands ($x > y \iff y < x$). The 12-bit immediate is multiplied by 2, sign-extended, and added to the PC. There are no delayed branches; they were left out to avoid binding the ISA to any particular pipeline depth.
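
As a sketch of how the branch offset is recovered from the scattered B-type immediate, the Python below reassembles the bits and adds the result to the PC. The bit positions follow the B-type layout as I understand it from the RISC-V specification; the function names and the two hand-encoded sample words are illustrative.

```python
def branch_offset(insn):
    """Reassemble the signed byte offset from a B-type instruction word."""
    imm = ((insn >> 31) & 0x1) << 12      # imm[12]   from bit 31 (the sign)
    imm |= ((insn >> 7) & 0x1) << 11      # imm[11]   from bit 7
    imm |= ((insn >> 25) & 0x3F) << 5     # imm[10:5] from bits 30:25
    imm |= ((insn >> 8) & 0xF) << 1       # imm[4:1]  from bits 11:8
    return imm - (1 << 13) if imm & (1 << 12) else imm


def branch_target(pc, insn):
    """Target address of a taken branch on a 32-bit machine."""
    return (pc + branch_offset(insn)) & 0xFFFFFFFF


BEQ_X0_X0_BACK4 = 0xFE000EE3   # beq x0, x0, -4
BEQ_X1_X2_FWD8 = 0x00208463    # beq x1, x2, 8
```

The lowest offset bit is never stored, which is the "multiplied by 2" step: every offset is even, so 12 stored bits plus the sign placement cover a range of about ±4 KiB.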

Unconditional jumps

jal: PC-relative jump using a 20-bit immediate × 2; saves PC+4 to rd (return address).
jalr: Register-indirect jump using base + 12-bit immediate; saves PC+4 to rd.

Setting rd = x0 discards the link, giving a plain jump or subroutine return.

System instructions

CSR instructions (csrrw, csrrs, csrrc + immediate variants): read/write hardware counters — cycle timer, wall-clock time, instruction retirement count.
ecall: Request a service from the OS or execution environment.
ebreak: Transfer control to the debugger.
fence: Order I/O and memory accesses visible to other threads or devices.
fence.i: Flush the instruction pipeline so recent stores are visible to instruction fetch.

The ABI and OS conventions give these instructions their meaning — ecall alone does nothing without a defined calling convention.

Encoding reference

RISC-V Assembly

Calling convention

Function execution follows a fixed lifecycle: place arguments, jump (jal), acquire local storage and save registers, execute, place result and release storage, return (ret).

Registers are partitioned by preservation guarantee:

Temporaries — not preserved across a call: arguments and return values (a0–a7), temporaries (t0–t6), return address (ra).
Saved registers — callee must preserve: s0–s11, stack pointer (sp).
Hardwired zero — x0 always reads as 0.

Stack frame:

Prologue: addi sp, sp, -framesize to allocate; save registers to stack (e.g., sw ra, framesize-4(sp)).
Epilogue: restore registers, addi sp, sp, framesize, then ret.

RV32E: embedded variant that cuts the register file to 16 (x0–x15) to reduce die area.

Assembler directives and pseudoinstructions

Directives control data placement and code structure:

Directive	Effect
`.text`	Subsequent items go into the code section
`.data`	Subsequent items go into initialized data
`.bss`	Subsequent items go into zero-initialized data
`.section .foo`	Subsequent items go into section `.foo`
`.align n`	Align next datum to a $2^{n}$-byte boundary
`.balign n`	Align next datum to an $n$-byte boundary
`.globl sym`	Export `sym` as globally visible
`.string "str"`	Store null-terminated string
`.byte b1,...`	Store 8-bit values
`.half w1,...`	Store 16-bit halfwords
`.word w1,...`	Store 32-bit words
`.dword w1,...`	Store 64-bit doublewords
`.float f1,...`	Store single-precision FP values
`.double d1,...`	Store double-precision FP values
`.option rvc` (`norvc`)	Enable/disable compressed instruction emission
`.option pic` (`nopic`)	Enable/disable position-independent code
`.option relax` (`norelax`)	Enable/disable linker relaxation
`.option push` (`pop`)	Save/restore current option state

Pseudoinstructions map to one or more real instructions:

nop → addi x0, x0, 0
ret → jalr x0, x1, 0
mv rd, rs → addi rd, rs, 0
beqz rs, offset → beq rs, x0, offset
li rd, imm → lui + addi sequence for arbitrary 32-bit constants
la rd, symbol → auipc + addi for PC-relative symbol addresses (auipc + a load from the global offset table in position-independent code)

Memory layout

High addresses
┌─────────────┐
│    Stack    │  grows downward
├─────────────┤
│      ↓      │
│      ↑      │
├─────────────┤
│    Heap     │  grows upward (dynamic allocation)
├─────────────┤
│ Static data │  globals, constants
├─────────────┤
│    Text     │  machine instructions (starts at 0x00010000)
└─────────────┘
Low addresses

Position-independent code (PIC): uses PC-relative addressing (auipc, jalr) so the binary runs correctly regardless of where it is loaded in memory.

ABIs:

ilp32: The C language data types int, long, and pointers are 32 bits; FP arguments pass through integer registers.
ilp32f, ilp32d: single- or double-precision FP arguments pass through dedicated FP registers.

Linking

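Linker relaxation: the linker shortens conservative multi-instruction sequences once final addresses are known. A call emitted as auipc + jalr becomes a single jal when the target lies within jal's ±1 MiB reach, and a two-instruction data address becomes a single instruction when the symbol lies within ±2 KiB of the global pointer (gp) or thread pointer (tp).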
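
The decision a relaxing linker makes is essentially a range check on immediates. The Python sketch below assumes two cases: a call can shrink to a single jal when the PC-relative offset fits jal's 20-bit immediate × 2, and a data access can use a single gp-relative instruction when the symbol fits a signed 12-bit offset from gp. The function names are hypothetical and this is not how any particular linker is implemented.

```python
def fits_signed(value, bits):
    """True if value is representable as a bits-wide two's-complement number."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def call_relaxable(pc, target):
    """auipc + jalr can become one jal if the even offset fits 21 signed bits."""
    offset = target - pc
    return offset % 2 == 0 and fits_signed(offset, 21)


def gp_relaxable(gp, symbol):
    """A data access can become one gp-relative instruction within 12 signed bits."""
    return fits_signed(symbol - gp, 12)
```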

Static linking: all library code is copied into the executable. Wastes memory when multiple programs share the same library, and ties the binary to a fixed library version.

Dynamic linking: libraries are mapped into memory at the moment of first call.

First call hits a 3-instruction stub that invokes the dynamic linker, which maps the function and patches the symbol table pointer.
Subsequent calls jump directly through the updated pointer.
The library exists once in system memory regardless of how many processes use it.
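
The first-call patching described above can be modeled in a few lines of Python. This is only an analogy: a dictionary stands in for the shared library, `LazyTable` is a made-up name for the per-process table of function pointers, and the closure plays the role of the stub.

```python
class LazyTable:
    """Per-process table of function pointers, patched on first use."""

    def __init__(self, library):
        self.library = library      # stands in for the shared library
        self.resolutions = 0        # how often the "dynamic linker" ran
        self.slots = {}

    def _stub(self, name):
        def stub(*args):
            self.resolutions += 1                   # dynamic linker is invoked
            self.slots[name] = self.library[name]   # patch the table entry
            return self.slots[name](*args)
        return stub

    def call(self, name, *args):
        # Unresolved entries start out pointing at the stub.
        target = self.slots.setdefault(name, self._stub(name))
        return target(*args)


table = LazyTable({"square": lambda x: x * x})
first = table.call("square", 3)    # goes through the stub
second = table.call("square", 4)   # jumps through the patched entry
```

After the two calls, `table.resolutions` is 1: only the first call paid the resolution cost.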

Loader: the OS loads the binary into memory, starts the dynamic linker for any unresolved dependencies, and transfers control to the entry point.

RV32M

The M extension adds integer multiply and divide to the base ISA. It is optional — embedded chips that never need it can omit it entirely, with software fallback via trap.

Multiplication

Two 32-bit operands produce a 64-bit product. Rather than write to two destination registers at once, the result is retrieved in two separate instructions:

mul: lower 32 bits of the product (signed or unsigned — same bits either way).
mulh: upper 32 bits, both operands signed.
mulhu: upper 32 bits, both operands unsigned.
mulhsu: upper 32 bits, rs1 signed and rs2 unsigned; used as a substep in multi-word signed multiplication.

Overflow detection:

Unsigned: overflow absent if mulhu result is zero.
Signed: overflow absent if mulh equals the sign extension of mul (0 when the low word is non-negative, 0xFFFFFFFF when it is negative).
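
The four multiply instructions and the two overflow checks can be modeled with Python's arbitrary-precision integers. This is a sketch of the semantics only; register values are passed as 32-bit patterns, and `to_signed`, `unsigned_overflow`, and `signed_overflow` are illustrative helper names.

```python
MASK = 0xFFFFFFFF


def to_signed(x):
    """Reinterpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def mul(a, b):
    return (a * b) & MASK                              # low word, sign-agnostic


def mulh(a, b):
    return ((to_signed(a) * to_signed(b)) >> 32) & MASK


def mulhu(a, b):
    return (((a & MASK) * (b & MASK)) >> 32) & MASK


def mulhsu(a, b):
    return ((to_signed(a) * (b & MASK)) >> 32) & MASK  # rs1 signed, rs2 unsigned


def unsigned_overflow(a, b):
    return mulhu(a, b) != 0


def signed_overflow(a, b):
    expected_high = MASK if mul(a, b) & 0x80000000 else 0
    return mulh(a, b) != expected_high
```

For example, the pattern 0xFFFFFFFF times 2 overflows as an unsigned product but not as a signed one, since it is simply -1 × 2 = -2.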

Division and remainder

div / divu: signed and unsigned quotient.
rem / remu: signed and unsigned remainder.

No hardware trap on divide-by-zero. Software handles it with a beqz check on the divisor before the division instruction.
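
Because there is no trap, the division instructions return fixed results in the corner cases. The Python model below reflects my reading of the RISC-V specification, which the text above does not spell out: division by zero yields an all-ones quotient and leaves the dividend as the remainder, and the signed overflow case (-2^31 ÷ -1) yields the dividend with remainder 0. Signed division rounds toward zero. Helper names are illustrative.

```python
MASK = 0xFFFFFFFF


def to_signed(x):
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def div(a, b):
    """Signed quotient on 32-bit patterns, rounding toward zero."""
    a, b = to_signed(a), to_signed(b)
    if b == 0:
        return MASK                       # all ones, i.e. -1
    if a == -(1 << 31) and b == -1:
        return 0x80000000                 # overflow: result is the dividend
    q = abs(a) // abs(b)
    return (-q if (a < 0) != (b < 0) else q) & MASK


def rem(a, b):
    """Signed remainder; the sign follows the dividend."""
    a, b = to_signed(a), to_signed(b)
    if b == 0:
        return a & MASK                   # dividend is returned unchanged
    r = abs(a) % abs(b)
    return (-r if a < 0 else r) & MASK


def divu(a, b):
    return MASK if (b & MASK) == 0 else (a & MASK) // (b & MASK)
```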

Design notes

Results go directly into general-purpose registers — no dedicated HI/LO registers like MIPS-32. Dedicated registers add architectural state, slow context switches, and require extra move instructions.
Compilers optimize constant division: powers of 2 use shifts (srl for unsigned $\div 2^{i}$); other constants use multiplication by an approximate reciprocal plus correction on the upper half.
ARM-32 had no hardware divide at all until 2005.
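
To make the constant-division note concrete, here is a Python sketch of unsigned division by 3 using only a high multiply and a shift. The magic constant 0xAAAAAAAB is ceil(2^33 / 3), a well-known choice for this divisor; the function names are illustrative, and a real compiler picks the constant and correction per divisor.

```python
MAGIC_DIV3 = 0xAAAAAAAB   # ceil(2**33 / 3)


def mulhu(a, b):
    """Upper 32 bits of the unsigned 64-bit product."""
    return ((a & 0xFFFFFFFF) * (b & 0xFFFFFFFF)) >> 32


def divu_by_3(x):
    """Unsigned x // 3 using mulhu plus a shift instead of divu."""
    return mulhu(x, MAGIC_DIV3) >> 1


def divu_by_pow2(x, i):
    """Unsigned x // 2**i is a logical right shift (srl)."""
    return (x & 0xFFFFFFFF) >> i
```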

RV32F and RV32D

F adds single-precision (32-bit) and D adds double-precision (64-bit) floating-point, both conforming to IEEE 754-2008.

Registers and state

32 dedicated FP registers f0–f31, separate from the integer file. A second register file doubles register capacity and bandwidth without widening the register specifier fields in the instruction.
f0 is a normal read-write register — unlike integer x0, it is not hardwired to zero.
When both F and D are implemented, single-precision operations use the lower 32 bits of the 64-bit f registers.

fcsr (floating-point control and status register):

frm — rounding mode: round-to-nearest-even (default), round-toward-zero, round-down, round-up, round-to-nearest-max-magnitude. Individual instructions can override via a static rounding mode argument.
fflags — five accrued exception flags: Invalid (NV), Divide-by-zero (DZ), Overflow (OF), Underflow (UF), Inexact (NX).
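
A small decoder shows how the two fields share one register. The bit positions and mode encodings used here (frm in bits 7:5, fflags in bits 4:0 with NX as bit 0 up to NV as bit 4) are taken from the RISC-V specification as I recall it, not from the text above, so treat them as an assumption. `decode_fcsr` is a made-up helper name.

```python
ROUNDING_MODES = {
    0: "RNE",  # round to nearest, ties to even
    1: "RTZ",  # round toward zero
    2: "RDN",  # round down
    3: "RUP",  # round up
    4: "RMM",  # round to nearest, ties to max magnitude
}
FLAGS = ["NX", "UF", "OF", "DZ", "NV"]   # bit 0 through bit 4


def decode_fcsr(fcsr):
    """Return (rounding mode name, list of accrued exception flags)."""
    frm = (fcsr >> 5) & 0x7
    raised = [name for bit, name in enumerate(FLAGS) if fcsr & (1 << bit)]
    return ROUNDING_MODES.get(frm, "reserved"), raised
```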

Loads, stores, and register transfers

flw / fsw: 32-bit load/store using base + 12-bit immediate, same addressing as integer.
fld / fsd: 64-bit load/store.
fmv.x.w: copy a single-precision value from f to x register (bit-exact, no conversion).
fmv.w.x: copy from x to f register.

Arithmetic

Standard: fadd, fsub, fmul, fdiv, fsqrt — all with .s (single) and .d (double) suffixes.
fmin / fmax: write the smaller or larger of two operands directly, no branch needed.
Fused multiply-add (R4 format — three sources, one destination):
- fmadd: $rd = rs1 \times rs2 + rs3$
- fmsub: $rd = rs1 \times rs2 - rs3$
- fnmadd: $rd = -(rs1 \times rs2) - rs3$
- fnmsub: $rd = -(rs1 \times rs2) + rs3$
- A single rounding step at the end gives higher precision and speed than a separate multiply followed by add.
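
The effect of the single rounding step can be shown with exact rational arithmetic. In this Python sketch, `fused_multiply_add` is a stand-in for the double-precision fused operation: it computes the exact result with `Fraction` and rounds once when converting back to a float. The inputs are chosen so that the separately rounded product loses the entire answer.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Compute a*b + c exactly, then round to a double once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

separate = a * b + c                  # product rounds to 1.0 first, then adds c
fused = fused_multiply_add(a, b, c)   # exact product is 1 - 2**-60
```

Here `separate` is 0.0 while `fused` is -2^-60, because the exact product 1 - 2^-60 is not representable as a double and is rounded to 1.0 before the addition in the unfused case.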

Comparisons and control flow

No dedicated FP branch instructions. Instead, comparisons write a boolean into an integer register, and standard integer branches act on it:

feq.s/d, flt.s/d, fle.s/d — write 1 or 0 to an x register.

Conversion, sign injection, and classification

Conversion (fcvt family):

Between signed/unsigned 32-bit integers and single/double precision in both directions.
Between single and double precision (fcvt.s.d, fcvt.d.s).

Sign injection — copies a value while manipulating only its sign bit:

fsgnj: take sign from a second source.
fsgnjn: take inverted sign from a second source.
fsgnjx: XOR the sign bits of both sources.
These underpin pseudoinstructions, each passing the same register as both sources: fabs uses fsgnjx ($s \oplus s = 0$ clears the sign), fneg uses fsgnjn, fmv uses fsgnj.
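
Sign injection is pure bit manipulation, which the following Python sketch shows on single-precision bit patterns. `bits` and `value` are hypothetical helpers that convert between a Python float and its 32-bit encoding using `struct`.

```python
import struct

SIGN = 0x80000000
BODY = 0x7FFFFFFF


def bits(x):
    """32-bit IEEE 754 single-precision pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def value(pattern):
    """Float encoded by a 32-bit single-precision pattern."""
    return struct.unpack("<f", struct.pack("<I", pattern))[0]


def fsgnj(a, b):
    return (a & BODY) | (b & SIGN)            # sign copied from b


def fsgnjn(a, b):
    return (a & BODY) | (~b & SIGN)           # inverted sign of b


def fsgnjx(a, b):
    return (a & BODY) | ((a ^ b) & SIGN)      # XOR of both signs


def fmv(a):
    return fsgnj(a, a)


def fneg(a):
    return fsgnjn(a, a)


def fabs(a):
    return fsgnjx(a, a)
```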

Classification (fclass.s/d): writes a 10-bit one-hot mask to an integer register identifying which of the 10 IEEE 754 states the operand is in: $-\infty$, negative normal, negative subnormal, $-0$, $+0$, positive subnormal, positive normal, $+\infty$, signaling NaN, quiet NaN.
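
A sketch of the classification for single precision, in Python, working directly on the 32-bit pattern. It assumes the ten classes are assigned to bits 0 through 9 in the order listed above, which matches my reading of the specification; `fclass_s` is an illustrative name.

```python
def fclass_s(pattern):
    """Return the one-hot class mask for a single-precision bit pattern."""
    sign = (pattern >> 31) & 0x1
    exponent = (pattern >> 23) & 0xFF
    fraction = pattern & 0x7FFFFF

    if exponent == 0xFF:
        if fraction == 0:
            bit = 0 if sign else 7                    # -inf / +inf
        else:
            bit = 9 if fraction & 0x400000 else 8     # quiet / signaling NaN
    elif exponent == 0:
        if fraction == 0:
            bit = 3 if sign else 4                    # -0 / +0
        else:
            bit = 2 if sign else 5                    # negative / positive subnormal
    else:
        bit = 1 if sign else 6                        # negative / positive normal
    return 1 << bit
```

A caller can then test for several classes at once by masking, for example `fclass_s(p) & 0x300` to check for any NaN.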

RV32A: Atomic Instructions

The A extension provides atomic instructions for synchronization in multiprocessor environments. All RV32A instructions require naturally aligned addresses — hardware cannot efficiently guarantee atomicity across cache-line boundaries.

LR/SC

LR/SC splits an atomic operation across two linked instructions. A single compare-and-swap instruction would need three source operands (address, expected value, new value), which would complicate the standard datapath.

lr.w (load reserved): reads a word from memory into a register and places a reservation on that address.
sc.w (store conditional): attempts to write to the reserved address.
- Succeeds: writes the value, sets destination register to 0.
- Fails (reservation broken by another hart): destination register gets a nonzero code; memory is unchanged.

This pair synthesizes any synchronization primitive, including compare-and-swap (CAS).
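
The following Python model sketches how compare-and-swap can be built from the pair. It is a single-threaded simulation, not real concurrency: `Memory` tracks one reservation, and `remote_store` is a hypothetical stand-in for a write by another hart that breaks it.

```python
class Memory:
    """Word-addressed memory with a single LR/SC reservation."""

    def __init__(self):
        self.words = {}
        self.reservation = None

    def lr_w(self, addr):
        self.reservation = addr               # place the reservation
        return self.words.get(addr, 0)

    def sc_w(self, addr, value):
        if self.reservation != addr:
            return 1                          # nonzero: failed, memory unchanged
        self.words[addr] = value
        self.reservation = None
        return 0                              # zero: success

    def remote_store(self, addr, value):
        self.words[addr] = value
        if self.reservation == addr:
            self.reservation = None           # another hart broke the reservation


def compare_and_swap(mem, addr, expected, new):
    """Store `new` only if the word still equals `expected`."""
    while True:
        if mem.lr_w(addr) != expected:
            return False
        if mem.sc_w(addr, new) == 0:
            return True                       # otherwise retry from the load
```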

AMOs

AMOs execute a full read-modify-write as a single indivisible hardware operation — no interrupt or remote modification can occur between the read and the write.

Execution: read current value → apply ALU operation with a source register → write result back → return the original value to the destination register.

Instruction	Operation
`amoswap.w`	Swap
`amoadd.w`	Add
`amoand.w`, `amoor.w`, `amoxor.w`	Bitwise AND, OR, XOR
`amomin.w`, `amomax.w`	Signed min/max
`amominu.w`, `amomaxu.w`	Unsigned min/max

AMOs scale better than LR/SC polling loops in large multiprocessor systems and streamline atomic I/O device communication.

Memory ordering

RISC-V uses a relaxed memory model — harts may observe accesses out of program order. All RV32A instructions carry two annotation bits to enforce ordering at critical points:

aq (acquire): when set, this atomic op is visible before all subsequent memory accesses by this hart.
rl (release): when set, this atomic op is visible after all previous memory accesses by this hart.

Lock acquire sets aq to ensure the lock is held before guarded data is read. Lock release sets rl to ensure all data writes are visible before the lock is relinquished.

RV32C: Compressed Instructions

The C extension maps the most common 32-bit instructions to 16-bit encodings, shrinking binary size without changing the ISA visible to the compiler or programmer.

The assembler transparently picks the 16-bit form whenever possible — the compiler emits normal instructions and never knows.
The hardware decoder expands 16-bit instructions back to their 32-bit equivalents before execution, adding only ~400 gates of overhead.
The two lowest bits of every 32-bit instruction are always 11; any other pattern signals a 16-bit instruction, making 16/32-bit interleaving unambiguous.
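
The length rule makes it possible to walk a mixed instruction stream one halfword at a time. The Python sketch below covers only the 16-bit and 32-bit cases described above and ignores any longer encodings; the sample halfwords (c.nop, the two halves of add x3, x1, x2, and c.jr ra) were encoded by hand.

```python
def instruction_length(first_halfword):
    """Length in bytes, decided by the two lowest bits of the first halfword."""
    return 4 if first_halfword & 0b11 == 0b11 else 2


def instruction_lengths(halfwords):
    """Walk a little-endian stream of 16-bit parcels and list each length."""
    lengths = []
    i = 0
    while i < len(halfwords):
        n = instruction_length(halfwords[i])
        lengths.append(n)
        i += n // 2                # skip the second parcel of a 32-bit instruction
    return lengths


STREAM = [0x0001, 0x81B3, 0x0020, 0x8082]   # c.nop; add x3, x1, x2; c.jr ra
```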

RV64C diverges slightly from RV32C: it drops c.jal (rare in 64-bit code) and the single-precision FP load/stores (c.flw, c.fsw, c.flwsp, c.fswsp), reusing their encodings for c.addiw and the doubleword load/stores (c.ld, c.sd, c.ldsp, c.sdsp), and it adds c.addw and c.subw.

Combined with the 32-bit base, RV32GC produces binaries significantly smaller than architectures with fixed-width encodings.
