Why RISC-V?

RISC-V is an open, modular ISA designed to run across the full computing spectrum — from embedded microcontrollers to supercomputers — without being owned or discontinued by any single company.

Modular vs. incremental design

Conventional architectures grow incrementally: every new processor must implement all past extensions to preserve binary compatibility. By one count of assembly-language instructions, x86 grew from 80 in 1978 to 1,338 by 2015, roughly three per month; counting machine-code opcodes puts the total above 3,600.

RISC-V breaks this pattern with a strictly modular approach:

A frozen, minimal base ISA (RV32I) that can run a complete software stack.
Optional standard extensions for specific needs — hardware only includes what it requires.
If software invokes an omitted extension, the hardware traps and a software library handles it.

Design metrics

Seven measures govern RISC-V architectural decisions:

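Cost: Cost rises non-linearly with die area, roughly with its square ( $\text{cost} \approx f(\text{die area}^{2})$ ), because a larger die means fewer dies per wafer and lower yield. Smaller dies therefore improve both cost and yield. A RISC-V core requires roughly half the die area of an equivalent ARM-32 core.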
Simplicity: Complex instructions are often ignored by compilers anyway. Simpler ISAs reduce design and verification cost.
Performance: A simpler ISA may need more instructions per program but enables faster clocks and lower CPI.
Implementation isolation: ISA features optimized for one microarchitecture generation must not penalize future ones. Examples of past mistakes: delayed branches (helped 5-stage pipelines, hurt out-of-order cores) and load-multiple (good for single-issue, bad for multi-issue scheduling).
Room for growth: Opcode space must be reserved for future custom accelerators. Exhausting it forces workarounds like separate 16-bit ISAs toggled via address bits.
Program size: Smaller binaries reduce instruction cache misses and DRAM power. Combining 32-bit and 16-bit compressed instructions beats variable-length encodings burdened by legacy prefixes.
Ease of programming and compiling: 32 integer registers simplify register allocation versus 8 or 16. Native PC-relative addressing supports position-independent code and dynamic linking.

Standard extensions

Extension	Name	Purpose
M	Multiply/Divide	Integer multiply and divide; omitted on minimal embedded chips
F	Single-precision FP	IEEE 754 single-precision floating point
D	Double-precision FP	IEEE 754 double-precision floating point
A	Atomic	Load-reserved/store-conditional and AMOs for multiprocessor sync
C	Compressed	16-bit encodings of common 32-bit instructions; ~400 gates of decoder overhead
V	Vector	Dynamic vector length and type per register, replacing fixed SIMD
RV64	64-bit	Widens registers and adds doubleword variants; preserves RV32 structure
Privileged	System	Machine/Supervisor/User modes, hardware paging, OS execution

RV32G (or RV64G) denotes the base integer ISA plus the M, A, F, and D extensions (IMAFD), the standard general-purpose configuration.

Stability

The complete RISC-V specification is ~236 pages. Equivalent incremental architectures require 2,100–2,700 pages. A frozen base ISA paired with openly debated optional extensions keeps the compiler and OS targets stable indefinitely.

RV32I

RV32I is the frozen base integer ISA — 32-bit registers, 32-bit fixed-width instructions, enough to run a complete software stack.

Registers

32 general-purpose 32-bit registers (x0–x31).
x0 is hardwired to zero, eliminating the need for dedicated zero-state or unary instructions. Moves and negations are synthesized using standard instructions with x0 as a source.
The PC is separate from the register file. Keeping it out of the general registers means ordinary arithmetic cannot change control flow as a side effect, which simplifies branch prediction.

Instruction formats

Six fixed 32-bit formats cover all instruction classes:

R-type: Register-to-register operations (two sources, one destination).
I-type: Short immediates and loads.
S-type: Stores.
B-type: Conditional branches — a rotated variant of S-type.
U-type: Long immediates (20-bit upper).
J-type: Unconditional jumps — a rotated variant of U-type.

Key encoding choices:

Fixed register specifiers: rs1, rs2, and rd sit in identical bit positions across all formats, so register reads begin before decoding completes.
Sign bit always at bit 31: Immediate sign extension runs in parallel with decode.
Rotated immediates: Immediate bits are scattered to minimize signal fanout and hardware multiplexing cost — the B and J scrambling looks odd but reduces wiring.
Trap patterns: All-zeros and all-ones are illegal instructions, catching out-of-bounds jumps and unprogrammed memory.
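
The rotated B-type immediate can be sketched as a bit-packing routine. This is a minimal illustration, not a full assembler: the function names are hypothetical, and the BRANCH opcode (0x63) and the funct3 value 0 for beq used in the usage note come from the ISA specification rather than from the text above.

```python
def encode_b_type(opcode, funct3, rs1, rs2, offset):
    """Pack a B-type instruction word from its fields and a byte offset."""
    if offset % 2 != 0 or not -4096 <= offset <= 4094:
        raise ValueError("branch offset must be even and fit in 13 signed bits")
    imm = offset & 0x1FFF                      # two's-complement bits 12..0
    return (
        ((imm >> 12) & 0x1) << 31              # imm[12]   -> bit 31 (sign bit)
        | ((imm >> 5) & 0x3F) << 25            # imm[10:5] -> bits 30..25
        | (rs2 & 0x1F) << 20                   # rs2 in its fixed position
        | (rs1 & 0x1F) << 15                   # rs1 in its fixed position
        | (funct3 & 0x7) << 12
        | ((imm >> 1) & 0xF) << 8              # imm[4:1]  -> bits 11..8
        | ((imm >> 11) & 0x1) << 7             # imm[11]   -> bit 7
        | (opcode & 0x7F)
    )


def decode_b_offset(word):
    """Recover the signed byte offset from a B-type instruction word."""
    imm = (
        ((word >> 31) & 0x1) << 12
        | ((word >> 25) & 0x3F) << 5
        | ((word >> 8) & 0xF) << 1
        | ((word >> 7) & 0x1) << 11
    )
    return imm - 0x2000 if imm & 0x1000 else imm
```

For example, `encode_b_type(0x63, 0, 1, 2, 8)` packs `beq x1, x2, 8`. Bit 0 of the offset is never stored, which is why the immediate is described as multiplied by 2.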

Integer computation

Arithmetic: add, sub; addi (no subi — use negative immediate).
Logical: and, or, xor and their immediate forms.
Shifts: sll, srl, sra (logical/arithmetic) with register or immediate shift amount.
Comparison: slt, sltu, slti, sltiu — write 1 to rd if true, 0 otherwise.
Upper immediates: lui loads a 20-bit constant into the upper 20 bits; auipc adds it to the PC. Combining either with a 12-bit immediate instruction synthesizes any 32-bit constant or PC-relative address in two instructions.
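
The two-instruction constant synthesis has one subtlety: addi sign-extends its 12-bit immediate, so the upper part must be rounded up whenever bit 11 of the constant is set. A minimal sketch of that split, with hypothetical helper names:

```python
def split_hi_lo(value):
    """Split a 32-bit constant into (lui immediate, addi immediate)."""
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000                  # addi sees this as a negative immediate
    hi = ((value - lo) >> 12) & 0xFFFFF   # compensates for the borrow
    return hi, lo


def lui_addi(hi, lo):
    """Model `lui rd, hi` followed by `addi rd, rd, lo` on a 32-bit register."""
    return ((hi << 12) + lo) & 0xFFFFFFFF
```

The same split applies to auipc-based sequences, with the PC added to the upper part.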

Multiply, divide, and overflow detection are excluded to keep the minimal hardware footprint small. Overflow is handled in software; multiply/divide live in the M extension.

Loads and stores

Single addressing mode: base register + sign-extended 12-bit immediate.

lw / sw: 32-bit word.
lh / sh: 16-bit halfword; loads sign-extend to 32 bits.
lb / sb: 8-bit byte; loads sign-extend to 32 bits.
lhu / lbu: zero-extending unsigned variants.

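Memory is little-endian. Misaligned accesses are permitted, although an implementation may execute them slowly or trap and emulate them in software. There are no push/pop instructions: stack operations are just sw/lw with the stack pointer register and displacement addressing.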

Conditional branches

Branches compare two registers directly — no condition codes. Condition codes create implicit dependencies that stall out-of-order pipelines.

beq, bne: equality.
blt, bge: signed less-than and greater-than-or-equal.
bltu, bgeu: unsigned less-than and greater-than-or-equal.

Inverse comparisons use swapped operands ( $x > y \iff y < x$ ). The 12-bit immediate is multiplied by 2, sign-extended, and added to the PC. There are no delayed branches; they were left out to avoid binding the ISA to any particular pipeline depth.

Unconditional jumps

jal: PC-relative jump using a 20-bit immediate × 2; saves PC+4 to rd (return address).
jalr: Register-indirect jump using base + 12-bit immediate; saves PC+4 to rd.

Setting rd = x0 discards the link, giving a plain jump or subroutine return.

System instructions

CSR instructions (csrrw, csrrs, csrrc + immediate variants): read/write hardware counters — cycle timer, wall-clock time, instruction retirement count.
ecall: Request a service from the OS or execution environment.
ebreak: Transfer control to the debugger.
fence: Order I/O and memory accesses visible to other threads or devices.
fence.i: Flush the instruction pipeline so recent stores are visible to instruction fetch.

The ABI and OS conventions give these instructions their meaning — ecall alone does nothing without a defined calling convention.

Encoding reference

RISC-V Assembly

Calling convention

Function execution follows a fixed lifecycle: place arguments, jump (jal), acquire local storage and save registers, execute, place result and release storage, return (ret).

Registers are partitioned by preservation guarantee:

Temporaries — not preserved across a call: arguments and return values (a0–a7), temporaries (t0–t6), return address (ra).
Saved registers — callee must preserve: s0–s11, stack pointer (sp).
Hardwired zero — x0 always reads as 0.

Stack frame:

Prologue: addi sp, sp, -framesize to allocate; save registers to stack (e.g., sw ra, framesize-4(sp)).
Epilogue: restore registers, addi sp, sp, framesize, then ret.

RV32E: embedded variant that cuts the register file to 16 (x0–x15) to reduce die area.

Assembler directives

Directives control data placement and code structure:

Directive	Effect
`.text`	Subsequent items go into the code section
`.data`	Subsequent items go into initialized data
`.bss`	Subsequent items go into zero-initialized data
`.section .foo`	Subsequent items go into section `.foo`
`.align n`	Align next datum to a $2^{n}$-byte boundary
`.balign n`	Align next datum to an exact $n$-byte boundary
`.globl sym`	Export `sym` as globally visible
`.string "str"`	Store null-terminated string
`.byte b1,...`	Store 8-bit values
`.half w1,...`	Store 16-bit halfwords
`.word w1,...`	Store 32-bit words
`.dword w1,...`	Store 64-bit doublewords
`.float f1,...`	Store single-precision FP values
`.double d1,...`	Store double-precision FP values
`.option rvc` (`norvc`)	Enable/disable compressed instruction emission
`.option pic` (`nopic`)	Enable/disable position-independent code
`.option relax` (`norelax`)	Enable/disable linker relaxation
`.option push` (`pop`)	Save/restore current option state

Pseudoinstructions

Arithmetic and register — missing operations synthesized using x0 as a zero source or identity immediates

Pseudo	Expands to
`nop`	`addi x0, x0, 0`
`neg rd, rs`	`sub rd, x0, rs`
`negw rd, rs`	`subw rd, x0, rs`
`mv rd, rs`	`addi rd, rs, 0`
`not rd, rs`	`xori rd, rs, -1`
`sext.w rd, rs`	`addiw rd, rs, 0`
`li rd, imm`	`lui`+`addi` sequence (arbitrary 32-bit constant)

Set conditions — hardware only has slt/sltu; all four cases are covered by swapping operands or using x0

Pseudo	Expands to
`seqz rd, rs`	`sltiu rd, rs, 1`
`snez rd, rs`	`sltu rd, x0, rs`
`sltz rd, rs`	`slt rd, rs, x0`
`sgtz rd, rs`	`slt rd, x0, rs`

Branches against zero: for signed ordering, hardware provides only blt/bge; the remaining cases swap operands or use x0

Pseudo	Expands to
`beqz rs, off`	`beq rs, x0, off`
`bnez rs, off`	`bne rs, x0, off`
`blez rs, off`	`bge x0, rs, off`
`bgez rs, off`	`bge rs, x0, off`
`bltz rs, off`	`blt rs, x0, off`
`bgtz rs, off`	`blt x0, rs, off`

Operand-swapped branches — since A > B ≡ B < A

Pseudo	Expands to
`bgt rs, rt, off`	`blt rt, rs, off`
`ble rs, rt, off`	`bge rt, rs, off`
`bgtu rs, rt, off`	`bltu rt, rs, off`
`bleu rs, rt, off`	`bgeu rt, rs, off`

Jumps and calls — omitting rd defaults to x0 (discard link) or x1 (save return address); call/tail use two instructions to reach any 32-bit offset

Pseudo	Expands to
`j off`	`jal x0, off`
`jr rs`	`jalr x0, rs, 0`
`ret`	`jalr x0, x1, 0`
`jal off`	`jal x1, off`
`jalr rs`	`jalr x1, rs, 0`
`call off`	`auipc x1, off[31:12]` + `jalr x1, x1, off[11:0]`
`tail off`	`auipc x6, off[31:12]` + `jalr x0, x6, off[11:0]`

Addressing and memory — all expand to auipc+load/store pairs, since a symbol address requires a 32-bit PC-relative offset that no single instruction can encode

Pseudo	Expands to
`la rd, sym`	`auipc`+`addi`/`lw`/`ld` (PC-relative or GOT)
`lla rd, sym`	`auipc`+`addi` (local PC-relative only)
`l{b|h|w|d} rd, sym`	`auipc` + respective load
`s{b|h|w|d} rs, sym, rt`	`auipc` + respective store
`fl{w|d} rd, sym, rt`	`auipc`+`flw`/`fld`
`fs{w|d} rs, sym, rt`	`auipc`+`fsw`/`fsd`

Floating-point — sign-injection hardware exploited for three common ops

Pseudo	Expands to
`fmv.s/d rd, rs`	`fsgnj.s/d rd, rs, rs`
`fabs.s/d rd, rs`	`fsgnjx.s/d rd, rs, rs`
`fneg.s/d rd, rs`	`fsgnjn.s/d rd, rs, rs`

CSR and counters — all route through csrrs/csrrw/csrrc with x0 to discard or supply a zero source

Pseudo	Expands to
`csrr rd, csr`	`csrrs rd, csr, x0`
`csrw csr, rs`	`csrrw x0, csr, rs`
`csrs csr, rs`	`csrrs x0, csr, rs`
`csrc csr, rs`	`csrrc x0, csr, rs`
`csrwi/csrsi/csrci csr, imm`	immediate variants of above
`rdcycle[h] rd`	`csrrs rd, cycle[h], x0`
`rdtime[h] rd`	`csrrs rd, time[h], x0`
`rdinstret[h] rd`	`csrrs rd, instret[h], x0`
`frcsr rd` / `fscsr rs`	read/write `fcsr` via `csrrs`/`csrrw`
`frrm rd` / `fsrm rs`	read/write `frm` via `csrrs`/`csrrw`
`frflags rd` / `fsflags rs`	read/write `fflags` via `csrrs`/`csrrw`

Miscellaneous — bare fence with no operands is shorthand for the maximally conservative ordering

fence — defaults to fence iorw, iorw (all memory and I/O).

Memory layout

High addresses
┌─────────────┐
│    Stack    │  grows downward
├─────────────┤
│      ↓      │
│      ↑      │
├─────────────┤
│    Heap     │  grows upward (dynamic allocation)
├─────────────┤
│ Static data │  globals, constants
├─────────────┤
│    Text     │  machine instructions (starts at 0x00010000)
└─────────────┘
Low addresses

Position-independent code (PIC): uses PC-relative addressing (auipc, jalr) so the binary runs correctly regardless of where it is loaded in memory.

ABIs:

ilp32: The C language data types int, long, and pointers are 32 bits; FP arguments pass through integer registers.
ilp32f, ilp32d: single- or double-precision FP arguments pass through dedicated FP registers.

Linking

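Linker relaxation: the compiler conservatively emits two-instruction sequences, and the linker shortens them once final addresses are known. A call (auipc + jalr) becomes a single jal when the target is within ±1 MiB of the call site, and a data access drops its lui/auipc when the address is within ±2 KiB of the global pointer (gp) or thread pointer (tp).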

Static linking: all library code is copied into the executable. Wastes memory when multiple programs share the same library, and ties the binary to a fixed library version.

Dynamic linking: libraries are mapped into memory at the moment of first call.

First call hits a 3-instruction stub that invokes the dynamic linker, which maps the function and patches the symbol table pointer.
Subsequent calls jump directly through the updated pointer.
The library exists once in system memory regardless of how many processes use it.

Loader: the OS loads the binary into memory, starts the dynamic linker for any unresolved dependencies, and transfers control to the entry point.

RV32M

The M extension adds integer multiply and divide to the base ISA. It is optional — embedded chips that never need it can omit it entirely, with software fallback via trap.

Multiplication

Two 32-bit operands produce a 64-bit product. Rather than write to two destination registers at once, the result is retrieved in two separate instructions:

mul: lower 32 bits of the product (signed or unsigned — same bits either way).
mulh: upper 32 bits, both operands signed.
mulhu: upper 32 bits, both operands unsigned.
mulhsu: upper 32 bits, one signed and one unsigned — used as a substep in multi-word signed multiplication.

Overflow detection:

Unsigned: overflow absent if mulhu result is zero.
Signed: overflow absent if all bits of mulh match the sign bit of mul (0 for positive, 0xFFFFFFFF for negative).
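
The four multiply instructions and the two overflow tests can be modelled on 32-bit values as follows. This is an illustrative sketch using Python's arbitrary-precision integers; the helper names `to_signed`, `unsigned_overflow`, and `signed_overflow` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mul(a, b):
    return (a * b) & MASK32                       # low half, sign-agnostic


def mulh(a, b):
    return ((to_signed(a) * to_signed(b)) >> 32) & MASK32


def mulhu(a, b):
    return (((a & MASK32) * (b & MASK32)) >> 32) & MASK32


def mulhsu(a, b):
    return ((to_signed(a) * (b & MASK32)) >> 32) & MASK32


def unsigned_overflow(a, b):
    """The product needs more than 32 bits when the upper half is nonzero."""
    return mulhu(a, b) != 0


def signed_overflow(a, b):
    """The upper half must be a pure copy of the lower half's sign bit."""
    sign_fill = MASK32 if mul(a, b) & 0x80000000 else 0
    return mulh(a, b) != sign_fill
```

For instance, 0xFFFFFFFF times 2 overflows as an unsigned product but not as a signed one, where it is simply -1 times 2.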

Division and remainder

div / divu: signed and unsigned quotient.
rem / remu: signed and unsigned remainder.

No hardware trap on divide-by-zero. Software handles it with a beqz check on the divisor before the division instruction.

Design notes

Results go directly into general-purpose registers — no dedicated HI/LO registers like MIPS-32. Dedicated registers add architectural state, slow context switches, and require extra move instructions.
Compilers optimize constant division: powers of 2 use shifts (srl for unsigned $\div 2^{i}$ ); other constants use multiplication by an approximate reciprocal plus correction on the upper half.
ARM-32 had no hardware divide at all until 2005.

RV32F and RV32D

F adds single-precision (32-bit) and D adds double-precision (64-bit) floating-point, both conforming to IEEE 754-2008.

Registers and state

32 dedicated FP registers f0–f31, separate from the integer file. A separate file doubles the register count and bandwidth without widening the register specifier fields in the instruction format.
f0 is a normal read-write register — unlike integer x0, it is not hardwired to zero.
When both F and D are implemented, single-precision operations use the lower 32 bits of the 64-bit f registers.

fcsr (floating-point control and status register):

frm — rounding mode: round-to-nearest-even (default), round-toward-zero, round-down, round-up, round-to-nearest-max-magnitude. Individual instructions can override via a static rounding mode argument.
fflags — five accrued exception flags: Invalid (NV), Divide-by-zero (DZ), Overflow (OF), Underflow (UF), Inexact (NX).

Loads, stores, and register transfers

flw / fsw: 32-bit load/store using base + 12-bit immediate, same addressing as integer.
fld / fsd: 64-bit load/store.
fmv.x.w: copy a single-precision value from f to x register (bit-exact, no conversion).
fmv.w.x: copy from x to f register.

Arithmetic

Standard: fadd, fsub, fmul, fdiv, fsqrt — all with .s (single) and .d (double) suffixes.
fmin / fmax: write the smaller or larger of two operands directly, no branch needed.
Fused multiply-add (R4 format — three sources, one destination):
- fmadd: $rd = rs1 \times rs2 + rs3$
- fmsub: $rd = rs1 \times rs2 - rs3$
- fnmsub: $rd = -(rs1 \times rs2) + rs3$
- fnmadd: $rd = -(rs1 \times rs2) - rs3$
- A single rounding step at the end gives higher precision and speed than a separate multiply followed by add.

Comparisons and control flow

No dedicated FP branch instructions. Instead, comparisons write a boolean into an integer register, and standard integer branches act on it:

feq.s/d, flt.s/d, fle.s/d — write 1 or 0 to an x register.

Conversion, sign injection, and classification

Conversion (fcvt family):

Between signed/unsigned 32-bit integers and single/double precision in both directions.
Between single and double precision (fcvt.s.d, fcvt.d.s).

Sign injection — copies a value while manipulating only its sign bit:

fsgnj: take sign from a second source.
fsgnjn: take inverted sign from a second source.
fsgnjx: XOR the sign bits of both sources.
These underpin pseudoinstructions: fabs uses fsgnjx ( $s \oplus s = 0$ ), fneg uses fsgnjn, fmv uses fsgnj.
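
Sign injection is easiest to see on raw bit patterns. The sketch below models the three single-precision instructions on 32-bit integers; the `bits` and `value` helpers are hypothetical conveniences for converting to and from Python floats.

```python
import struct

SIGN = 0x80000000


def bits(x):
    """IEEE 754 single-precision bit pattern of a Python float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def value(b):
    """Python float for a single-precision bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def fsgnj(a, b):
    return (a & 0x7FFFFFFF) | (b & SIGN)       # sign taken from b


def fsgnjn(a, b):
    return (a & 0x7FFFFFFF) | (~b & SIGN)      # inverted sign of b


def fsgnjx(a, b):
    return a ^ (b & SIGN)                      # XOR of both signs


def fmv(a):
    return fsgnj(a, a)


def fneg(a):
    return fsgnjn(a, a)


def fabs(a):
    return fsgnjx(a, a)
```

Because only bit 31 is touched, these operations never raise floating-point exceptions and work on NaNs and infinities as well.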

Classification (fclass.s/d): writes a 10-bit one-hot mask to an integer register identifying which of ten classes the operand belongs to: $-\infty$, negative normal, negative subnormal, $-0$, $+0$, positive subnormal, positive normal, $+\infty$, signaling NaN, quiet NaN.
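
A minimal model of fclass.s on a single-precision bit pattern, assuming the mask bits are numbered 0 to 9 in the order the classes are listed above (this matches the ISA specification as far as I know). The function name is hypothetical.

```python
def fclass_s(bits):
    """Return the one-hot class mask for a 32-bit single-precision pattern."""
    sign = (bits >> 31) & 1
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0xFF:
        if frac == 0:
            bit = 0 if sign else 7                 # -inf / +inf
        else:
            bit = 9 if frac & 0x400000 else 8      # quiet / signaling NaN
    elif exp == 0:
        if frac == 0:
            bit = 3 if sign else 4                 # -0 / +0
        else:
            bit = 2 if sign else 5                 # subnormals
    else:
        bit = 1 if sign else 6                     # normals
    return 1 << bit
```

A single `andi` on the result can then test for a group of classes at once, for example any NaN with mask `0x300`.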

RV32A

The A extension provides atomic instructions for synchronization in multiprocessor environments. All RV32A instructions require naturally aligned addresses — hardware cannot efficiently guarantee atomicity across cache-line boundaries.

LR/SC

LR/SC splits an atomic operation across two linked instructions. A single compare-and-swap instruction would need three source registers, which would complicate the standard two-source datapath.

lr.w (load reserved): reads a word from memory into a register and places a reservation on that address.
sc.w (store conditional): attempts to write to the reserved address.
- Succeeds: writes the value, sets destination register to 0.
- Fails (reservation broken by another hart): destination register gets a nonzero code; memory is unchanged.

This pair synthesizes any synchronization primitive, including compare-and-swap (CAS).
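
How compare-and-swap falls out of LR/SC can be sketched with a toy, single-threaded memory model. The `Memory` class, its reservation table, and `compare_and_swap` are hypothetical names for illustration; real hardware tracks reservations per hart and may break them for other reasons as well.

```python
class Memory:
    def __init__(self):
        self.words = {}
        self.reservations = {}            # hart id -> reserved address

    def store_w(self, addr, value):
        """Plain store: breaks every reservation held on this address."""
        self.words[addr] = value
        self.reservations = {
            h: a for h, a in self.reservations.items() if a != addr
        }

    def lr_w(self, hart, addr):
        """Load the word and register a reservation for this hart."""
        self.reservations[hart] = addr
        return self.words.get(addr, 0)

    def sc_w(self, hart, addr, value):
        """Store only if the reservation survived; 0 means success."""
        if self.reservations.pop(hart, None) != addr:
            return 1                      # nonzero failure code, memory unchanged
        self.store_w(addr, value)
        return 0


def compare_and_swap(mem, hart, addr, expected, new):
    """Retry loop: lr.w, compare, sc.w, repeat if the reservation was lost."""
    while True:
        if mem.lr_w(hart, addr) != expected:
            return False
        if mem.sc_w(hart, addr, new) == 0:
            return True
```

The retry loop is the software cost of LR/SC: a failed sc.w writes nothing, so the sequence simply starts again from lr.w.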

AMOs

AMOs execute a full read-modify-write as a single indivisible hardware operation — no interrupt or remote modification can occur between the read and the write.

Execution: read current value → apply ALU operation with a source register → write result back → return the original value to the destination register.

Instruction	Operation
`amoswap.w`	Swap
`amoadd.w`	Add
`amoand.w`, `amoor.w`, `amoxor.w`	Bitwise AND, OR, XOR
`amomin.w`, `amomax.w`	Signed min/max
`amominu.w`, `amomaxu.w`	Unsigned min/max

AMOs scale better than LR/SC polling loops in large multiprocessor systems and streamline atomic I/O device communication.
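
The read-modify-write-return-old behaviour can be sketched as one helper, with a test-and-set lock built on amoswap.w as a usage example. The function names and the dict-based memory are assumptions for illustration; atomicity itself is not modelled, since a single Python call stands in for the indivisible hardware operation.

```python
MASK32 = 0xFFFFFFFF


def amo(memory, addr, src, op):
    """Read the old value, write op(old, src), return the old value."""
    old = memory[addr]
    memory[addr] = op(old, src) & MASK32
    return old


def amoadd_w(memory, addr, src):
    return amo(memory, addr, src, lambda old, s: old + s)


def amoswap_w(memory, addr, src):
    return amo(memory, addr, src, lambda old, s: s)


def try_lock(memory, addr):
    """Swap in 1; the lock was free only if the old value was 0."""
    return amoswap_w(memory, addr, 1) == 0


def unlock(memory, addr):
    amoswap_w(memory, addr, 0)
```

In real code the acquiring swap would carry the aq bit and the releasing swap the rl bit.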

Memory ordering

RISC-V uses a relaxed memory model — harts may observe accesses out of program order. All RV32A instructions carry two annotation bits to enforce ordering at critical points:

aq (acquire): when set, this atomic op is visible before all subsequent memory accesses by this hart.
rl (release): when set, this atomic op is visible after all previous memory accesses by this hart.

Lock acquire sets aq to ensure the lock is held before guarded data is read. Lock release sets rl to ensure all data writes are visible before the lock is relinquished.

RV32C: Compressed Instructions

The C extension maps the most common 32-bit instructions to 16-bit encodings, shrinking binary size without changing the ISA visible to the compiler or programmer.

The assembler transparently picks the 16-bit form whenever possible — the compiler emits normal instructions and never knows.
The hardware decoder expands 16-bit instructions back to their 32-bit equivalents before execution, adding only ~400 gates of overhead.
The two lowest bits of every 32-bit instruction are always 11; any other pattern signals a 16-bit instruction, making 16/32-bit interleaving unambiguous.
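
The length rule makes instruction-boundary detection a two-bit test. A minimal sketch, assuming only 16-bit and 32-bit encodings are present (longer reserved encodings are ignored) and using hypothetical function names:

```python
def instruction_length(first_halfword):
    """Return the instruction size in bytes from its first 16 bits."""
    return 4 if (first_halfword & 0b11) == 0b11 else 2


def split_instructions(code):
    """Split little-endian machine code into individual instruction words."""
    words, pc = [], 0
    while pc < len(code):
        half = int.from_bytes(code[pc:pc + 2], "little")
        size = instruction_length(half)
        words.append(int.from_bytes(code[pc:pc + size], "little"))
        pc += size
    return words
```

For example, the byte stream `01 00 13 00 00 00 01 00` splits into a 16-bit `c.nop` (0x0001), a 32-bit `nop` (0x00000013), and another `c.nop`.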

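RV64C diverges slightly from RV32C: it drops c.jal (rare in 64-bit code) and the compressed single-precision FP load/stores (c.flw, c.fsw, c.flwsp, c.fswsp), reusing their encodings for c.addiw and the doubleword load/stores (c.ld, c.sd, c.ldsp, c.sdsp), and it adds c.addw and c.subw. The integer word load/stores (c.lw, c.sw) remain available.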

Combined with the 32-bit base, RV32GC produces binaries significantly smaller than architectures with fixed-width encodings.

RV32V

The V extension replaces SIMD with a true vector model: vector length and element width are decoupled from the instruction encoding, so a single binary scales across hardware implementations without recompilation.

Registers and dynamic typing

32 vector registers v0–v31; data type and element width are set per-register, not per-opcode.
vsetdcfg: configures which registers are active and their types — X8/16/32/64, X8U/..., F16/F32/F64.
Maximum vector length (mvl): computed at runtime from total vector SRAM divided by the active element types. Disabling unused registers reallocates their memory, increasing mvl for active ones.
Only enabled registers are saved/restored on context switch — unused registers cost nothing.

Computation instructions

Arithmetic, logical, and FP operations from the base ISAs carry over with operand-type suffixes:
- .vv — both operands are vector registers.
- .vs — vector op scalar (x or f register).
- .sv — scalar op vector (for asymmetric ops like Y = a - X).
Fused multiply-add uses three-source suffixes: .vvv, .vvs, .vsv, .vss.

Loads and stores

Mode	Instructions	Use case
Sequential	`vld` / `vst`	Contiguous arrays; 7-bit immediate offset scaled by element size
Strided	`vlds` / `vsts`	Multi-dimensional arrays; base register + byte-stride register
Indexed (gather/scatter)	`vldx` / `vstx`	Sparse arrays; base register + vector of byte offsets

Vector length and loop control

setvl: sets vl = min(requested, mvl). Handles arrays of any length, including zero and lengths that are not a multiple of mvl, without separate fringe logic, so the strip-mining bookkeeping that SIMD loops need disappears.
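
The loop structure that setvl enables can be sketched in plain Python, with list slices standing in for vector registers. The `daxpy` function and the default `mvl=4` are assumptions for illustration; the point is that the same loop body handles full chunks, the final partial chunk, and the empty case.

```python
def setvl(requested, mvl):
    """Model of setvl: grant as many elements as the hardware can hold."""
    return min(requested, mvl)


def daxpy(a, x, y, mvl=4):
    """Compute y = a*x + y in chunks of vl elements; return the vl sequence."""
    n, i, granted = len(x), 0, []
    while n > 0:
        vl = setvl(n, mvl)
        # One "vector instruction" processes vl elements at once.
        y[i:i + vl] = [a * xi + yi for xi, yi in zip(x[i:i + vl], y[i:i + vl])]
        granted.append(vl)
        i += vl
        n -= vl
    return granted
```

With ten elements and `mvl=4` the loop runs three times with vl of 4, 4, and 2; on hardware with a larger mvl the same code simply iterates fewer times.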

Conditional execution

8 predicate registers vp0–vp7, each holding mvl bits. A 1 bit allows the element to be written; 0 leaves it unchanged.
Comparison instructions (vplt, vpeq, etc.) populate a predicate register from a vector condition.
Operations specify vp0 or vp1 as their mask; vpswap moves another predicate into the active slot.
Predicate logic: vpand, vpor, vpxor, vpnot for compound conditions.

Permutation and reduction

vselect: gather elements from a source vector using indices from a second vector.
vmerge: merge elements from two vectors based on a predicate mask.
vextract: copy a subset of elements from a calculated offset to the start of a destination vector; enables binary-halving reductions (e.g., sum all elements by iteratively halving until vl = 1).

Vector vs. SIMD

	Vector (RV32V)	SIMD (x86 AVX / ARM NEON)
Code size	No strip-mining; `setvl` handles all lengths	Requires loop bookkeeping and fringe handling
Instruction count	10–20× fewer dynamic instructions	Short registers force many iterations
ISA stability	Static; `mvl` expands transparently with wider hardware	Hundreds of new opcodes whenever register width increases

RV64

RV64 widens all registers (including the PC) to 64 bits and adds a minimal set of word/doubleword variants to the RV32 base. The architecture structure is preserved — no overhaul required.

RV64I additions

Word arithmetic: addw, addiw, subw — compute in 32 bits, sign-extend result to 64 bits.
Word shifts: sllw, slliw, srlw, srliw, sraw, sraiw — explicit 32-bit shift results.
Doubleword memory: ld / sd — transfer 8 bytes at a time.
Unsigned word load: lwu zero-extends a 32-bit load to 64 bits; the existing lw sign-extends.
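
The difference between the word and doubleword forms shows up only in the upper 32 bits of the result. A minimal sketch on 64-bit register values, with hypothetical helper names:

```python
MASK64 = (1 << 64) - 1


def sext32(x):
    """Sign-extend the low 32 bits of x to a 64-bit register value."""
    x &= 0xFFFFFFFF
    return x | 0xFFFFFFFF00000000 if x & 0x80000000 else x


def add(a, b):
    return (a + b) & MASK64               # full 64-bit add


def addw(a, b):
    return sext32(a + b)                  # 32-bit add, result sign-extended


def lw_value(word):
    return sext32(word)                   # lw sign-extends the loaded word


def lwu_value(word):
    return word & 0xFFFFFFFF              # lwu zero-extends it
```

Adding 1 to 0x7FFFFFFF gives 0x80000000 with `add` but 0xFFFFFFFF80000000 with `addw`, which is exactly the wraparound that C's 32-bit int arithmetic expects.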

Extension adaptations

RV64M: adds word multiply/divide/remainder variants — mulw, divw, divuw, remw, remuw.
RV64A: adds doubleword variants for all 11 atomics — amoadd.d, lr.d, sc.d, etc.
RV64F/D: adds long (64-bit integer) conversion instructions — fcvt.l.s, fcvt.lu.d, etc. RV64D also adds fmv.x.d and fmv.d.x for direct 64-bit moves between integer and FP registers.
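RV64C: drops c.jal and the compressed single-precision FP load/stores (c.flw, c.fsw, c.flwsp, c.fswsp); their encodings are reused for c.addiw, c.ld, c.sd, c.ldsp, and c.sdsp, and c.addw and c.subw are added. The integer word load/stores (c.lw, c.sw) remain.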

ABIs

lp64: The C data types long and pointers are 64 bits; int stays 32 bits; FP args use integer registers.
lp64f, lp64d: single- or double-precision FP arguments pass through FP registers.

Code density

RV64GC is only 1% larger than RV32GC. Compared to other 64-bit ISAs:

23% smaller than ARM-64 (which dropped Thumb-2 compressed format entirely).
34% smaller than x86-64 (which burns bytes on legacy prefix encoding).

Comparison to other 64-bit ISAs

x86-64: extended x86-32 by doubling registers and adding PC-relative data addressing, but required prefix bytes to fit new operations into an already-full opcode space — inflating average instruction length.
ARM-64: invented a brand-new 1000+ instruction ISA rather than extending ARM-32. Gained 31 registers and a hardwired zero, but dropped Thumb-2, making ARM-64 code 25% larger than ARM Thumb-2.
RISC-V: RV32 and RV64 were engineered simultaneously, so 64-bit instructions were never forced into a cramped 32-bit opcode space. RV64I retains virtually all RV32I instructions, keeping the compiler transition simple.

RV Privileged Architecture

Three privilege levels isolate hardware access, OS execution, and application code.

Mode	Level	Purpose
Machine (M)	Highest, mandatory	Full hardware access; bootstraps and controls the system
Supervisor (S)	Optional	OS execution; virtual memory and multitasking
User (U)	Lowest	Untrusted applications; restricted CSR and memory access

Machine mode (M-mode)

The most important feature of M-mode is the ability to intercept and handle exceptions — unusual runtime events. All exceptions are precise: instructions before the faulting instruction complete; the faulting instruction and those after do not.

Synchronous exceptions: caused directly by instruction execution — the faulting instruction itself triggers the event (e.g. illegal opcode, misaligned access, ecall).
Asynchronous interrupts: caused by external events independent of the instruction stream — handling and masking are uniform across all RISC-V systems, though memory maps and interrupt controller mechanisms vary per platform.

Exception-handling CSRs: eight registers collectively capture the full exception state.

CSR	Purpose
`mtvec`	Handler base address; `MODE=1` vectorizes async interrupts to `BASE + 4×cause`
`mepc`	PC saved on exception entry (sync: faulting instruction; async: resume point)
`mcause`	Exception cause; MSB=1 for interrupts, 0 for synchronous exceptions
`mtval`	Faulting address or exception-specific data (e.g. illegal instruction bits)
`mstatus`	Global interrupt enable (`MIE`), previous privilege (`MPP`), previous `MIE` (`MPIE`)
`mie` / `mip`	Per-source interrupt enable / pending bits; bit positions match `mcause` codes
`mscratch`	Scratch register; software points it to an in-memory context-save area

Bit layout of the mie and mip CSRs — each bit position corresponds to a mcause interrupt code.

mcause encoding — MSB=1 for interrupts, 0 for synchronous exceptions; lower bits identify the specific cause. Supervisor interrupts and page-fault exceptions only exist when S-mode is implemented.

Interrupt (MSB)	Code	Description
1	1	Supervisor software interrupt
1	3	Machine software interrupt
1	5	Supervisor timer interrupt
1	7	Machine timer interrupt
1	9	Supervisor external interrupt
1	11	Machine external interrupt
0	0	Instruction address misaligned
0	1	Instruction access fault
0	2	Illegal instruction
0	3	Breakpoint
0	4	Load address misaligned
0	5	Load access fault
0	6	Store address misaligned
0	7	Store access fault
0	8	Environment call from U-mode
0	9	Environment call from S-mode
0	11	Environment call from M-mode
0	12	Instruction page fault
0	13	Load page fault
0	15	Store page fault

Interrupt gating — an interrupt is taken only when all three hold simultaneously: mstatus.MIE=1 (global), the per-source mie bit is set (enabled), and the corresponding mip bit is set (pending). Timer interrupt example: mstatus.MIE=1, mie[7]=1, mip[7]=1.
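
The gating condition and the mcause split can be written as two small predicates. This sketch assumes execution in M-mode and that mstatus.MIE is bit 3 of mstatus (per the privileged specification); the function names are hypothetical.

```python
MSTATUS_MIE = 1 << 3                      # global machine interrupt enable


def interrupt_taken(mstatus, mie, mip):
    """True when some interrupt is globally enabled, enabled, and pending."""
    return bool(mstatus & MSTATUS_MIE) and (mie & mip) != 0


def decode_mcause(mcause, xlen=32):
    """Split mcause into (is_interrupt, code) using the MSB as the flag."""
    is_interrupt = bool((mcause >> (xlen - 1)) & 1)
    code = mcause & ((1 << (xlen - 1)) - 1)
    return is_interrupt, code
```

For the timer example, `interrupt_taken(0x8, 1 << 7, 1 << 7)` holds, and the handler would then see `mcause` decode to interrupt code 7.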

On exception — hardware atomically performs these steps:

mepc ← PC, then PC ← mtvec.
mcause ← exception cause; mtval ← faulting address or exception-specific data.
MPIE ← MIE, then MIE ← 0 (disables further interrupts).
MPP ← current privilege mode, then privilege elevated to M-mode.

Handler prologue/epilogue — on entry, swap a register (e.g. a0) with mscratch to get a pointer to scratch space, then save all registers the body will use. On exit, restore those registers, swap a0/mscratch again, then execute mret. For a preemptible handler, also save mepc/mcause/mtval/mstatus to the stack before re-enabling interrupts — a nested exception would overwrite them.

On mret — reverses exception entry:

PC ← mepc.
MIE ← MPIE (restores interrupt enable).
Privilege mode ← MPP.
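
Exception entry and mret mirror each other, which a small state model makes visible. This is a sketch under stated assumptions: `mtvec` is in direct mode, the privilege encodings (U=0, S=1, M=3) come from the privileged specification, and the `Hart` class and its field names are hypothetical. It models only the steps listed above and omits other mret side effects.

```python
from dataclasses import dataclass

U_MODE, S_MODE, M_MODE = 0, 1, 3


@dataclass
class Hart:
    pc: int = 0
    priv: int = U_MODE
    mstatus_mie: bool = True
    mstatus_mpie: bool = False
    mstatus_mpp: int = U_MODE
    mtvec: int = 0x100
    mepc: int = 0
    mcause: int = 0
    mtval: int = 0


def take_trap(hart, cause, tval=0):
    """Model the atomic hardware steps on exception entry."""
    hart.mepc, hart.pc = hart.pc, hart.mtvec
    hart.mcause, hart.mtval = cause, tval
    hart.mstatus_mpie, hart.mstatus_mie = hart.mstatus_mie, False
    hart.mstatus_mpp, hart.priv = hart.priv, M_MODE


def mret(hart):
    """Model mret: undo the entry steps."""
    hart.pc = hart.mepc
    hart.mstatus_mie = hart.mstatus_mpie
    hart.priv = hart.mstatus_mpp
```

A handler for a synchronous exception such as ecall would typically advance `mepc` by 4 before `mret`, since `mepc` points at the trapping instruction itself.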

wfi — informs the processor there is no useful work; it enters a low-power state until (mie & mip) ≠ 0. Typically used inside a loop. If MIE=0, a pending interrupt causes execution to resume at the next instruction rather than jump to mtvec.

User mode (U-mode) and physical memory protection

M-mode is sufficient for simple embedded systems where the entire codebase is trusted, but most systems cannot trust all application code. U-mode restricts untrusted code from executing privileged instructions (e.g. mret) or accessing privileged CSRs (e.g. mstatus) — any such attempt raises an illegal instruction exception. M-mode enters U-mode by setting mstatus.MPP=0 then executing mret; any exception in U-mode returns control to M-mode.

Physical Memory Protection (PMP) restricts which memory addresses U-mode can access. On each U-mode fetch, load, or store, the address is compared against all PMP address registers (pmpaddr0–pmpaddrN); the matching entry’s configuration register decides whether the access proceeds or raises an access exception.

Address registers (pmpaddr0–pmpaddrN): stored shifted right by 2 bits (4-byte granularity).
Configuration registers (pmpcfg): densely packed to accelerate context switching.
- R / W / X: permit loads, stores, and instruction fetches respectively.
- A: address-matching mode; 0 disables this PMP entry, and any nonzero mode enables it.
- L: locks the entry until the next reset.

Supervisor mode (S-mode) and delegation

By default, all exceptions regardless of privilege mode transfer control to the M-mode handler. To avoid M-mode intercepting every OS-bound exception, M-mode can delegate classes of events directly to S-mode via:

mideleg — routes specific async interrupts directly to S-mode.
medeleg — routes specific sync exceptions directly to S-mode.
Exceptions never downgrade privilege: an M-mode exception always resolves in M-mode.

S-mode has its own exception-handling CSR subset — sepc, stvec, scause, sscratch, stval, sstatus, sie/sip — each performing the same role as its M-mode counterpart. sret behaves identically to mret but operates on S-mode CSRs.

On delegated exception — hardware atomically:

sepc ← PC, then PC ← stvec.
scause ← exception cause; stval ← faulting address or exception-specific data.
SPIE ← SIE, then SIE ← 0 (disables S-mode interrupts).
SPP ← current privilege mode, then privilege elevated to S-mode.

Page-based virtual memory

When paging is enabled, most addresses (load/store effective addresses and the PC) are virtual and must be translated to physical addresses. Accessing an unmapped page or one with insufficient permissions raises a page fault exception.

Pages: memory is divided into fixed-size 4 KiB base pages (the fundamental unit). This size has been standard for five decades. Larger pages, called megapages and gigapages, also exist; each is mapped by a single leaf PTE at a higher level of the tree, replacing the entire subtree below it.

Page table — a tree structure in memory that maps virtual page numbers to physical page numbers. Each node in the tree is itself exactly 4 KiB — the same as a base page — which simplifies OS memory allocation. A leaf node (PTE with R/W/X ≠ 0) holds a physical page number; a non-leaf node (R/W/X = 0) holds a pointer to the next level.

Page table entry (PTE) fields:

Field	Meaning
V	Valid; if 0 any traversal through this PTE faults
R / W / X	Read / write / execute permissions; all-zero = pointer to next level (non-leaf)
U	If 0: U-mode cannot access, S-mode can. If 1: U-mode can, S-mode cannot (unless sstatus.SUM is set, which permits S-mode loads and stores but never instruction fetches)
G	Global — mapping exists in all address spaces; used for OS pages
A / D	Accessed / Dirty — set by hardware; OS uses them to approximate LRU and decide which pages to swap
RSW	Reserved for OS; hardware ignores it
PPN	Leaf: physical page number of the target page. Non-leaf: physical page number of the next-level page table
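
As a concrete illustration of the table above, the following Python sketch unpacks a 64-bit Sv39 PTE. The bit positions (V, R, W, X, U, G, A, D in bits 0 to 7, RSW in bits 9:8, PPN in bits 53:10) come from the privileged specification; the function name `decode_sv39_pte` and the returned dictionary layout are choices made for this example.

```python
def decode_sv39_pte(pte):
    """Unpack a 64-bit Sv39 page table entry into named fields."""
    # Flag bits 0..7 appear in this order: V R W X U G A D.
    fields = {name: bool((pte >> bit) & 1) for bit, name in enumerate("VRWXUGAD")}
    fields["RSW"] = (pte >> 8) & 0x3
    fields["PPN"] = (pte >> 10) & ((1 << 44) - 1)
    # A PTE is a leaf when any of R/W/X is set; otherwise it points to the next level.
    fields["leaf"] = fields["R"] or fields["W"] or fields["X"]
    return fields


leaf = decode_sv39_pte((0x80000 << 10) | 0xCF)  # V, R, W, X, A, D set
pointer = decode_sv39_pte((0x1234 << 10) | 0x01)  # only V set
print(leaf["leaf"], hex(leaf["PPN"]), pointer["leaf"])
```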

satp CSR — enables and configures the paging system. M-mode writes zero to satp before first entering S-mode (paging off); S-mode writes it again after building the page tables.

MODE: selects the scheme (Bare = off, Sv32, Sv39, Sv48).
ASID: optional per-process tag on TLB entries; reduces flush overhead on context switch.
PPN: physical address of the root page table divided by 4 KiB.
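
For RV64, the three satp fields pack into one 64-bit value. The sketch below assumes the RV64 layout from the privileged specification (MODE in bits 63:60, ASID in bits 59:44, PPN in bits 43:0) and the MODE encodings Bare = 0, Sv39 = 8, Sv48 = 9; the helper name `make_satp_rv64` is invented for this example.

```python
SATP_MODE_BARE, SATP_MODE_SV39, SATP_MODE_SV48 = 0, 8, 9


def make_satp_rv64(mode, asid, root_table_paddr):
    """Build an RV64 satp value from a mode, an ASID, and the root table's physical address."""
    if root_table_paddr % 4096 != 0:
        raise ValueError("root page table must be 4 KiB aligned")
    ppn = root_table_paddr >> 12  # physical address divided by 4 KiB
    return (mode << 60) | ((asid & 0xFFFF) << 44) | (ppn & ((1 << 44) - 1))


satp = make_satp_rv64(SATP_MODE_SV39, asid=1, root_table_paddr=0x8020_0000)
print(hex(satp))
```

Writing zero corresponds to `make_satp_rv64(SATP_MODE_BARE, 0, 0)`, which is the "paging off" state M-mode sets up before entering S-mode.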

Addressing schemes — named SvX where X is the virtual address width:

Scheme	ISA	VA bits	PA bits	Tree depth	Page sizes
Sv32	RV32	32	34	2 (radix 2¹⁰)	4 KiB, 4 MiB
Sv39	RV64	39	56	3 (radix 2⁹)	4 KiB, 2 MiB, 1 GiB
Sv48	RV64	48	56	4 (radix 2⁹)	4 KiB, 2 MiB, 1 GiB, 512 GiB

Sv32 uses 4-byte PTEs; Sv39/48 use 8-byte PTEs to hold wider physical addresses. The tree radix drops from 2¹⁰ to 2⁹ to preserve the invariant that one page table fits in exactly one page.
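
The one-table-per-page invariant can be checked with a little arithmetic: the number of entries is the page size divided by the PTE size, and the virtual address width is the 12-bit page offset plus one index field per level. The helper name `scheme_geometry` below is made up for this sketch.

```python
PAGE_SIZE = 4096  # bytes in a base page, and in one page table


def scheme_geometry(pte_bytes, levels):
    """Return (entries per table, index bits per level, virtual address bits)."""
    entries = PAGE_SIZE // pte_bytes
    index_bits = entries.bit_length() - 1  # log2 of a power of two
    va_bits = 12 + levels * index_bits     # 12-bit page offset plus one index per level
    return entries, index_bits, va_bits


for name, pte_bytes, levels in [("Sv32", 4, 2), ("Sv39", 8, 3), ("Sv48", 8, 4)]:
    print(name, scheme_geometry(pte_bytes, levels))
```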

Sv39 unused bits — virtual addresses are 39 bits stored in 64-bit registers; bits 63–39 must replicate bit 38. Valid ranges: 0x0000_0000_0000_0000–0x0000_003f_ffff_ffff and 0xffff_ffc0_0000_0000–0xffff_ffff_ffff_ffff. Violations fault. The gap is intentional — future ISA versions can reclaim those bits to extend the address space without breaking compatibility.
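
The sign-extension rule amounts to requiring bits 63 through 38 to be all zeros or all ones. A minimal Python check, with the hypothetical name `is_canonical_sv39`, might look like this:

```python
def is_canonical_sv39(va):
    """True if bits 63..39 of a 64-bit virtual address all equal bit 38."""
    va &= (1 << 64) - 1
    upper = va >> 38  # bits 63..38, a 26-bit field
    return upper == 0 or upper == (1 << 26) - 1


print(is_canonical_sv39(0x0000_003F_FFFF_FFFF))  # top of the lower range
print(is_canonical_sv39(0x0000_0040_0000_0000))  # first address inside the gap
print(is_canonical_sv39(0xFFFF_FFC0_0000_0000))  # bottom of the upper range
```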

Address translation (Sv39):

satp.PPN × 4096 + VA[38:30] × 8 → fetch level-2 PTE.
PTE.PPN × 4096 + VA[29:21] × 8 → fetch level-1 PTE.
PTE.PPN × 4096 + VA[20:12] × 8 → fetch leaf PTE.
Physical address = LeafPTE.PPN × 4096 + VA[11:0], where PPN spans the three sub-fields PPN[2], PPN[1], and PPN[0].
Processor performs the original load/store to that physical address.

For a normal 4 KiB leaf, all three PPN sub-fields are used as stored. For superpages, the lower PPN sub-fields are replaced by the matching VPN sub-fields: a gigapage (1 GiB) leaf, found at level 2, takes only PPN[2] from the PTE, with VA[29:12] supplying the rest; a megapage (2 MiB) leaf, found at level 1, takes PPN[2]:PPN[1], with VA[20:12] supplying the rest. Each page table holds exactly 512 entries (4096 / 8), so each VPN sub-field is 9 bits (2⁹ = 512). Sv32 uses the same logic with 4-byte PTEs and 1024-entry tables (VPN sub-fields are 10 bits).
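
The walk can be expressed as a short loop. The sketch below is a simplified model, not a complete implementation: it checks only the V bit and the leaf test, skips permission, A/D, and reserved-encoding checks, and reads PTEs through a caller-supplied function. The names `sv39_translate`, `read_pte`, and `PageFault` are hypothetical. The misaligned-superpage fault follows the privileged specification.

```python
PAGE_SIZE = 4096
PTE_SIZE = 8
LEVELS = 3


class PageFault(Exception):
    pass


def sv39_translate(va, satp_ppn, read_pte):
    """Translate an Sv39 virtual address; read_pte(paddr) returns a 64-bit PTE."""
    table = satp_ppn * PAGE_SIZE
    for level in range(LEVELS - 1, -1, -1):  # level 2, then 1, then 0
        vpn = (va >> (12 + 9 * level)) & 0x1FF
        pte = read_pte(table + vpn * PTE_SIZE)
        if not pte & 0x1:  # V = 0
            raise PageFault(hex(va))
        ppn = (pte >> 10) & ((1 << 44) - 1)
        if pte & 0xE:  # any of R/W/X set: this is a leaf
            low_mask = (1 << (9 * level)) - 1  # PPN bits supplied by the VA
            if ppn & low_mask:
                raise PageFault(hex(va))  # misaligned superpage
            page = ppn | ((va >> 12) & low_mask)
            return page * PAGE_SIZE + (va & 0xFFF)
        table = ppn * PAGE_SIZE  # non-leaf: descend one level
    raise PageFault(hex(va))  # no leaf found at level 0


# Toy memory: root table at 0x1000, with VPN[2] = 1, VPN[1] = 2, VPN[0] = 3.
mem = {
    0x1000 + 1 * 8: (0x2 << 10) | 0x1,   # pointer to the table at 0x2000
    0x2000 + 2 * 8: (0x3 << 10) | 0x1,   # pointer to the table at 0x3000
    0x3000 + 3 * 8: (0x80 << 10) | 0x7,  # leaf: V, R, W, physical page 0x80
}
va = (1 << 30) | (2 << 21) | (3 << 12) | 0xABC
print(hex(sv39_translate(va, 1, lambda addr: mem.get(addr, 0))))
```

Passing the memory accessor as a parameter keeps the sketch independent of any particular memory model; a simulator would substitute its own physical-memory read.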

TLB and sfence.vma — walking the page table on every memory access would add up to three extra memory reads per fetch, load, or store in Sv39, which would cripple performance. Processors therefore cache recent translations in a translation lookaside buffer (TLB). The TLB is not automatically kept coherent with the page table; when S-mode modifies page tables it must execute sfence.vma to flush stale entries. Optional arguments narrow the flush: rs1 scopes it to one virtual address, rs2 scopes it to one ASID; x0 for both flushes the entire TLB.
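
The scoping behaviour of sfence.vma can be pictured with a toy software TLB. This is only a conceptual model: real TLBs are hardware structures, and the sketch ignores global mappings and superpages. The class name `ToyTLB` and its methods are invented for this example, with `None` standing in for an x0 operand.

```python
class ToyTLB:
    """A dictionary-backed model of a TLB keyed by (ASID, virtual page number)."""

    def __init__(self):
        self.entries = {}  # (asid, vpn) -> ppn

    def insert(self, asid, va, ppn):
        self.entries[(asid, va >> 12)] = ppn

    def lookup(self, asid, va):
        return self.entries.get((asid, va >> 12))  # None models a TLB miss

    def sfence_vma(self, va=None, asid=None):
        """Drop matching entries; None for an operand means 'match everything'."""
        vpn = None if va is None else va >> 12
        self.entries = {
            (a, v): p
            for (a, v), p in self.entries.items()
            if not ((asid is None or a == asid) and (vpn is None or v == vpn))
        }


tlb = ToyTLB()
tlb.insert(1, 0x1000, 0x80)
tlb.insert(1, 0x2000, 0x81)
tlb.insert(2, 0x1000, 0x90)
tlb.sfence_vma(va=0x1000, asid=1)  # flush one page of one address space
print(tlb.lookup(1, 0x1000), tlb.lookup(1, 0x2000), tlb.lookup(2, 0x1000))
```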

Future RISC-V Optional Extensions

All extensions are optional and modular. The RISC-V Foundation ratifies them only after public debate and at least one implementation, keeping the rate of change deliberately slow.

B — Bit Manipulation

Hardware instructions for bit-field insert/extract/test, rotations, funnel shifts, bit/byte permutations, and counts (leading zeros, trailing zeros, set bits).

E — Embedded

Reduces the integer register file from 32 to 16 registers (x0–x15) to cut die area on cost-constrained cores. The resulting base ISA, used in place of RV32I, is called RV32E.

H — Hypervisor

Adds a hypervisor privilege level with a second stage of page-based address translation, enabling efficient concurrent execution of multiple OSes on one hart.

J — Dynamically Translated Languages

ISA support for JIT-compiled languages (Java, JavaScript): hardware dynamic runtime checks and accelerated garbage collection barriers.

L — Decimal Floating-Point

IEEE 754-2008 decimal FP arithmetic. Eliminates the binary approximation error for decimal fractions (e.g., 0.1) by matching the computation radix to the I/O radix.
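
The approximation error in question is easy to observe in software. The sketch below uses Python's standard `decimal` module, which is a software decimal implementation and not the L extension itself, to contrast binary and decimal arithmetic on the same inputs.

```python
from decimal import Decimal

binary_sum = 0.1 + 0.2                         # IEEE 754 binary doubles
decimal_sum = Decimal("0.1") + Decimal("0.2")  # radix-10 arithmetic

print(binary_sum == 0.3)              # False: 0.1 and 0.2 are not exact in binary
print(decimal_sum == Decimal("0.3"))  # True: the decimal values are exact
```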

N — User-Level Interrupts

Routes U-mode interrupts and exceptions directly to a user-level trap handler, bypassing M/S-mode. Primary use: secure embedded systems (M+U only). In Unix environments, a building block for user-level events such as GC barriers, integer overflow, and FP traps.

P — Packed-SIMD

Subdivides existing registers for data-parallel computation on narrow types, reusing wide datapaths. A lightweight alternative to V; prefer V when dedicated hardware resources are available.

Q — Quad-Precision Floating-Point

128-bit quad-precision binary FP compliant with IEEE 754-2008. FP registers extend to hold single, double, or quad-precision values. Requires RV64IFD.
