Why RISC-V?
RISC-V is an open, modular ISA designed to run across the full computing spectrum — from embedded microcontrollers to supercomputers — without being owned or discontinued by any single company.
Modular vs. incremental design
Conventional architectures grow incrementally: every new processor must implement all past extensions to preserve binary compatibility. x86 expanded from 80 instructions to over 3,600, growing at roughly three per month.
RISC-V breaks this pattern with a strictly modular approach:
- A frozen, minimal base ISA (RV32I) that can run a complete software stack.
- Optional standard extensions for specific needs — hardware only includes what it requires.
- If software invokes an omitted extension, the hardware traps and a software library handles it.
Design metrics
Seven measures govern RISC-V architectural decisions:
- Cost: Cost grows non-linearly with die area (roughly with the square of the area). Smaller dies improve both cost and yield. A RISC-V core requires roughly half the die area of an equivalent ARM-32 core.
- Simplicity: Complex instructions are often ignored by compilers anyway. Simpler ISAs reduce design and verification cost.
- Performance: A simpler ISA may need more instructions per program but enables faster clocks and lower CPI.
- Implementation isolation: ISA features optimized for one microarchitecture generation must not penalize future ones. Examples of past mistakes: delayed branches (helped 5-stage pipelines, hurt out-of-order cores) and load-multiple (good for single-issue, bad for multi-issue scheduling).
- Room for growth: Opcode space must be reserved for future custom accelerators. Exhausting it forces workarounds like separate 16-bit ISAs toggled via address bits.
- Program size: Smaller binaries reduce instruction cache misses and DRAM power. Combining 32-bit and 16-bit compressed instructions beats variable-length encodings burdened by legacy prefixes.
- Ease of programming and compiling: 32 integer registers simplify register allocation versus 8 or 16. Native PC-relative addressing supports position-independent code and dynamic linking.
Standard extensions
| Extension | Name | Purpose |
|---|---|---|
| M | Multiply/Divide | Integer multiply and divide; omitted on minimal embedded chips |
| F | Single-precision FP | IEEE 754 single-precision floating point |
| D | Double-precision FP | IEEE 754 double-precision floating point |
| A | Atomic | Load-reserved/store-conditional and AMOs for multiprocessor sync |
| C | Compressed | 16-bit encodings of common 32-bit instructions; ~400 gates of decoder overhead |
| V | Vector | Dynamic vector length and type per register, replacing fixed SIMD |
| RV64 | 64-bit | Widens registers and adds doubleword variants; preserves RV32 structure |
| Privileged | System | Machine/Supervisor/User modes, hardware paging, OS execution |
RV32G (or RV64G) denotes the combined IMAFD base — the standard general-purpose configuration.
Stability
The complete RISC-V specification is ~236 pages. Equivalent incremental architectures require 2,100–2,700 pages. A frozen base ISA paired with openly debated optional extensions keeps the compiler and OS targets stable indefinitely.
RV32I
RV32I is the frozen base integer ISA — 32-bit registers, 32-bit fixed-width instructions, enough to run a complete software stack.

Registers
- 32 general-purpose 32-bit registers (x0–x31). x0 is hardwired to zero, eliminating the need for dedicated zero-state or unary instructions. Moves and negations are synthesized using standard instructions with x0 as a source.
- The PC is separate from the register file. Keeping it out of the general registers prevents arbitrary arithmetic from causing control-flow side effects, which stabilizes branch prediction.
Instruction formats
Six fixed 32-bit formats cover all instruction classes:
- R-type: Register-to-register operations (two sources, one destination).
- I-type: Short immediates and loads.
- S-type: Stores.
- B-type: Conditional branches — a rotated variant of S-type.
- U-type: Long immediates (20-bit upper).
- J-type: Unconditional jumps — a rotated variant of U-type.

Key encoding choices:
- Fixed register specifiers: rs1, rs2, and rd sit in identical bit positions across all formats, so register reads begin before decoding completes.
- Sign bit always at bit 31: Immediate sign extension runs in parallel with decode.
- Rotated immediates: Immediate bits are scattered to minimize signal fanout and hardware multiplexing cost — the B and J scrambling looks odd but reduces wiring.
- Trap patterns: All-zeros and all-ones are illegal instructions, catching out-of-bounds jumps and unprogrammed memory.
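To make the scattered B-type immediate concrete, here is a minimal Python sketch of the bit placement. The function names encode_b_imm and decode_b_imm are illustrative and not part of any toolchain, and only the immediate bits are modeled: opcode, funct3, and register fields are left as zero.

```python
def encode_b_imm(offset):
    """Scatter an even, 13-bit signed byte offset into B-type immediate bits."""
    if offset % 2 or not -4096 <= offset <= 4094:
        raise ValueError("offset must be even and fit in 13 signed bits")
    imm = offset & 0x1FFF
    return (((imm >> 12) & 0x1) << 31 |   # imm[12]   -> inst[31] (sign bit)
            ((imm >> 5) & 0x3F) << 25 |   # imm[10:5] -> inst[30:25]
            ((imm >> 1) & 0xF) << 8 |     # imm[4:1]  -> inst[11:8]
            ((imm >> 11) & 0x1) << 7)     # imm[11]   -> inst[7]


def decode_b_imm(inst):
    """Gather the B-type immediate back into a signed byte offset."""
    imm = (((inst >> 31) & 0x1) << 12 |
           ((inst >> 25) & 0x3F) << 5 |
           ((inst >> 8) & 0xF) << 1 |
           ((inst >> 7) & 0x1) << 11)
    # Bit 12 of the offset is the sign, taken from inst[31].
    return imm - 0x2000 if imm & 0x1000 else imm
```

The sign always comes from inst[31], which is what lets sign extension start before the format is known. Bit 0 of the offset is never stored, which is why the encoded field is described as a 12-bit immediate multiplied by 2.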
Integer computation
- Arithmetic: add, sub; addi (no subi; use a negative immediate).
- Logical: and, or, xor and their immediate forms.
- Shifts: sll, srl, sra (logical/arithmetic) with register or immediate shift amount.
- Comparison: slt, sltu, slti, sltiu write 1 to rd if true, 0 otherwise.
- Upper immediates: lui loads a 20-bit constant into the upper 20 bits; auipc adds it to the PC. Combining either with a 12-bit immediate instruction synthesizes any 32-bit constant or PC-relative address in two instructions.
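The two-instruction constant synthesis has one subtlety: addi sign-extends its 12-bit immediate, so the upper part must be adjusted whenever bit 11 of the constant is set. The following Python sketch shows the split; split_hi_lo and lui_addi are hypothetical helper names used only for illustration.

```python
def split_hi_lo(value):
    """Split a 32-bit constant into (hi20, lo12) for a lui + addi pair."""
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000                      # addi sign-extends its immediate
    hi = ((value - lo) >> 12) & 0xFFFFF   # compensate for a negative lo
    return hi, lo


def lui_addi(hi, lo):
    """Model what lui rd, hi followed by addi rd, rd, lo leaves in rd."""
    return ((hi << 12) + lo) & 0xFFFFFFFF
```

For 0xDEADBEEF the low part becomes negative, so the upper part is 0xDEADC rather than 0xDEADB. The same adjustment applies to the auipc-based PC-relative sequences.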
Multiply, divide, and overflow detection are excluded to keep the minimal hardware footprint small. Overflow is handled in software; multiply/divide live in the M extension.
Loads and stores
Single addressing mode: base register + sign-extended 12-bit immediate.
- lw/sw: 32-bit word.
- lh/sh: 16-bit halfword; loads sign-extend to 32 bits.
- lb/sb: 8-bit byte; loads sign-extend to 32 bits.
- lhu/lbu: zero-extending unsigned variants.
Memory is little-endian. Misaligned accesses are permitted by the ISA, though an implementation may execute them slowly or trap and emulate them in software. There is no push/pop: stack operations are just sw/lw with the stack pointer register and displacement addressing.
Conditional branches
Branches compare two registers directly — no condition codes. Condition codes create implicit dependencies that stall out-of-order pipelines.
- beq, bne: equality.
- blt, bge: signed comparison.
- bltu, bgeu: unsigned comparison.
Inverse comparisons use swapped operands (for example, bgt a, b is written as blt b, a). The 12-bit immediate is multiplied by 2, sign-extended, and added to the PC. There are no delayed branches; they were left out to avoid binding the ISA to any particular pipeline depth.
Unconditional jumps
- jal: PC-relative jump using a 20-bit immediate × 2; saves PC+4 to rd (return address).
- jalr: Register-indirect jump using base + 12-bit immediate; saves PC+4 to rd.
Setting rd = x0 discards the link, giving a plain jump or subroutine return.
System instructions
- CSR instructions (csrrw, csrrs, csrrc + immediate variants): read/write hardware counters (cycle timer, wall-clock time, instruction retirement count).
- ecall: Request a service from the OS or execution environment.
- ebreak: Transfer control to the debugger.
- fence: Order I/O and memory accesses visible to other threads or devices.
- fence.i: Flush the instruction pipeline so recent stores are visible to instruction fetch.
The ABI and OS conventions give these instructions their meaning — ecall alone does nothing without a defined calling convention.
Encoding reference

RISC-V Assembly
Calling convention
Function execution follows a fixed lifecycle: place arguments, jump (jal), acquire local storage and save registers, execute, place result and release storage, return (ret).
Registers are partitioned by preservation guarantee:
- Temporaries (not preserved across a call): arguments and return values (a0–a7), temporaries (t0–t6), return address (ra).
- Saved registers (callee must preserve): s0–s11, stack pointer (sp).
- Hardwired zero: x0 always reads as 0.

Stack frame:
- Prologue: addi sp, sp, -framesize to allocate; save registers to stack (e.g., sw ra, framesize-4(sp)).
- Epilogue: restore registers, addi sp, sp, framesize, then ret.
RV32E: embedded variant that cuts the register file to 16 (x0–x15) to reduce die area.
Assembler directives
Directives control data placement and code structure:
| Directive | Effect |
|---|---|
| .text | Subsequent items go into the code section |
| .data | Subsequent items go into initialized data |
| .bss | Subsequent items go into zero-initialized data |
| .section .foo | Subsequent items go into section .foo |
| .align n | Align next datum to a 2^n-byte boundary |
| .balign n | Align next datum to an n-byte boundary |
| .globl sym | Export sym as globally visible |
| .string "str" | Store null-terminated string |
| .byte b1,... | Store 8-bit values |
| .half w1,... | Store 16-bit halfwords |
| .word w1,... | Store 32-bit words |
| .dword w1,... | Store 64-bit doublewords |
| .float f1,... | Store single-precision FP values |
| .double d1,... | Store double-precision FP values |
| .option rvc (norvc) | Enable/disable compressed instruction emission |
| .option pic (nopic) | Enable/disable position-independent code |
| .option relax (norelax) | Enable/disable linker relaxation |
| .option push (pop) | Save/restore current option state |
Pseudoinstructions
Arithmetic and register — missing operations synthesized using x0 as a zero source or identity immediates
| Pseudo | Expands to |
|---|---|
| nop | addi x0, x0, 0 |
| neg rd, rs | sub rd, x0, rs |
| negw rd, rs | subw rd, x0, rs |
| mv rd, rs | addi rd, rs, 0 |
| not rd, rs | xori rd, rs, -1 |
| sext.w rd, rs | addiw rd, rs, 0 |
| li rd, imm | lui+addi sequence (arbitrary 32-bit constant) |
Set conditions — hardware only has slt/sltu; all four cases are covered by swapping operands or using x0
| Pseudo | Expands to |
|---|---|
| seqz rd, rs | sltiu rd, rs, 1 |
| snez rd, rs | sltu rd, x0, rs |
| sltz rd, rs | slt rd, rs, x0 |
| sgtz rd, rs | slt rd, x0, rs |
Branches against zero: the base ISA provides only beq, bne, blt, and bge (plus the unsigned bltu/bgeu); comparisons against zero use x0 as one operand, swapping operands where needed
| Pseudo | Expands to |
|---|---|
| beqz rs, off | beq rs, x0, off |
| bnez rs, off | bne rs, x0, off |
| blez rs, off | bge x0, rs, off |
| bgez rs, off | bge rs, x0, off |
| bltz rs, off | blt rs, x0, off |
| bgtz rs, off | blt x0, rs, off |
Operand-swapped branches — since A > B ≡ B < A
| Pseudo | Expands to |
|---|---|
| bgt rs, rt, off | blt rt, rs, off |
| ble rs, rt, off | bge rt, rs, off |
| bgtu rs, rt, off | bltu rt, rs, off |
| bleu rs, rt, off | bgeu rt, rs, off |
Jumps and calls — omitting rd defaults to x0 (discard link) or x1 (save return address); call/tail use two instructions to reach any 32-bit offset
| Pseudo | Expands to |
|---|---|
| j off | jal x0, off |
| jr rs | jalr x0, rs, 0 |
| ret | jalr x0, x1, 0 |
| jal off | jal x1, off |
| jalr rs | jalr x1, rs, 0 |
| call off | auipc x1, off[31:12] + jalr x1, x1, off[11:0] |
| tail off | auipc x6, off[31:12] + jalr x0, x6, off[11:0] |
Addressing and memory — all expand to auipc+load/store pairs, since a symbol address requires a 32-bit PC-relative offset that no single instruction can encode
| Pseudo | Expands to |
|---|---|
| la rd, sym | auipc+addi/lw/ld (PC-relative or GOT) |
| lla rd, sym | auipc+addi (local PC-relative only) |
| l{b\|h\|w\|d} rd, sym | auipc + respective load |
| s{b\|h\|w\|d} rs, sym, rt | auipc + respective store |
| fl{w\|d} rd, sym, rt | auipc+flw/fld |
| fs{w\|d} rs, sym, rt | auipc+fsw/fsd |
Floating-point — sign-injection hardware exploited for three common ops
| Pseudo | Expands to |
|---|---|
| fmv.s/d rd, rs | fsgnj.s/d rd, rs, rs |
| fabs.s/d rd, rs | fsgnjx.s/d rd, rs, rs |
| fneg.s/d rd, rs | fsgnjn.s/d rd, rs, rs |
CSR and counters — all route through csrrs/csrrw/csrrc with x0 to discard or supply a zero source
| Pseudo | Expands to |
|---|---|
| csrr rd, csr | csrrs rd, csr, x0 |
| csrw csr, rs | csrrw x0, csr, rs |
| csrs csr, rs | csrrs x0, csr, rs |
| csrc csr, rs | csrrc x0, csr, rs |
| csrwi/csrsi/csrci csr, imm | immediate variants of above |
| rdcycle[h] rd | csrrs rd, cycle[h], x0 |
| rdtime[h] rd | csrrs rd, time[h], x0 |
| rdinstret[h] rd | csrrs rd, instret[h], x0 |
| frcsr rd / fscsr rs | read/write fcsr via csrrs/csrrw |
| frrm rd / fsrm rs | read/write frm via csrrs/csrrw |
| frflags rd / fsflags rs | read/write fflags via csrrs/csrrw |
Miscellaneous — bare fence with no operands is shorthand for the maximally conservative ordering
- fence: defaults to fence iorw, iorw (all memory and I/O).
Memory layout
High addresses
┌─────────────┐
│ Stack │ grows downward
├─────────────┤
│ ↓ │
│ ↑ │
├─────────────┤
│ Heap │ grows upward (dynamic allocation)
├─────────────┤
│ Static data │ globals, constants
├─────────────┤
│ Text │ machine instructions (starts at 0x00010000)
└─────────────┘
Low addresses
Position-independent code (PIC): uses PC-relative addressing (auipc, jalr) so the binary runs correctly regardless of where it is loaded in memory.
ABIs:
- ilp32: The C language data types int, long, and pointers are 32 bits; FP arguments pass through integer registers.
- ilp32f, ilp32d: single- or double-precision FP arguments pass through dedicated FP registers.
Linking
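Linker relaxation: the linker shortens instruction sequences once final addresses are known. A two-instruction call (auipc + jalr) becomes a single jal when the target is within the ±1 MiB reach of jal, and a two-instruction data access (lui or auipc plus a load/store) becomes a single gp- or tp-relative access when the datum lies within ±2 KiB of the global pointer (gp) or thread pointer (tp).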
Static linking: all library code is copied into the executable. Wastes memory when multiple programs share the same library, and ties the binary to a fixed library version.
Dynamic linking: libraries are mapped into memory at the moment of first call.
- First call hits a 3-instruction stub that invokes the dynamic linker, which maps the function and patches the symbol table pointer.
- Subsequent calls jump directly through the updated pointer.
- The library exists once in system memory regardless of how many processes use it.
Loader: the OS injects the binary into memory, starts the dynamic linker for any unresolved dependencies, and transfers control to the entry point.
RV32M
The M extension adds integer multiply and divide to the base ISA. It is optional — embedded chips that never need it can omit it entirely, with software fallback via trap.

Multiplication
Two 32-bit operands produce a 64-bit product. Rather than write to two destination registers at once, the result is retrieved in two separate instructions:
- mul: lower 32 bits of the product (signed or unsigned; same bits either way).
- mulh: upper 32 bits, both operands signed.
- mulhu: upper 32 bits, both operands unsigned.
- mulhsu: upper 32 bits, one signed and one unsigned; used as a substep in multi-word signed multiplication.
Overflow detection:
- Unsigned: overflow absent if mulhu result is zero.
- Signed: overflow absent if all bits of mulh match the sign bit of mul (0 for positive, 0xFFFFFFFF for negative).
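The two overflow rules can be checked with a small Python model of the multiply instructions. This is a sketch of the arithmetic only; the function names mirror the instruction mnemonics, while unsigned_overflow and signed_overflow are hypothetical helpers.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mul(a, b):
    """Lower 32 bits of the product (same for signed and unsigned)."""
    return (a * b) & MASK32


def mulh(a, b):
    """Upper 32 bits of the signed x signed product."""
    return ((to_signed32(a) * to_signed32(b)) >> 32) & MASK32


def mulhu(a, b):
    """Upper 32 bits of the unsigned x unsigned product."""
    return (((a & MASK32) * (b & MASK32)) >> 32) & MASK32


def unsigned_overflow(a, b):
    return mulhu(a, b) != 0


def signed_overflow(a, b):
    lo, hi = mul(a, b), mulh(a, b)
    expected_hi = MASK32 if lo & 0x80000000 else 0   # replicate sign of mul
    return hi != expected_hi
```

In assembly this corresponds to issuing mulh (or mulhu) alongside mul and comparing the upper half against the expected pattern before using the result.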
Division and remainder
- div/divu: signed and unsigned quotient.
- rem/remu: signed and unsigned remainder.
No hardware trap on divide-by-zero. Software handles it with a beqz check on the divisor before the division instruction.
Design notes
- Results go directly into general-purpose registers, with no dedicated HI/LO registers like MIPS-32. Dedicated registers add architectural state, slow context switches, and require extra move instructions.
- Compilers optimize constant division: powers of 2 use shifts (srl for unsigned division by 2^i); other constants use multiplication by an approximate reciprocal plus correction on the upper half.
- ARM-32 had no hardware divide at all until 2005.
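As an illustration of the reciprocal trick, unsigned division by 3 can be done with one mulhu and one shift. The constant 0xAAAAAAAB is ceil(2^33 / 3); the helper names udiv3 and udiv_pow2 are made up for this sketch, and a real compiler picks the constant and shift per divisor.

```python
MASK32 = 0xFFFFFFFF


def mulhu(a, b):
    """Upper 32 bits of the unsigned 32 x 32 product."""
    return ((a & MASK32) * (b & MASK32)) >> 32


def udiv_pow2(n, k):
    """Unsigned division by 2**k, as a single srl would do."""
    return (n & MASK32) >> k


def udiv3(n):
    """Unsigned division by 3: take the high half of n * ceil(2**33 / 3),
    then shift right by one more bit (33 bits in total)."""
    return mulhu(n, 0xAAAAAAAB) >> 1
```

The multiply-and-shift sequence avoids the much slower div instruction whenever the divisor is known at compile time.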
RV32F and RV32D
F adds single-precision (32-bit) and D adds double-precision (64-bit) floating-point, both conforming to IEEE 754-2008.

Registers and state
- 32 dedicated FP registers f0–f31, separate from the integer file. Doubling register bandwidth this way avoids widening the instruction register specifier fields.
- f0 is a normal read-write register; unlike integer x0, it is not hardwired to zero.
- When both F and D are implemented, single-precision operations use the lower 32 bits of the 64-bit f registers.
fcsr (floating-point control and status register):
- frm (rounding mode): round-to-nearest-even (default), round-toward-zero, round-down, round-up, round-to-nearest-max-magnitude. Individual instructions can override via a static rounding mode argument.
- fflags: five accrued exception flags: Invalid (NV), Divide-by-zero (DZ), Overflow (OF), Underflow (UF), Inexact (NX).
Loads, stores, and register transfers
- flw/fsw: 32-bit load/store using base + 12-bit immediate, same addressing as integer.
- fld/fsd: 64-bit load/store.
- fmv.x.w: copy a single-precision value from f to x register (bit-exact, no conversion).
- fmv.w.x: copy from x to f register.
Arithmetic
- Standard: fadd, fsub, fmul, fdiv, fsqrt, all with .s (single) and .d (double) suffixes.
- fmin/fmax: write the smaller or larger of two operands directly, no branch needed.
- Fused multiply-add (R4 format: three sources, one destination):
  - fmadd: rs1 × rs2 + rs3
  - fmsub: rs1 × rs2 − rs3
  - fnmadd: −(rs1 × rs2) − rs3
  - fnmsub: −(rs1 × rs2) + rs3
  - A single rounding step at the end gives higher precision and speed than a separate multiply followed by add.
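The precision benefit of a single rounding step can be demonstrated in Python by computing the product and sum exactly with rational arithmetic and rounding once at the end. This is only a model of fmadd.d semantics under round-to-nearest-even; the function names are illustrative.

```python
from fractions import Fraction


def fmadd(a, b, c):
    """a * b + c computed exactly, then rounded to a double once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def mul_then_add(a, b, c):
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c


a = 1 + 2**-30
b = 1 - 2**-30
fused = fmadd(a, b, -1.0)            # keeps the tiny -2**-60 term
separate = mul_then_add(a, b, -1.0)  # product rounds to 1.0, so this is 0.0
```

Here the exact product is 1 − 2^-60, which is not representable as a double; the separate multiply rounds it to 1.0 and the information is lost before the add.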
Comparisons and control flow
No dedicated FP branch instructions. Instead, comparisons write a boolean into an integer register, and standard integer branches act on it:
- feq.s/d, flt.s/d, fle.s/d: write 1 or 0 to an x register.
Conversion, sign injection, and classification
Conversion (fcvt family):
- Between signed/unsigned 32-bit integers and single/double precision in both directions.
- Between single and double precision (fcvt.s.d, fcvt.d.s).
Sign injection — copies a value while manipulating only its sign bit:
- fsgnj: take sign from a second source.
- fsgnjn: take inverted sign from a second source.
- fsgnjx: XOR the sign bits of both sources.
- These underpin pseudoinstructions: fabs uses fsgnjx (a sign bit XORed with itself is 0), fneg uses fsgnjn, fmv uses fsgnj.
Classification (fclass.s/d): writes a 10-bit one-hot mask to an integer register identifying which of the 10 IEEE 754 classes the operand belongs to: −∞, negative normal, negative subnormal, −0, +0, positive subnormal, positive normal, +∞, signaling NaN, quiet NaN.
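A short Python model of fclass.s may help when reading the mask. It assumes the classes map to bits 0 through 9 in the order listed above, and it works on the raw binary32 bit pattern; fclass_bits and fclass_s are hypothetical names.

```python
import struct


def fclass_bits(bits):
    """Return the one-hot class mask for a 32-bit single-precision pattern."""
    sign = (bits >> 31) & 1
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0xFF:
        if frac == 0:
            idx = 0 if sign else 7              # -inf / +inf
        else:
            idx = 9 if frac & 0x400000 else 8   # quiet / signaling NaN
    elif exp == 0:
        if frac == 0:
            idx = 3 if sign else 4              # -0 / +0
        else:
            idx = 2 if sign else 5              # negative / positive subnormal
    else:
        idx = 1 if sign else 6                  # negative / positive normal
    return 1 << idx


def fclass_s(x):
    """Classify a Python float after converting it to single precision."""
    return fclass_bits(struct.unpack("<I", struct.pack("<f", x))[0])
```

Because the result is one-hot, software can test for several classes at once by ANDing with a combined mask, for example bits 8 and 9 for any NaN.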
RV32A
The A extension provides atomic instructions for synchronization in multiprocessor environments. All RV32A instructions require naturally aligned addresses — hardware cannot efficiently guarantee atomicity across cache-line boundaries.

LR/SC
LR/SC implements an atomic operation across two linked instructions, avoiding a three-operand instruction that would complicate the standard datapath.
- lr.w (load reserved): reads a word from memory into a register and places a reservation on that address.
- sc.w (store conditional): attempts to write to the reserved address.
  - Succeeds: writes the value, sets destination register to 0.
  - Fails (reservation broken by another hart): destination register gets a nonzero code; memory is unchanged.
This pair synthesizes any synchronization primitive, including compare-and-swap (CAS).
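The compare-and-swap construction can be sketched with a toy, single-threaded Python model of reservations. ReservedMemory and compare_and_swap are invented for this illustration; real hardware tracks reservations per hart with implementation-defined granularity, which this model does not attempt to capture.

```python
class ReservedMemory:
    """Toy word memory with one reservation slot per hart."""

    def __init__(self):
        self.words = {}
        self.reservations = {}   # hart id -> reserved address

    def lr_w(self, hart, addr):
        self.reservations[hart] = addr
        return self.words.get(addr, 0)

    def sc_w(self, hart, addr, value):
        """Return 0 on success, nonzero on failure (as sc.w does)."""
        if self.reservations.pop(hart, None) != addr:
            return 1
        self.words[addr] = value
        # A successful store breaks every other reservation on this address.
        for other, reserved in list(self.reservations.items()):
            if reserved == addr:
                del self.reservations[other]
        return 0


def compare_and_swap(mem, hart, addr, expected, new):
    """CAS built from an lr.w / sc.w retry loop."""
    while True:
        if mem.lr_w(hart, addr) != expected:
            return False          # value differs: give up without storing
        if mem.sc_w(hart, addr, new) == 0:
            return True           # store succeeded atomically
```

The retry loop only repeats when the store-conditional fails, which in this model happens when another hart stored to the same address in between.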
AMOs
AMOs execute a full read-modify-write as a single indivisible hardware operation — no interrupt or remote modification can occur between the read and the write.
Execution: read current value → apply ALU operation with a source register → write result back → return the original value to the destination register.
| Instruction | Operation |
|---|---|
amoswap.w | Swap |
amoadd.w | Add |
amoand.w, amoor.w, amoxor.w | Bitwise AND, OR, XOR |
amomin.w, amomax.w | Signed min/max |
amominu.w, amomaxu.w | Unsigned min/max |
AMOs scale better than LR/SC polling loops in large multiprocessor systems and streamline atomic I/O device communication.
Memory ordering
RISC-V uses a relaxed memory model — harts may observe accesses out of program order. All RV32A instructions carry two annotation bits to enforce ordering at critical points:
- aq (acquire): when set, this atomic op is visible before all subsequent memory accesses by this hart.
- rl (release): when set, this atomic op is visible after all previous memory accesses by this hart.
Lock acquire sets aq to ensure the lock is held before guarded data is read. Lock release sets rl to ensure all data writes are visible before the lock is relinquished.
RV32C: Compressed Instructions
The C extension maps the most common 32-bit instructions to 16-bit encodings, shrinking binary size without changing the ISA visible to the compiler or programmer.
- The assembler transparently picks the 16-bit form whenever possible — the compiler emits normal instructions and never knows.
- The hardware decoder expands 16-bit instructions back to their 32-bit equivalents before execution, adding only ~400 gates of overhead.
- The two lowest bits of every 32-bit instruction are always 11; any other pattern signals a 16-bit instruction, making 16/32-bit interleaving unambiguous.
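The length rule is simple enough to express in a few lines of Python. The sketch below walks a stream of little-endian 16-bit parcels and groups them into instructions; it assumes only 16-bit and 32-bit encodings are present, and the function names are illustrative.

```python
def instruction_length(first_halfword):
    """Return the instruction size in bytes from its first 16-bit parcel."""
    return 4 if (first_halfword & 0b11) == 0b11 else 2


def split_stream(halfwords):
    """Group a list of 16-bit parcels into per-instruction tuples."""
    instructions = []
    i = 0
    while i < len(halfwords):
        parcels = instruction_length(halfwords[i]) // 2
        instructions.append(tuple(halfwords[i:i + parcels]))
        i += parcels
    return instructions
```

For example, the parcel 0x0001 (c.nop) is a complete 16-bit instruction, while 0x0013 is the low half of the 32-bit nop (addi x0, x0, 0) and pulls in the following parcel.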
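RV64C diverges slightly from RV32C: it drops c.jal (rare in 64-bit code) and the single-precision FP loads and stores (c.flw, c.fsw, c.flwsp, c.fswsp), reusing their encodings for c.addiw and the doubleword accesses c.ld, c.sd, c.ldsp, and c.sdsp; it also adds c.addw and c.subw. The integer word accesses c.lw and c.sw remain available.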
Combined with the 32-bit base, RV32GC produces binaries significantly smaller than architectures with fixed-width encodings.
RV32V
The V extension replaces SIMD with a true vector model: vector length and element width are decoupled from the instruction encoding, so a single binary scales across hardware implementations without recompilation.

Registers and dynamic typing
- 32 vector registers v0–v31; data type and element width are set per-register, not per-opcode.
- vsetdcfg: configures which registers are active and their types: X8/16/32/64, X8U/..., F16/F32/F64.
- Maximum vector length (mvl): computed at runtime from total vector SRAM divided by the active element types. Disabling unused registers reallocates their memory, increasing mvl for active ones.
- Only enabled registers are saved/restored on context switch; unused registers cost nothing.
Computation instructions
- Arithmetic, logical, and FP operations from the base ISAs carry over with operand-type suffixes:
  - .vv: both operands are vector registers.
  - .vs: vector op scalar (x or f register).
  - .sv: scalar op vector (for asymmetric ops like Y = a - X).
- Fused multiply-add uses three-source suffixes: .vvv, .vvs, .vsv, .vss.
Loads and stores
| Mode | Instructions | Use case |
|---|---|---|
| Sequential | vld / vst | Contiguous arrays; 7-bit immediate offset scaled by element size |
| Strided | vlds / vsts | Multi-dimensional arrays; base register + byte-stride register |
| Indexed (gather/scatter) | vldx / vstx | Sparse arrays; base register + vector of byte offsets |
Vector length and loop control
- setvl: sets vl = min(requested, mvl). Handles arrays of any length, including edge cases and zero, without separate fringe logic, which eliminates the strip-mining bookkeeping that SIMD loops need.
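The loop structure that setvl enables can be sketched in Python. Here mvl is passed in as a parameter with an arbitrary default of 4, and vector_add stands in for a vectorized loop body; both are assumptions made for the illustration.

```python
def setvl(requested, mvl):
    """Model of setvl: grant at most mvl elements for this pass."""
    return min(requested, mvl)


def vector_add(x, y, mvl=4):
    """Add two equal-length lists, processing vl elements per pass.

    Returns the result and the vl value used on each pass.
    """
    result, lengths = [], []
    i, remaining = 0, len(x)
    while remaining > 0:
        vl = setvl(remaining, mvl)
        result.extend(a + b for a, b in zip(x[i:i + vl], y[i:i + vl]))
        lengths.append(vl)
        i += vl
        remaining -= vl
    return result, lengths
```

With ten elements and mvl of 4, the passes use vl values of 4, 4, and 2. The final short pass is handled by the same loop body, and a zero-length input simply skips the loop.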
Conditional execution
- 8 predicate registers vp0–vp7, each holding mvl bits. A 1 bit allows the element to be written; 0 leaves it unchanged.
- Comparison instructions (vplt, vpeq, etc.) populate a predicate register from a vector condition.
- Operations specify vp0 or vp1 as their mask; vpswap moves another predicate into the active slot.
- Predicate logic: vpand, vpor, vpxor, vpnot for compound conditions.
Permutation and reduction
- vselect: gather elements from a source vector using indices from a second vector.
- vmerge: merge elements from two vectors based on a predicate mask.
- vextract: copy a subset of elements from a calculated offset to the start of a destination vector; enables binary-halving reductions (e.g., sum all elements by iteratively halving until vl = 1).
Vector vs. SIMD
| | Vector (RV32V) | SIMD (x86 AVX / ARM NEON) |
|---|---|---|
| Code size | No strip-mining; setvl handles all lengths | Requires loop bookkeeping and fringe handling |
| Instruction count | 10–20× fewer dynamic instructions | Short registers force many iterations |
| ISA stability | Static; mvl expands transparently with wider hardware | Hundreds of new opcodes whenever register width increases |
RV64
RV64 widens all registers (including the PC) to 64 bits and adds a minimal set of word/doubleword variants to the RV32 base. The architecture structure is preserved — no overhaul required.
RV64I additions
- Word arithmetic: addw, addiw, subw compute in 32 bits and sign-extend the result to 64 bits.
- Word shifts: sllw, slliw, srlw, srliw, sraw, sraiw produce explicit 32-bit shift results.
- Doubleword memory: ld/sd transfer 8 bytes at a time.
- Unsigned word load: lwu zero-extends a 32-bit load to 64 bits; the existing lw sign-extends.
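The difference between the word variants and their 64-bit counterparts shows up at the 32-bit boundary. The Python sketch below models register contents as unsigned 64-bit integers; the function names mirror the mnemonics, but lw and lwu here take the loaded word directly rather than an address.

```python
MASK64 = (1 << 64) - 1


def sext32(x):
    """Sign-extend the low 32 bits of x to a 64-bit register value."""
    x &= 0xFFFFFFFF
    return (x - (1 << 32) if x & 0x80000000 else x) & MASK64


def add(a, b):
    """Full 64-bit add."""
    return (a + b) & MASK64


def addw(a, b):
    """32-bit add whose result is sign-extended to 64 bits."""
    return sext32(a + b)


def lw(word):
    """Register value after loading a 32-bit word with lw."""
    return sext32(word)


def lwu(word):
    """Register value after loading a 32-bit word with lwu."""
    return word & 0xFFFFFFFF
```

Adding 1 to 0x7FFFFFFF gives 0x80000000 with add but 0xFFFFFFFF80000000 with addw, which matches what 32-bit C int arithmetic expects when the value is later used as a 64-bit quantity.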
Extension adaptations
- RV64M: adds word multiply/divide/remainder variants: mulw, divw, divuw, remw, remuw.
- RV64A: adds doubleword variants for all 11 atomics: amoadd.d, lr.d, sc.d, etc.
- RV64F/D: adds long (64-bit integer) conversion instructions: fcvt.l.s, fcvt.lu.d, etc. RV64D also adds fmv.x.d and fmv.d.x for direct 64-bit moves between integer and FP registers.
- RV64C: drops c.jal and the single-precision FP compressed loads/stores (c.flw, c.fsw, c.flwsp, c.fswsp); adds c.ld, c.sd, c.ldsp, c.sdsp, c.addw, c.addiw, c.subw. The integer word forms c.lw and c.sw are retained.
ABIs
- lp64: The C data types long and pointers are 64 bits; int stays 32 bits; FP args use integer registers.
- lp64f, lp64d: single- or double-precision FP arguments pass through FP registers.
Code density
RV64GC is only 1% larger than RV32GC. Compared to other 64-bit ISAs:
- 23% smaller than ARM-64 (which dropped Thumb-2 compressed format entirely).
- 34% smaller than x86-64 (which burns bytes on legacy prefix encoding).
Comparison to other 64-bit ISAs
- x86-64: extended x86-32 by doubling registers and adding PC-relative data addressing, but required prefix bytes to fit new operations into an already-full opcode space — inflating average instruction length.
- ARM-64: invented a brand-new 1000+ instruction ISA rather than extending ARM-32. Gained 31 registers and a hardwired zero, but dropped Thumb-2, making ARM-64 code 25% larger than ARM Thumb-2.
- RISC-V: RV32 and RV64 were engineered simultaneously, so 64-bit instructions were never forced into a cramped 32-bit opcode space. RV64I retains virtually all RV32I instructions, keeping the compiler transition simple.
RV Privileged Architecture
Three privilege levels isolate hardware access, OS execution, and application code.
| Mode | Level | Purpose |
|---|---|---|
| Machine (M) | Highest, mandatory | Full hardware access; bootstraps and controls the system |
| Supervisor (S) | Optional | OS execution; virtual memory and multitasking |
| User (U) | Lowest | Untrusted applications; restricted CSR and memory access |

Machine mode (M-mode)
The most important feature of M-mode is the ability to intercept and handle exceptions — unusual runtime events. All exceptions are precise: instructions before the faulting instruction complete; the faulting instruction and those after do not.
- Synchronous exceptions: caused directly by instruction execution — the faulting instruction itself triggers the event (e.g. illegal opcode, misaligned access, ecall).
- Asynchronous interrupts: caused by external events independent of the instruction stream — handling and masking are uniform across all RISC-V systems, though memory maps and interrupt controller mechanisms vary per platform.
Exception-handling CSRs — seven registers collectively capture the full exception state:
| CSR | Purpose |
|---|---|
| mtvec | Handler base address; MODE=1 vectorizes async interrupts to BASE + 4×cause |
| mepc | PC saved on exception entry (sync: faulting instruction; async: resume point) |
| mcause | Exception cause; MSB=1 for interrupts, 0 for synchronous exceptions |
| mtval | Faulting address or exception-specific data (e.g. illegal instruction bits) |
| mstatus | Global interrupt enable (MIE), previous privilege (MPP), previous MIE (MPIE) |
| mie / mip | Per-source interrupt enable / pending bits; bit positions match mcause codes |
| mscratch | Scratch register; software points it to an in-memory context-save area |
Bit layout of the mie and mip CSRs — each bit position corresponds to a mcause interrupt code.
mcause encoding — MSB=1 for interrupts, 0 for synchronous exceptions; lower bits identify the specific cause. Supervisor interrupts and page-fault exceptions only exist when S-mode is implemented.
| Interrupt (MSB) | Code | Description |
|---|---|---|
| 1 | 1 | Supervisor software interrupt |
| 1 | 3 | Machine software interrupt |
| 1 | 5 | Supervisor timer interrupt |
| 1 | 7 | Machine timer interrupt |
| 1 | 9 | Supervisor external interrupt |
| 1 | 11 | Machine external interrupt |
| 0 | 0 | Instruction address misaligned |
| 0 | 1 | Instruction access fault |
| 0 | 2 | Illegal instruction |
| 0 | 3 | Breakpoint |
| 0 | 4 | Load address misaligned |
| 0 | 5 | Load access fault |
| 0 | 6 | Store address misaligned |
| 0 | 7 | Store access fault |
| 0 | 8 | Environment call from U-mode |
| 0 | 9 | Environment call from S-mode |
| 0 | 11 | Environment call from M-mode |
| 0 | 12 | Instruction page fault |
| 0 | 13 | Load page fault |
| 0 | 15 | Store page fault |
Interrupt gating — an interrupt is taken only when all three hold simultaneously: mstatus.MIE=1 (global), the per-source mie bit is set (enabled), and the corresponding mip bit is set (pending). Timer interrupt example: mstatus.MIE=1, mie[7]=1, mip[7]=1.
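The gating condition and the mcause layout can both be written as small predicates. The Python sketch below treats the CSRs as plain integers; interrupt_taken and decode_mcause are hypothetical helpers, and the gating check models only the three conditions named above for a hart running in M-mode.

```python
def interrupt_taken(mstatus_mie, mie, mip, code):
    """True when the interrupt with this cause code would be taken."""
    bit = 1 << code
    return bool(mstatus_mie and (mie & bit) and (mip & bit))


def decode_mcause(mcause, xlen=32):
    """Split mcause into (is_interrupt, cause_code)."""
    is_interrupt = bool((mcause >> (xlen - 1)) & 1)
    code = mcause & ((1 << (xlen - 1)) - 1)
    return is_interrupt, code
```

For the machine timer interrupt (code 7), the pending and enable bits both sit at bit 7, and the handler later sees mcause with the MSB set and 7 in the low bits.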
On exception — hardware atomically performs these steps:
- mepc ← PC, then PC ← mtvec.
- mcause ← exception cause; mtval ← faulting address or exception-specific data.
- MPIE ← MIE, then MIE ← 0 (disables further interrupts).
- MPP ← current privilege mode, then privilege elevated to M-mode.
Handler prologue/epilogue — on entry, swap a register (e.g. a0) with mscratch to get a pointer to scratch space, then save all registers the body will use. On exit, restore those registers, swap a0/mscratch again, then execute mret. For a preemptible handler, also save mepc/mcause/mtval/mstatus to the stack before re-enabling interrupts — a nested exception would overwrite them.
On mret — reverses exception entry:
- PC ← mepc.
- MIE ← MPIE (restores interrupt enable).
- Privilege mode ← MPP.
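The entry and return sequences are mirror images, which a small state model makes easy to see. The Python sketch below is a simplification: the Hart class and its field names are invented for the example, mtvec is assumed to be in direct (non-vectored) mode, and only the steps listed above are modeled.

```python
from dataclasses import dataclass

U_MODE, S_MODE, M_MODE = 0, 1, 3   # privilege level encodings


@dataclass
class Hart:
    pc: int = 0
    priv: int = U_MODE
    mtvec: int = 0
    mepc: int = 0
    mcause: int = 0
    mtval: int = 0
    mie: bool = True     # mstatus.MIE
    mpie: bool = False   # mstatus.MPIE
    mpp: int = U_MODE    # mstatus.MPP


def take_trap(h, cause, tval=0):
    """Exception entry into M-mode."""
    h.mepc, h.pc = h.pc, h.mtvec            # save PC, jump to handler
    h.mcause, h.mtval = cause, tval
    h.mpie, h.mie = h.mie, False            # stash MIE, disable interrupts
    h.mpp, h.priv = h.priv, M_MODE          # stash privilege, elevate


def mret(h):
    """Return from the handler, undoing take_trap."""
    h.pc = h.mepc
    h.mie = h.mpie
    h.priv = h.mpp
```

Because the previous interrupt-enable and privilege values are stashed in single fields, a second trap taken inside the handler would overwrite them, which is why a preemptible handler saves these CSRs first.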
wfi — informs the processor there is no useful work; it enters a low-power state until (mie & mip) ≠ 0. Typically used inside a loop. If MIE=0, a pending interrupt causes execution to resume at the next instruction rather than jump to mtvec.
User mode (U-mode) and physical memory protection
M-mode is sufficient for simple embedded systems where the entire codebase is trusted, but most systems cannot trust all application code. U-mode restricts untrusted code from executing privileged instructions (e.g. mret) or accessing privileged CSRs (e.g. mstatus) — any such attempt raises an illegal instruction exception. M-mode enters U-mode by setting mstatus.MPP=0 then executing mret; any exception in U-mode returns control to M-mode.
Physical Memory Protection (PMP) restricts which memory addresses U-mode can access. On each U-mode fetch, load, or store, the address is compared against all PMP address registers (pmpaddr0–pmpaddrN); the matching entry’s configuration register decides whether the access proceeds or raises an access exception.
- Address registers (pmpaddr0–pmpaddrN): stored shifted right by 2 bits (4-byte granularity).
- Configuration registers (pmpcfg): densely packed to accelerate context switching.
  - R / W / X: permit loads, stores, and instruction fetches respectively.
  - A: address-matching mode; 0 disables this PMP entry, a nonzero value enables it.
  - L: locks the entry until the next reset.
Supervisor mode (S-mode) and delegation
By default, all exceptions regardless of privilege mode transfer control to the M-mode handler. To avoid M-mode intercepting every OS-bound exception, M-mode can delegate classes of events directly to S-mode via:
- mideleg: routes specific async interrupts directly to S-mode.
- medeleg: routes specific sync exceptions directly to S-mode.
- Exceptions never downgrade privilege: an M-mode exception always resolves in M-mode.
S-mode has its own exception-handling CSR subset — sepc, stvec, scause, sscratch, stval, sstatus, sie/sip — each performing the same role as its M-mode counterpart. sret behaves identically to mret but operates on S-mode CSRs.
On delegated exception — hardware atomically:
- sepc ← PC, then PC ← stvec.
- scause ← exception cause; stval ← faulting address or exception-specific data.
- SPIE ← SIE, then SIE ← 0 (disables S-mode interrupts).
- SPP ← current privilege mode, then privilege elevated to S-mode.
Page-based virtual memory
When paging is enabled, most addresses (load/store effective addresses and the PC) are virtual and must be translated to physical addresses. Accessing an unmapped page or one with insufficient permissions raises a page fault exception.
Pages — memory is divided into fixed-size 4 KiB base pages (the fundamental unit). This size has been standard for five decades. Larger alignments called megapages and gigapages also exist and map entire subtrees in one PTE.
Page table — a tree structure in memory that maps virtual page numbers to physical page numbers. Each node in the tree is itself exactly 4 KiB — the same as a base page — which simplifies OS memory allocation. A leaf node (PTE with R/W/X ≠ 0) holds a physical page number; a non-leaf node (R/W/X = 0) holds a pointer to the next level.
Page table entry (PTE) fields:

| Field | Meaning |
|---|---|
| V | Valid; if 0 any traversal through this PTE faults |
| R / W / X | Read / write / execute permissions; all-zero = pointer to next level (non-leaf) |
| U | If 0: U-mode cannot access, S-mode can. If 1: U-mode can, S-mode cannot |
| G | Global — mapping exists in all address spaces; used for OS pages |
| A / D | Accessed / Dirty — set by hardware; OS uses them to approximate LRU and decide which pages to swap |
| RSW | Reserved for OS; hardware ignores it |
| PPN | Leaf: physical page number of target. Non-leaf: physical page number of the next-level page table (its physical address divided by 4 KiB) |
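
The fields in the table can be unpacked with simple shifts and masks. The sketch below assumes the Sv39 PTE layout from the privileged specification (flag bits in the low byte, RSW in bits 9:8, PPN in bits 53:10); the helper names are hypothetical.

```python
# Sv39 PTE flag bits (bit positions assumed from the privileged specification).
PTE_V = 1 << 0
PTE_R = 1 << 1
PTE_W = 1 << 2
PTE_X = 1 << 3
PTE_U = 1 << 4
PTE_G = 1 << 5
PTE_A = 1 << 6
PTE_D = 1 << 7
PTE_RSW_SHIFT = 8          # two bits reserved for the OS
PTE_PPN_SHIFT = 10
PTE_PPN_MASK = (1 << 44) - 1


def make_pte(ppn: int, flags: int) -> int:
    """Pack a physical page number and flag bits into a 64-bit PTE."""
    return ((ppn & PTE_PPN_MASK) << PTE_PPN_SHIFT) | flags


def pte_ppn(pte: int) -> int:
    """Extract the physical page number."""
    return (pte >> PTE_PPN_SHIFT) & PTE_PPN_MASK


def is_valid(pte: int) -> bool:
    return bool(pte & PTE_V)


def is_leaf(pte: int) -> bool:
    """A PTE is a leaf when any of R, W, X is set; otherwise it points to the next level."""
    return bool(pte & (PTE_R | PTE_W | PTE_X))
```

A user-readable, writable data page at physical page `0x80000` would be `make_pte(0x80000, PTE_V | PTE_R | PTE_W | PTE_U)`; a pointer to a next-level table sets only `PTE_V`.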
satp CSR — enables and configures the paging system. M-mode writes zero to satp before first entering S-mode (paging off); S-mode writes it again after building the page tables.
- MODE: selects the scheme (Bare = off, Sv32, Sv39, Sv48).
- ASID: optional per-process tag on TLB entries; reduces flush overhead on context switch.
- PPN: physical address of the root page table divided by 4 KiB.
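As an illustration of the three fields, the sketch below packs an RV64 satp value. The field positions (MODE in bits 63:60, ASID in bits 59:44, PPN in bits 43:0) and the mode encodings (8 for Sv39, 9 for Sv48) are assumptions taken from the privileged specification; `make_satp` and `split_satp` are hypothetical helpers.

```python
SATP_MODE_BARE = 0   # paging off
SATP_MODE_SV39 = 8
SATP_MODE_SV48 = 9

PAGE_SIZE = 4096


def make_satp(mode: int, asid: int, root_table_addr: int) -> int:
    """Pack an RV64 satp value from a mode, an ASID, and the root table's physical address."""
    if root_table_addr % PAGE_SIZE != 0:
        raise ValueError("root page table must be 4 KiB aligned")
    ppn = root_table_addr // PAGE_SIZE
    return (mode << 60) | ((asid & 0xFFFF) << 44) | (ppn & ((1 << 44) - 1))


def split_satp(satp: int) -> tuple:
    """Return (mode, asid, root table physical address)."""
    mode = (satp >> 60) & 0xF
    asid = (satp >> 44) & 0xFFFF
    ppn = satp & ((1 << 44) - 1)
    return mode, asid, ppn * PAGE_SIZE
```

Writing `make_satp(SATP_MODE_BARE, 0, 0)`, which is zero, corresponds to the "paging off" state that M-mode sets up before entering S-mode.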
Addressing schemes: named SvX, where X is the virtual address width in bits.

| Scheme | ISA | VA bits | PA bits | Tree depth | Page sizes |
|---|---|---|---|---|---|
| Sv32 | RV32 | 32 | 34 | 2 (radix 2¹⁰) | 4 KiB, 4 MiB |
| Sv39 | RV64 | 39 | 56 | 3 (radix 2⁹) | 4 KiB, 2 MiB, 1 GiB |
| Sv48 | RV64 | 48 | 56 | 4 (radix 2⁹) | 4 KiB, 2 MiB, 1 GiB, 512 GiB |
Sv32 uses 4-byte PTEs; Sv39/48 use 8-byte PTEs to hold wider physical addresses. The tree radix drops from 2¹⁰ to 2⁹ to preserve the invariant that one page table fits in exactly one page.
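The numbers in the table follow from the one-table-per-page invariant, and a short calculation makes that explicit. The sketch below derives entries per table, index width, virtual address width, and page sizes from just the PTE size and tree depth; the `geometry` helper and `SCHEMES` table are illustrative names.

```python
PAGE_SIZE = 4096
PAGE_OFFSET_BITS = 12

# scheme name -> (PTE size in bytes, tree depth)
SCHEMES = {"Sv32": (4, 2), "Sv39": (8, 3), "Sv48": (8, 4)}


def geometry(scheme: str) -> dict:
    """Derive the paging geometry of a scheme from PTE size and tree depth."""
    pte_bytes, levels = SCHEMES[scheme]
    entries = PAGE_SIZE // pte_bytes          # one table fills exactly one page
    index_bits = entries.bit_length() - 1     # log2(entries)
    va_bits = PAGE_OFFSET_BITS + levels * index_bits
    # A leaf at level i maps PAGE_SIZE * entries**i bytes.
    page_sizes = [PAGE_SIZE << (index_bits * level) for level in range(levels)]
    return {
        "entries": entries,
        "index_bits": index_bits,
        "va_bits": va_bits,
        "page_sizes": page_sizes,
    }
```

For Sv39 this gives 512 entries, 9 index bits, and 12 + 3 × 9 = 39 virtual address bits.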
Sv39 unused bits — virtual addresses are 39 bits stored in 64-bit registers; bits 63–39 must replicate bit 38. Valid ranges: 0x0000_0000_0000_0000–0x0000_003f_ffff_ffff and 0xffff_ffc0_0000_0000–0xffff_ffff_ffff_ffff. Violations fault. The gap is intentional — future ISA versions can reclaim those bits to extend the address space without breaking compatibility.
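The replication rule amounts to checking that bits 63 through 38 are all equal. A minimal Python sketch, with hypothetical helper names:

```python
def is_canonical_sv39(va: int) -> bool:
    """True when bits 63..39 of a 64-bit address all replicate bit 38."""
    va &= (1 << 64) - 1
    upper = va >> 38                    # bits 63..38, 26 bits in total
    return upper == 0 or upper == (1 << 26) - 1


def sign_extend_sv39(va39: int) -> int:
    """Build the canonical 64-bit form of a 39-bit virtual address."""
    va39 &= (1 << 39) - 1
    if va39 & (1 << 38):
        return va39 | (((1 << 25) - 1) << 39)   # set bits 63..39
    return va39
```

The boundary addresses quoted above pass the check, while the first address past the low range, `0x0000_0040_0000_0000`, fails it.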
Address translation (Sv39):
1. satp.PPN × 4096 + VA[38:30] × 8 → fetch level-2 PTE.
2. PTE.PPN × 4096 + VA[29:21] × 8 → fetch level-1 PTE.
3. PTE.PPN × 4096 + VA[20:12] × 8 → fetch leaf PTE.
4. Physical address = LeafPTE.PPN[2:0] × 4096 + VA[11:0].
5. Processor performs the original load/store to that physical address.
For a normal 4 KiB leaf (found at level 0), all three PPN sub-fields are concatenated with the page offset. For superpages, the walk ends early and the lower PPN sub-fields are replaced by the matching VPN sub-fields: a gigapage (1 GiB) leaf found at level 2 uses only PPN[2] from the PTE; a megapage (2 MiB) leaf found at level 1 uses PPN[2]:PPN[1]. Each page table holds exactly 512 entries (4096 / 8), so each VPN sub-field is 9 bits (2⁹ = 512). Sv32 uses the same logic with 4-byte PTEs and 1024-entry tables (VPN sub-fields are 10 bits).
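The walk, including the superpage case, can be simulated in Python. This is a sketch under stated assumptions, not a full model: physical memory is a hypothetical dict from PTE address to PTE value, permission checks and A/D updates are omitted, and the misaligned-superpage fault follows the privileged specification rather than the text above.

```python
PAGE_SHIFT = 12
PTE_SIZE = 8
VPN_BITS = 9
PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3
PPN_MASK = (1 << 44) - 1


class PageFault(Exception):
    """Raised when the walk reaches an invalid or malformed PTE."""


def vpn(va: int, level: int) -> int:
    """Extract the 9-bit VPN sub-field for a level (2 is the root level)."""
    return (va >> (PAGE_SHIFT + VPN_BITS * level)) & ((1 << VPN_BITS) - 1)


def translate(memory: dict, root_ppn: int, va: int) -> int:
    """Walk a three-level Sv39 tree and return the physical address for va."""
    table = root_ppn << PAGE_SHIFT
    for level in (2, 1, 0):
        pte = memory.get(table + vpn(va, level) * PTE_SIZE, 0)
        if not pte & PTE_V:
            raise PageFault(f"invalid PTE at level {level} for {va:#x}")
        ppn = (pte >> 10) & PPN_MASK
        if pte & (PTE_R | PTE_W | PTE_X):
            # Leaf: the low PPN sub-fields of a superpage must be zero,
            # and are replaced by the matching VPN sub-fields of the VA.
            if ppn & ((1 << (VPN_BITS * level)) - 1):
                raise PageFault("misaligned superpage")
            offset_mask = (1 << (PAGE_SHIFT + VPN_BITS * level)) - 1
            return (ppn << PAGE_SHIFT) | (va & offset_mask)
        table = ppn << PAGE_SHIFT   # non-leaf: descend one level
    raise PageFault("no leaf PTE found at level 0")
```

To use it, populate `memory` with a root table entry pointing to a level-1 table, and so on down to a leaf; a leaf stored at level 1 behaves as a 2 MiB megapage, keeping VA[20:0] as the offset.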
TLB and sfence.vma: walking the page table on every memory access would be prohibitively slow, since each load, store, or fetch would need up to three extra memory accesses under Sv39. Processors therefore cache recent translations in a translation lookaside buffer (TLB). The TLB is not automatically kept coherent with the page table; when S-mode modifies page tables it must execute sfence.vma to flush stale entries. Optional arguments narrow the flush: rs1 scopes it to one virtual address, rs2 scopes it to one ASID; passing x0 for both flushes the entire TLB.
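The scoping rules for the flush can be illustrated with a toy TLB in Python. The `TLB` class is a hypothetical model: entries are keyed by (ASID, virtual page number), `None` stands in for passing x0, and global mappings are ignored for simplicity.

```python
PAGE_SHIFT = 12


class TLB:
    """Toy translation cache keyed by (ASID, virtual page number)."""

    def __init__(self):
        self.entries = {}

    def insert(self, asid: int, va: int, ppn: int) -> None:
        self.entries[(asid, va >> PAGE_SHIFT)] = ppn

    def lookup(self, asid: int, va: int):
        """Return the cached physical page number, or None on a miss."""
        return self.entries.get((asid, va >> PAGE_SHIFT))

    def sfence_vma(self, va=None, asid=None) -> None:
        """Drop entries matching the given address and/or ASID; None matches all."""
        page = None if va is None else va >> PAGE_SHIFT

        def matches(key):
            entry_asid, entry_page = key
            return ((asid is None or entry_asid == asid)
                    and (page is None or entry_page == page))

        self.entries = {k: v for k, v in self.entries.items() if not matches(k)}
```

Calling `sfence_vma(va=0x1000, asid=1)` drops only that one translation, whereas `sfence_vma()` empties the cache, which is why an OS that changes a single mapping prefers the scoped form.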
Future RISC-V Optional Extensions
All extensions are optional and modular. The RISC-V Foundation ratifies them only after public debate and at least one implementation, keeping the rate of change deliberately slow.
B — Bit Manipulation
Hardware instructions for bit-field insert/extract/test, rotations, funnel shifts, bit/byte permutations, and counts (leading zeros, trailing zeros, set bits).
E — Embedded
Reduces the integer register file from 32 to 16 (x0–x15) to cut die area on cost-constrained cores. Paired with RV32I as RV32E.
H — Hypervisor
Adds a hypervisor privilege level with a second stage of page-based address translation, enabling efficient concurrent execution of multiple OSes on one hart.
J — Dynamically Translated Languages
ISA support for JIT-compiled languages (Java, JavaScript): hardware support for dynamic runtime checks and accelerated garbage collection barriers.
L — Decimal Floating-Point
IEEE 754-2008 decimal FP arithmetic. Eliminates the binary approximation error for decimal fractions (e.g., 0.1) by matching the computation radix to the I/O radix.
N — User-Level Interrupts
Routes U-mode interrupts and exceptions directly to a user-level trap handler, bypassing M/S-mode. Primary use: secure embedded systems (M+U only). In Unix environments, a building block for user-level events such as GC barriers, integer overflow, and FP traps.
P — Packed-SIMD
Subdivides existing registers for data-parallel computation on narrow types, reusing wide datapaths. A lightweight alternative to V; prefer V when dedicated hardware resources are available.
Q — Quad-Precision Floating-Point
128-bit quad-precision binary FP compliant with IEEE 754-2008. FP registers extend to hold single, double, or quad-precision values. Requires RV64IFD.