Why RISC-V?
RISC-V is an open, modular ISA designed to run across the full computing spectrum — from embedded microcontrollers to supercomputers — without being owned or discontinued by any single company.
Modular vs. incremental design
Conventional architectures grow incrementally: every new processor must implement all past extensions to preserve binary compatibility. x86 expanded from 80 instructions to over 3,600, growing at roughly three per month.
RISC-V breaks this pattern with a strictly modular approach:
- A frozen, minimal base ISA (RV32I) that can run a complete software stack.
- Optional standard extensions for specific needs — hardware only includes what it requires.
- If software invokes an omitted extension, the hardware traps and a software library handles it.
Design metrics
Seven measures govern RISC-V architectural decisions:
- Cost: Chip cost rises non-linearly with die area (roughly with its square), so smaller dies improve both cost and yield. A RISC-V core requires roughly half the die area of an equivalent ARM-32 core.
- Simplicity: Complex instructions are often ignored by compilers anyway. Simpler ISAs reduce design and verification cost.
- Performance: A simpler ISA may need more instructions per program but enables faster clocks and lower CPI.
- Implementation isolation: ISA features optimized for one microarchitecture generation must not penalize future ones. Examples of past mistakes: delayed branches (helped 5-stage pipelines, hurt out-of-order cores) and load-multiple (good for single-issue, bad for multi-issue scheduling).
- Room for growth: Opcode space must be reserved for future custom accelerators. Exhausting it forces workarounds like separate 16-bit ISAs toggled via address bits.
- Program size: Smaller binaries reduce instruction cache misses and DRAM power. Combining 32-bit and 16-bit compressed instructions beats variable-length encodings burdened by legacy prefixes.
- Ease of programming and compiling: 32 integer registers simplify register allocation versus 8 or 16. Native PC-relative addressing supports position-independent code and dynamic linking.
Standard extensions
| Extension | Name | Purpose |
|---|---|---|
| M | Multiply/Divide | Integer multiply and divide; omitted on minimal embedded chips |
| F | Single-precision FP | IEEE 754 single-precision floating point |
| D | Double-precision FP | IEEE 754 double-precision floating point |
| A | Atomic | Load-reserved/store-conditional and AMOs for multiprocessor sync |
| C | Compressed | 16-bit encodings of common 32-bit instructions; ~400 gates of decoder overhead |
| V | Vector | Dynamic vector length and type per register, replacing fixed SIMD |
| RV64 | 64-bit | Widens registers and adds doubleword variants; preserves RV32 structure |
| Privileged | System | Machine/Supervisor/User modes, hardware paging, OS execution |
RV32G (or RV64G) denotes the combined IMAFD base — the standard general-purpose configuration.
Stability
The complete RISC-V specification is ~236 pages. Equivalent incremental architectures require 2,100–2,700 pages. A frozen base ISA paired with openly debated optional extensions keeps the compiler and OS targets stable indefinitely.
RV32I
RV32I is the frozen base integer ISA — 32-bit registers, 32-bit fixed-width instructions, enough to run a complete software stack.

Registers
- 32 general-purpose 32-bit registers (x0–x31). x0 is hardwired to zero, eliminating the need for dedicated zero-state or unary instructions. Moves and negations are synthesized using standard instructions with x0 as a source.
- The PC is separate from the register file. Keeping it out of the general registers prevents arbitrary arithmetic from causing control-flow side effects, which stabilizes branch prediction.
Instruction formats
Six fixed 32-bit formats cover all instruction classes:
- R-type: Register-to-register operations (two sources, one destination).
- I-type: Short immediates and loads.
- S-type: Stores.
- B-type: Conditional branches — a rotated variant of S-type.
- U-type: Long immediates (20-bit upper).
- J-type: Unconditional jumps — a rotated variant of U-type.

Key encoding choices:
- Fixed register specifiers: rs1, rs2, and rd sit in identical bit positions across all formats, so register reads begin before decoding completes.
- Sign bit always at bit 31: Immediate sign extension runs in parallel with decode.
- Rotated immediates: Immediate bits are scattered to minimize signal fanout and hardware multiplexing cost — the B and J scrambling looks odd but reduces wiring.
- Trap patterns: All-zeros and all-ones are illegal instructions, catching out-of-bounds jumps and unprogrammed memory.
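As a rough illustration of the fixed field positions, the sketch below packs an I-type instruction and pulls the register and sign fields back out with plain shifts and masks. The helper names `encode_i_type` and `decode_fields` are made up for this example; the opcode value 0x13 with funct3 0 is the standard encoding of addi.

```python
def encode_i_type(opcode, rd, funct3, rs1, imm):
    """Pack an I-type word: imm[11:0] | rs1 | funct3 | rd | opcode."""
    assert -2048 <= imm <= 2047, "I-type immediates are 12-bit signed"
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def decode_fields(word):
    """Extract the fields that sit at the same bit positions in every format.

    Not every format uses every field (an I-type word has no rs2), but the
    hardware can read the register file speculatively from these positions.
    """
    return {
        "opcode": word & 0x7F,
        "rd": (word >> 7) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "rs2": (word >> 20) & 0x1F,
        "sign": word >> 31,  # bit 31 is always the immediate's sign bit
    }


# addi x5, x6, -1
word = encode_i_type(0x13, rd=5, funct3=0, rs1=6, imm=-1)
fields = decode_fields(word)
```

Here `word` comes out as 0xFFF30293, and the sign bit is available without knowing which format the instruction uses.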
Integer computation
- Arithmetic: add, sub; addi (there is no subi; use a negative immediate).
- Logical: and, or, xor and their immediate forms.
- Shifts: sll, srl, sra (logical/arithmetic) with register or immediate shift amount.
- Comparison: slt, sltu, slti, sltiu write 1 to rd if true, 0 otherwise.
- Upper immediates: lui loads a 20-bit constant into the upper 20 bits; auipc adds it to the PC. Combining either with a 12-bit immediate instruction synthesizes any 32-bit constant or PC-relative address in two instructions.
Multiply, divide, and overflow detection are excluded to keep the minimal hardware footprint small. Overflow is handled in software; multiply/divide live in the M extension.
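The two-instruction constant synthesis has one subtlety: addi sign-extends its 12-bit immediate, so the upper part must be bumped by one whenever the low 12 bits have their top bit set. A minimal sketch of the split an assembler might perform follows; `split_constant` and `lui_addi` are hypothetical names for this example.

```python
def split_constant(value):
    """Split a 32-bit constant into (hi20, lo12) for a lui + addi pair."""
    value &= 0xFFFFFFFF
    lo12 = value & 0xFFF
    if lo12 >= 0x800:
        # addi will sign-extend this field, so treat it as negative
        lo12 -= 0x1000
    # Compensate in the upper part; wrap to 20 bits
    hi20 = ((value - lo12) >> 12) & 0xFFFFF
    return hi20, lo12


def lui_addi(hi20, lo12):
    """Model the register value after `lui rd, hi20` then `addi rd, rd, lo12`."""
    return ((hi20 << 12) + lo12) & 0xFFFFFFFF


hi, lo = split_constant(0xDEADBEEF)  # hi == 0xDEADC, lo == -0x111
```

Round-tripping through `lui_addi` returns the original constant for any 32-bit value.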
Loads and stores
Single addressing mode: base register + sign-extended 12-bit immediate.
- lw/sw: 32-bit word.
- lh/sh: 16-bit halfword; loads sign-extend to 32 bits.
- lb/sb: 8-bit byte; loads sign-extend to 32 bits.
- lhu/lbu: zero-extending unsigned variants.
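Memory is little-endian. Misaligned accesses are permitted by the ISA, although an implementation may handle them slowly or by trapping to software. There are no push/pop instructions: stack operations are just sw/lw with the stack pointer register and displacement addressing.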
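The difference between the sign-extending and zero-extending loads can be sketched with a tiny byte-array model. The `load` helper and the sample memory contents are assumptions for this example only.

```python
def load(memory, addr, size, signed):
    """Read `size` bytes little-endian and extend the result to 32 bits."""
    raw = int.from_bytes(memory[addr:addr + size], "little", signed=signed)
    return raw & 0xFFFFFFFF  # register image as an unsigned 32-bit value


memory = bytes([0x80, 0xFF, 0x34, 0x12])

lb = load(memory, 0, 1, signed=True)    # sign-extends the byte 0x80
lbu = load(memory, 0, 1, signed=False)  # zero-extends the same byte
lh = load(memory, 0, 2, signed=True)    # halfword 0xFF80, sign-extended
lhu = load(memory, 0, 2, signed=False)  # halfword 0xFF80, zero-extended
lw = load(memory, 0, 4, signed=False)   # full word, lowest address is the LSB
```

With this memory image, lb and lh both yield 0xFFFFFF80, lbu yields 0x00000080, lhu yields 0x0000FF80, and lw yields 0x1234FF80.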
Conditional branches
Branches compare two registers directly — no condition codes. Condition codes create implicit dependencies that stall out-of-order pipelines.
- beq, bne: equality.
- blt, bge: signed magnitude.
- bltu, bgeu: unsigned magnitude.
Inverse comparisons use swapped operands (for example, bgt rs1, rs2 is assembled as blt rs2, rs1). The 12-bit immediate is multiplied by 2, sign-extended, and added to the PC. There are no delayed branches; they were left out to avoid binding the ISA to any particular pipeline depth.
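To make the branch-offset arithmetic concrete, the sketch below reassembles the scattered B-type immediate and computes the target address. `decode_b_imm` and `branch_target` are illustrative names; the bit positions follow the standard B-type layout (imm[12] in bit 31, imm[10:5] in bits 30:25, imm[4:1] in bits 11:8, imm[11] in bit 7).

```python
def decode_b_imm(word):
    """Reassemble the B-type byte offset from its scattered bit fields."""
    imm = (
        (((word >> 31) & 0x1) << 12)     # imm[12], the sign bit
        | (((word >> 7) & 0x1) << 11)    # imm[11]
        | (((word >> 25) & 0x3F) << 5)   # imm[10:5]
        | (((word >> 8) & 0xF) << 1)     # imm[4:1]; bit 0 is implicitly 0
    )
    # Sign-extend from 13 bits
    return imm - 0x2000 if imm & 0x1000 else imm


def branch_target(pc, word):
    """Target of a taken branch located at address `pc`."""
    return (pc + decode_b_imm(word)) & 0xFFFFFFFF


back = decode_b_imm(0xFE000EE3)   # beq x0, x0, -4
fwd = decode_b_imm(0x00000463)    # beq x0, x0, +8
```

The implicit zero in bit 0 is the "multiplied by 2" step: the 12 stored bits cover offsets of ±4 KiB in 2-byte units.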
Unconditional jumps
- jal: PC-relative jump using a 20-bit immediate × 2; saves PC+4 to rd (return address).
- jalr: Register-indirect jump using base + 12-bit immediate; saves PC+4 to rd.
Setting rd = x0 discards the link, giving a plain jump or subroutine return.
System instructions
- CSR instructions (csrrw, csrrs, csrrc + immediate variants): read/write hardware counters such as the cycle timer, wall-clock time, and instruction retirement count.
- ecall: Request a service from the OS or execution environment.
- ebreak: Transfer control to the debugger.
- fence: Order I/O and memory accesses visible to other threads or devices.
- fence.i: Flush the instruction pipeline so recent stores are visible to instruction fetch.
The ABI and OS conventions give these instructions their meaning — ecall alone does nothing without a defined calling convention.
Encoding reference

RISC-V Assembly
Calling convention
Function execution follows a fixed lifecycle: place arguments, jump (jal), acquire local storage and save registers, execute, place result and release storage, return (ret).
Registers are partitioned by preservation guarantee:
- Temporaries (not preserved across a call): arguments and return values (a0–a7), temporaries (t0–t6), return address (ra).
- Saved registers (callee must preserve): s0–s11, stack pointer (sp).
- Hardwired zero: x0 always reads as 0.
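The register partition can be written down as a lookup table. The mapping from ABI names to x-register numbers below follows the standard RISC-V calling convention rather than anything stated in these notes, and `preserved_across_call` is a hypothetical helper.

```python
# ABI name for each architectural register x0..x31
ABI_NAMES = (
    ["zero", "ra", "sp", "gp", "tp"]       # x0-x4
    + ["t0", "t1", "t2"]                   # x5-x7
    + ["s0", "s1"]                         # x8-x9
    + [f"a{i}" for i in range(8)]          # x10-x17
    + [f"s{i}" for i in range(2, 12)]      # x18-x27
    + [f"t{i}" for i in range(3, 7)]       # x28-x31
)


def preserved_across_call(name):
    """True for registers the callee must restore: sp and s0-s11."""
    return name == "sp" or (name[0] == "s" and name[1:].isdigit())


callee_saved = [n for n in ABI_NAMES if preserved_across_call(n)]
```

Everything else that a function may write (a0–a7, t0–t6, ra) has to be saved by the caller if its value is still needed after the call.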

Stack frame:
- Prologue: addi sp, sp, -framesize to allocate; save registers to stack (e.g., sw ra, framesize-4(sp)).
- Epilogue: restore registers, addi sp, sp, framesize, then ret.
RV32E: embedded variant that cuts the register file to 16 (x0–x15) to reduce die area.
Assembler directives and pseudoinstructions
Directives control data placement and code structure:
| Directive | Effect |
|---|---|
| .text | Subsequent items go into the code section |
| .data | Subsequent items go into initialized data |
| .bss | Subsequent items go into zero-initialized data |
| .section .foo | Subsequent items go into section .foo |
| .align n | Align next datum to a 2^n-byte boundary |
| .balign n | Align next datum to an exact n-byte boundary |
| .globl sym | Export sym as globally visible |
| .string "str" | Store null-terminated string |
| .byte b1,... | Store 8-bit values |
| .half w1,... | Store 16-bit halfwords |
| .word w1,... | Store 32-bit words |
| .dword w1,... | Store 64-bit doublewords |
| .float f1,... | Store single-precision FP values |
| .double d1,... | Store double-precision FP values |
| .option rvc (norvc) | Enable/disable compressed instruction emission |
| .option pic (nopic) | Enable/disable position-independent code |
| .option relax (norelax) | Enable/disable linker relaxation |
| .option push (pop) | Save/restore current option state |
Pseudoinstructions map to one or more real instructions:
- nop → addi x0, x0, 0
- ret → jalr x0, x1, 0
- mv rd, rs → addi rd, rs, 0
- beqz rs, offset → beq rs, x0, offset
- li rd, imm → lui + addi sequence for arbitrary 32-bit constants
- la rd, symbol → auipc + offset for PC-relative symbol addresses
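The single-instruction mappings are simple enough to sketch as a text rewrite. This is only a toy expander under the assumption that operands are comma-separated; `expand` is a made-up name, and a real assembler handles many more pseudoinstructions, including the multi-instruction li and la.

```python
def expand(pseudo):
    """Expand a few one-to-one pseudoinstructions into base RV32I text."""
    op, _, rest = pseudo.partition(" ")
    args = [a.strip() for a in rest.split(",")] if rest else []
    if op == "nop":
        return "addi x0, x0, 0"
    if op == "ret":
        return "jalr x0, x1, 0"
    if op == "mv":
        return f"addi {args[0]}, {args[1]}, 0"
    if op == "beqz":
        return f"beq {args[0]}, x0, {args[1]}"
    return pseudo  # not a pseudoinstruction this sketch knows about


program = ["mv a0, a1", "beqz t0, done", "ret"]
expanded = [expand(line) for line in program]
```

All four expansions rely on x0: a move is an add of zero, and a compare-with-zero branch just names x0 as the second operand.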
Memory layout
High addresses
┌─────────────┐
│ Stack │ grows downward
├─────────────┤
│ ↓ │
│ ↑ │
├─────────────┤
│ Heap │ grows upward (dynamic allocation)
├─────────────┤
│ Static data │ globals, constants
├─────────────┤
│ Text │ machine instructions (starts at 0x00010000)
└─────────────┘
Low addresses
Position-independent code (PIC): uses PC-relative addressing (auipc, jalr) so the binary runs correctly regardless of where it is loaded in memory.
ABIs:
- ilp32: The C language data types int, long, and pointers are 32 bits; FP arguments pass through integer registers.
- ilp32f, ilp32d: single- or double-precision FP arguments pass through dedicated FP registers.
Linking
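Linker relaxation: once final addresses are known, the linker shortens two-instruction sequences where it can. A two-instruction call (auipc + jalr) becomes a single jal when the target is within the ±1 MiB reach of jal, and a two-instruction data access becomes a single instruction when the address lies within ±2 KiB of the global pointer (gp) or thread pointer (tp).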
Static linking: all library code is copied into the executable. Wastes memory when multiple programs share the same library, and ties the binary to a fixed library version.
Dynamic linking: libraries are mapped into memory at the moment of first call.
- First call hits a 3-instruction stub that invokes the dynamic linker, which maps the function and patches the symbol table pointer.
- Subsequent calls jump directly through the updated pointer.
- The library exists once in system memory regardless of how many processes use it.
Loader: the OS injects the binary into memory, starts the dynamic linker for any unresolved dependencies, and transfers control to the entry point.
RV32M
The M extension adds integer multiply and divide to the base ISA. It is optional — embedded chips that never need it can omit it entirely, with software fallback via trap.

Multiplication
Two 32-bit operands produce a 64-bit product. Rather than write to two destination registers at once, the result is retrieved in two separate instructions:
- mul: lower 32 bits of the product (signed or unsigned; the bits are the same either way).
- mulh: upper 32 bits, both operands signed.
- mulhu: upper 32 bits, both operands unsigned.
- mulhsu: upper 32 bits, one signed and one unsigned; used as a substep in multi-word signed multiplication.
Overflow detection:
- Unsigned: overflow absent if mulhu result is zero.
- Signed: overflow absent if all bits of mulh match the sign bit of mul (0 for positive, 0xFFFFFFFF for negative).
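The multiply family and both overflow checks can be modeled with Python's unbounded integers. Register values are represented as unsigned 32-bit integers; the function names mirror the instructions, while `to_signed`, `signed_overflow`, and `unsigned_overflow` are helper names invented for this sketch.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Reinterpret a 32-bit register image as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mul(a, b):
    return (a * b) & MASK32                       # low half, sign-agnostic


def mulh(a, b):
    return ((to_signed(a) * to_signed(b)) >> 32) & MASK32


def mulhu(a, b):
    return (((a & MASK32) * (b & MASK32)) >> 32) & MASK32


def mulhsu(a, b):
    return ((to_signed(a) * (b & MASK32)) >> 32) & MASK32  # signed x unsigned


def unsigned_overflow(a, b):
    return mulhu(a, b) != 0


def signed_overflow(a, b):
    lo, hi = mul(a, b), mulh(a, b)
    # The high word must be a pure sign extension of the low word
    return hi != (MASK32 if lo & 0x80000000 else 0)
```

For example, multiplying 0x40000000 by 2 fits as an unsigned product but overflows as a signed one, because the low word's sign bit is set while the high word is zero.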
Division and remainder
- div/divu: signed and unsigned quotient.
- rem/remu: signed and unsigned remainder.
No hardware trap on divide-by-zero. Software handles it with a beqz check on the divisor before the division instruction.
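Because nothing traps, the instructions must still produce some value for a zero divisor. The sketch below models signed div and rem using the results the RISC-V unprivileged specification defines for those cases (quotient of all ones and remainder equal to the dividend); that detail comes from the specification, not from these notes, and the helper names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Reinterpret a 32-bit register image as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def div(a, b):
    """Signed quotient, rounding toward zero."""
    a, b = to_signed(a), to_signed(b)
    if b == 0:
        return MASK32                  # all ones (-1), no trap
    if a == -(1 << 31) and b == -1:
        return 0x80000000              # overflow wraps back to the dividend
    q = abs(a) // abs(b)
    return (q if (a < 0) == (b < 0) else -q) & MASK32


def rem(a, b):
    """Signed remainder; the sign follows the dividend."""
    a, b = to_signed(a), to_signed(b)
    if b == 0:
        return a & MASK32              # dividend is returned unchanged
    r = abs(a) % abs(b)
    return (-r if a < 0 else r) & MASK32
```

A guard such as beqz on the divisor is therefore only needed when the program wants to treat division by zero as an error.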
Design notes
- Results go directly into general-purpose registers, with no dedicated HI/LO registers like MIPS-32. Dedicated registers add architectural state, slow context switches, and require extra move instructions.
- Compilers optimize constant division: powers of 2 use shifts (srl for unsigned operands); other constants use multiplication by an approximate reciprocal plus correction on the upper half.
- ARM-32 had no hardware divide at all until 2005.
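As one worked case of the reciprocal trick, unsigned division by 3 can be done with a mulhu and a one-bit shift. The constant 0xAAAAAAAB is ceil(2^33 / 3); the names `MAGIC_DIV3` and `udiv3` are chosen for this sketch, and other divisors need their own constant and shift.

```python
MAGIC_DIV3 = 0xAAAAAAAB  # ceil(2**33 / 3)


def mulhu(a, b):
    """Upper 32 bits of the unsigned 64-bit product."""
    return ((a * b) >> 32) & 0xFFFFFFFF


def udiv3(n):
    """n // 3 for any unsigned 32-bit n, without a divide instruction."""
    # mulhu keeps bits 63:32 of n * MAGIC; one more shift divides by 2**33
    return mulhu(n, MAGIC_DIV3) >> 1


samples = [0, 1, 2, 3, 5, 100, 0x7FFFFFFF, 0xFFFFFFFE, 0xFFFFFFFF]
results = [udiv3(n) for n in samples]
```

The rounding error introduced by the approximate reciprocal stays below one third for every 32-bit input, so the truncated result always equals the exact quotient.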
RV32F and RV32D
F adds single-precision (32-bit) and D adds double-precision (64-bit) floating-point, both conforming to IEEE 754-2008.

Registers and state
- 32 dedicated FP registers f0–f31, separate from the integer file. Doubling register bandwidth this way avoids widening the instruction register specifier fields.
- f0 is a normal read-write register; unlike integer x0, it is not hardwired to zero.
- When both F and D are implemented, single-precision operations use the lower 32 bits of the 64-bit f registers.
fcsr (floating-point control and status register):
- frm (rounding mode): round-to-nearest-even (default), round-toward-zero, round-down, round-up, round-to-nearest-max-magnitude. Individual instructions can override via a static rounding mode argument.
- fflags: five accrued exception flags, namely Invalid (NV), Divide-by-zero (DZ), Overflow (OF), Underflow (UF), Inexact (NX).
Loads, stores, and register transfers
- flw/fsw: 32-bit load/store using base + 12-bit immediate, same addressing as integer.
- fld/fsd: 64-bit load/store.
- fmv.x.w: copy a single-precision value from f to x register (bit-exact, no conversion).
- fmv.w.x: copy from x to f register.
Arithmetic
- Standard: fadd, fsub, fmul, fdiv, fsqrt, all with .s (single) and .d (double) suffixes.
- fmin/fmax: write the smaller or larger of two operands directly, no branch needed.
- Fused multiply-add (R4 format: three sources, one destination):
  - fmadd: rd = (rs1 × rs2) + rs3
  - fmsub: rd = (rs1 × rs2) − rs3
  - fnmadd: rd = −(rs1 × rs2) − rs3
  - fnmsub: rd = −(rs1 × rs2) + rs3
  - A single rounding step at the end gives higher precision and speed than a separate multiply followed by add.
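The effect of the single rounding step can be seen with a small double-precision experiment. Python floats round after every operation, so the sketch emulates a fused operation by computing the exact result with `fractions.Fraction` and rounding once; `fused_multiply_add` is an illustrative stand-in for what fmadd.d computes, and the operand values are picked to expose the difference.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Compute a*b + c exactly, then round once to the nearest double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# Exact product is 1 - 2**-60, which rounds to 1.0 before the add
separate = a * b + c

# The fused form keeps the low-order term through to the final rounding
fused = fused_multiply_add(a, b, c)
```

The separate multiply-then-add returns 0.0, while the fused form returns −2^−60, the exact answer.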
Comparisons and control flow
No dedicated FP branch instructions. Instead, comparisons write a boolean into an integer register, and standard integer branches act on it:
- feq.s/d, flt.s/d, fle.s/d: write 1 or 0 to an x register.
Conversion, sign injection, and classification
Conversion (fcvt family):
- Between signed/unsigned 32-bit integers and single/double precision in both directions.
- Between single and double precision (fcvt.s.d, fcvt.d.s).
Sign injection — copies a value while manipulating only its sign bit:
- fsgnj: take sign from a second source.
- fsgnjn: take inverted sign from a second source.
- fsgnjx: XOR the sign bits of both sources.
- These underpin pseudoinstructions, each passing the same register as both sources: fabs uses fsgnjx (a sign bit XORed with itself is 0), fneg uses fsgnjn, fmv uses fsgnj.
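Sign injection is pure bit manipulation, which a sketch on raw single-precision bit patterns makes clear. The `bits` and `value` converters and the `_s`-suffixed pseudoinstruction helpers are names assumed for this example.

```python
import struct

SIGN = 0x80000000
MAGNITUDE = 0x7FFFFFFF


def bits(x):
    """Single-precision bit pattern of a Python float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def value(b):
    """Python float for a single-precision bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def fsgnj(a, b):
    return (a & MAGNITUDE) | (b & SIGN)       # sign copied from b


def fsgnjn(a, b):
    return (a & MAGNITUDE) | (~b & SIGN)      # inverted sign of b


def fsgnjx(a, b):
    return a ^ (b & SIGN)                     # XOR of the two signs


# Pseudoinstructions: the same register is used as both sources
def fabs_s(a):
    return fsgnjx(a, a)


def fneg_s(a):
    return fsgnjn(a, a)


def fmv_s(a):
    return fsgnj(a, a)
```

None of these operations inspect the exponent or fraction, so they work unchanged on NaNs, infinities, and signed zeros.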
Classification (fclass.s/d): writes a 10-bit one-hot mask to an integer register identifying which of the 10 IEEE 754 classes the operand is in: −∞, negative normal, negative subnormal, −0, +0, positive subnormal, positive normal, +∞, signaling NaN, quiet NaN.
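A possible software model of fclass.s, working on the raw bit pattern, is sketched below. Mask bit 0 corresponds to −∞ and bit 9 to quiet NaN, following the order of the list above; `fclass_s` is a name chosen for this example.

```python
def fclass_s(b):
    """Return the one-hot class mask for a single-precision bit pattern."""
    sign = b >> 31
    exp = (b >> 23) & 0xFF
    frac = b & 0x7FFFFF
    if exp == 0xFF:
        if frac == 0:
            index = 0 if sign else 7               # -inf / +inf
        else:
            index = 9 if frac & 0x400000 else 8    # quiet / signaling NaN
    elif exp == 0:
        if frac == 0:
            index = 3 if sign else 4               # -0 / +0
        else:
            index = 2 if sign else 5               # subnormal
    else:
        index = 1 if sign else 6                   # normal
    return 1 << index


is_nan = bool(fclass_s(0x7FC00000) & 0b11_0000_0000)  # test both NaN bits at once
```

Because the result is one-hot, a single and with a constant tests for any group of classes, as the NaN check shows.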
RV32A: Atomic Instructions
The A extension provides atomic instructions for synchronization in multiprocessor environments. All RV32A instructions require naturally aligned addresses — hardware cannot efficiently guarantee atomicity across cache-line boundaries.

LR/SC
LR/SC implements an atomic operation across two linked instructions, avoiding a three-operand instruction that would complicate the standard datapath.
- lr.w (load reserved): reads a word from memory into a register and places a reservation on that address.
- sc.w (store conditional): attempts to write to the reserved address.
  - Succeeds: writes the value, sets destination register to 0.
  - Fails (reservation broken by another hart): destination register gets a nonzero code; memory is unchanged.
This pair synthesizes any synchronization primitive, including compare-and-swap (CAS).
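To show how compare-and-swap falls out of the pair, here is a single-threaded toy model. The `Hart` class, its reservation bookkeeping, and `compare_and_swap` are all assumptions of this sketch; real hardware tracks reservations in the cache hierarchy, and ordinary stores (not modeled here) also break them.

```python
class Hart:
    """One hardware thread sharing a memory dict with other harts."""

    def __init__(self, memory, reservations):
        self.memory = memory
        self.reservations = reservations  # address -> set of reserving harts

    def lr_w(self, addr):
        self.reservations.setdefault(addr, set()).add(self)
        return self.memory[addr]

    def sc_w(self, addr, value):
        holders = self.reservations.get(addr, set())
        if self not in holders:
            return 1                      # failure: nonzero, memory unchanged
        self.memory[addr] = value
        holders.clear()                   # a successful store breaks all reservations
        return 0


def compare_and_swap(hart, addr, expected, new):
    """Return True if memory[addr] held `expected` and now holds `new`."""
    while True:
        if hart.lr_w(addr) != expected:
            return False
        if hart.sc_w(addr, new) == 0:
            return True                   # otherwise retry the whole sequence


memory, reservations = {0x100: 5}, {}
h0, h1 = Hart(memory, reservations), Hart(memory, reservations)
first = compare_and_swap(h0, 0x100, 5, 6)    # succeeds
second = compare_and_swap(h1, 0x100, 5, 7)   # fails: value is already 6
```

The retry loop is what distinguishes LR/SC from a hardware CAS: a failed sc.w simply sends the hart back to lr.w.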
AMOs
AMOs execute a full read-modify-write as a single indivisible hardware operation — no interrupt or remote modification can occur between the read and the write.
Execution: read current value → apply ALU operation with a source register → write result back → return the original value to the destination register.
| Instruction | Operation |
|---|---|
| amoswap.w | Swap |
| amoadd.w | Add |
| amoand.w, amoor.w, amoxor.w | Bitwise AND, OR, XOR |
| amomin.w, amomax.w | Signed min/max |
| amominu.w, amomaxu.w | Unsigned min/max |
AMOs scale better than LR/SC polling loops in large multiprocessor systems and streamline atomic I/O device communication.
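The read-modify-write sequence can be sketched as one function that returns the old value. The generic `amo` helper and the `try_lock` spinlock attempt are illustrative names; in this single-threaded model, indivisibility is simply the absence of anything running between the read and the write.

```python
def amo(memory, addr, src, op):
    """Read memory[addr], store op(old, src), and return the old value."""
    original = memory[addr]
    memory[addr] = op(original, src) & 0xFFFFFFFF
    return original


def amoadd_w(memory, addr, src):
    return amo(memory, addr, src, lambda old, s: old + s)


def amoswap_w(memory, addr, src):
    return amo(memory, addr, src, lambda old, s: s)


def amomaxu_w(memory, addr, src):
    return amo(memory, addr, src, max)


def try_lock(memory, addr):
    """Test-and-set: the lock was free if the swapped-out value is 0."""
    return amoswap_w(memory, addr, 1) == 0


memory = {0x0: 10, 0x4: 0}
old = amoadd_w(memory, 0x0, 5)       # old == 10, memory[0x0] == 15
got_lock = try_lock(memory, 0x4)     # True, lock word becomes 1
again = try_lock(memory, 0x4)        # False, lock already held
```

Returning the original value is what makes a one-instruction lock attempt possible: the hart learns the previous state and sets the new one in the same operation.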
Memory ordering
RISC-V uses a relaxed memory model — harts may observe accesses out of program order. All RV32A instructions carry two annotation bits to enforce ordering at critical points:
- aq (acquire): when set, this atomic op is visible before all subsequent memory accesses by this hart.
- rl (release): when set, this atomic op is visible after all previous memory accesses by this hart.
Lock acquire sets aq to ensure the lock is held before guarded data is read. Lock release sets rl to ensure all data writes are visible before the lock is relinquished.
RV32C: Compressed Instructions
The C extension maps the most common 32-bit instructions to 16-bit encodings, shrinking binary size without changing the ISA visible to the compiler or programmer.
- The assembler transparently picks the 16-bit form whenever possible — the compiler emits normal instructions and never knows.
- The hardware decoder expands 16-bit instructions back to their 32-bit equivalents before execution, adding only ~400 gates of overhead.
- The two lowest bits of every 32-bit instruction are always 11; any other pattern signals a 16-bit instruction, making 16/32-bit interleaving unambiguous.
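The length rule is enough to walk a mixed instruction stream. The sketch below assumes only 16-bit and 32-bit encodings are present (longer formats are ignored); `instruction_length` and `split_stream` are hypothetical helper names, 0x0001 is the encoding of c.nop, and 0xFFF30293 is addi x5, x6, -1.

```python
def instruction_length(low_halfword):
    """Length in bytes, decided by the two lowest bits of the first halfword."""
    return 4 if (low_halfword & 0b11) == 0b11 else 2


def split_stream(code):
    """Walk little-endian code bytes and return (offset, length) pairs."""
    boundaries = []
    offset = 0
    while offset + 2 <= len(code):
        half = int.from_bytes(code[offset:offset + 2], "little")
        length = instruction_length(half)
        boundaries.append((offset, length))
        offset += length
    return boundaries


code = (
    (0x0001).to_bytes(2, "little")        # c.nop
    + (0xFFF30293).to_bytes(4, "little")  # addi x5, x6, -1
    + (0x0001).to_bytes(2, "little")      # c.nop
)
layout = split_stream(code)
```

Here `layout` is [(0, 2), (2, 4), (6, 2)]: the 32-bit instruction starts on a 2-byte boundary, which the interleaving rule permits.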
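RV64C diverges slightly from RV32C: it drops c.jal (rare in 64-bit code) and the single-precision FP loads/stores (c.flw, c.fsw, c.flwsp, c.fswsp), reusing their encodings for c.addiw and the 64-bit loads/stores (c.ld, c.sd, c.ldsp, c.sdsp). It also adds c.addw and c.subw; the integer word forms c.lw and c.sw remain available.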
Combined with the 32-bit base, RV32GC produces binaries significantly smaller than architectures with fixed-width encodings.