RV32V: Vector Architecture

Data-level parallelism allows concurrent computation on large datasets, primarily implemented via Single Instruction Multiple Data (SIMD) or vector architectures.

  • SIMD Architecture: Partitions standard 64-bit registers into smaller data elements
    • Embeds data width and operation directly into the instruction opcode
    • Requires new instructions and compiler modifications for every hardware register width expansion
    • Suffers from escalating ISA complexity and instruction bloat
  • Vector Architecture: Gathers discrete objects from main memory into long, sequential vector registers
    • Decouples the vector length and maximum operations per clock cycle from the instruction encoding
    • Processes vectors through pipelined execution units
    • Scatters results back to main memory
    • Reduces total instruction count and leverages stable compiler technology

The separation of vector length from the instruction encoding forms the foundation for a highly flexible and compact computation model.

Vector Computation Instructions

Vector computation instructions inherit core arithmetic, logical, and floating-point operations from the base integer and floating-point ISAs.

  • Operand Suffixes: Denote the classification of source registers
    • .vv: Vector-Vector operation (both operands are vector registers)
    • .vs: Vector-Scalar operation (first operand is vector, second is scalar x or f register)
    • .sv: Scalar-Vector operation (first operand is scalar, second is vector; needed only for non-commutative operations such as subtraction or division, where operand order matters, as shown in the sketch after this list)
  • Fused Operations: Multiply-add instructions utilize three source operands and require a broader set of suffixes
    • Permutations include .vvv, .vvs, .vsv, and .vss
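
A minimal sketch of how these suffixes might appear in assembly. The specific mnemonics (vadd, vsub, vmadd) and register choices are illustrative assumptions, here supposing the vector registers have been configured to hold floating-point elements:

    vadd.vv   v2, v0, v1        # v2[i] = v0[i] + v1[i]         (vector + vector)
    vadd.vs   v2, v0, fa1       # v2[i] = v0[i] + fa1           (scalar broadcast)
    vsub.sv   v2, fa1, v0       # v2[i] = fa1 - v0[i]           (order matters, hence .sv)
    vmadd.vvs v3, v0, v1, fa1   # v3[i] = v0[i] * v1[i] + fa1   (three-source fused multiply-add)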

To avoid encoding the specific data type and element width into these instruction opcodes, the architecture employs a system of dynamic configuration.

Vector Registers and Dynamic Register Typing

The architecture provides 32 vector registers (v0-v31), but the specific capacity of each register remains unconstrained by the ISA.

  • Dynamic Register Typing: Data types and element lengths are associated directly with the vector registers, not the instruction opcodes
    • vsetdcfg: Configures the active vector registers and establishes their data types
    • Supported types include integers (X8, X16, X32, X64), unsigned integers (X8U, X16U, X32U, X64U), and floating-point formats (F16, F32, F64)
    • Implicit conversions occur automatically when vector and scalar operands possess differing types
  • Maximum Vector Length (mvl): An internal hardware threshold defining the maximum number of elements a vector register can hold
    • Calculated dynamically by the processor based on total dedicated vector SRAM and the specific data types requested
    • Disabling unused vector registers reallocates hardware memory to the active registers, proportionally increasing mvl
  • Context Switching: Dynamic typing minimizes state-save overhead
    • Only explicitly enabled vector registers are saved or restored during an interrupt or context switch
    • Unused registers incur zero context-switch penalty
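
A minimal configuration sketch in the DAXPY style: enable two double-precision vector registers before entering a loop. The bit encoding of the value loaded into t0 is assumed here for illustration, not taken from this text:

    li       t0, 2<<25     # configuration word requesting two enabled F64 vector registers (encoding assumed)
    vsetdcfg t0            # v0 and v1 become F64 vectors; hardware computes mvl from the SRAM it can dedicate
                           # only these two enabled registers need saving on an interrupt or context switch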

Once the hardware allocates register memory and configures the element types, data must be transferred from main memory into these dynamically sized registers.

Vector Loads and Stores

Data transfer operations dictate how memory structures map to sequential vector register elements, controlled by the current vector length (vl).

  • Sequential Access (Dense Arrays): vld (load) and vst (store)
    • Transfer elements from contiguous memory addresses
    • Utilize a 7-bit unsigned immediate offset scaled by the element size
  • Strided Access (Multi-dimensional Arrays): vlds and vsts
    • Enable row-major or column-major traversals
    • Require two scalar source registers: one for the base address, one for the byte stride between elements
  • Indexed Access (Sparse Arrays): vldx (gather) and vstx (scatter)
    • Support indirect data access via index tables
    • Require a scalar base address register and a vector register containing specific byte offsets for each element
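
The three addressing modes side by side, as a hedged sketch; register assignments are arbitrary and the exact operand order is assumed. Each instruction transfers vl elements:

    vld   v0, a1          # dense:   v0[i] = mem[a1 + i * elem_size]
    vst   v0, a1          # dense:   mem[a1 + i * elem_size] = v0[i]
    vlds  v1, a2, t0      # strided: v1[i] = mem[a2 + i * t0], with t0 holding the byte stride
    vldx  v2, a3, v4      # indexed: v2[i] = mem[a3 + v4[i]]   (gather)
    vstx  v2, a3, v4      # indexed: mem[a3 + v4[i]] = v2[i]   (scatter)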

With data successfully loaded into vector registers, the processor hardware executes the specified operations across multiple elements simultaneously.

Parallelism and Hardware Flexibility

Vector element operations are mathematically independent, permitting parallel hardware execution bounded only by microarchitectural resources.

  • Execution Width: Processors execute multiple elements per clock cycle (e.g., two, four, or eight 64-bit elements)
  • Width Ratios: Narrower data types naturally yield higher parallel throughput
    • A datapath capable of four 64-bit operations per cycle will inherently process eight 32-bit or sixteen 16-bit operations concurrently
  • Implementation Abstraction: The maximum vector length (mvl) and parallel execution width are hardware-defined
    • A single compiled binary scales automatically across different physical implementations without recompilation
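
Stated as a relation, using the 256-bit datapath implied by the four-wide 64-bit example above:

    elements per clock = datapath width / element width
    256-bit datapath:  256/64 = 4 doubles,  256/32 = 8 singles,  256/16 = 16 half-precision elements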

While elements process in parallel, control flow within data-parallel loops requires element-level management to handle divergent execution paths.

Conditional Execution and Predication

Conditional operations inside vectorized loops are handled via predicate masking rather than branch instructions.

  • Vector Predicate Registers: Eight dedicated masking registers (vp0-vp7)
    • Contain exactly as many 1-bit elements as the current maximum vector length (mvl)
    • A bit value of 1 permits the corresponding vector element to be modified; 0 forces the element to remain unchanged
  • Predicate Generation: Comparison instructions (vplt, vpeq, etc.) evaluate vector conditions and populate a predicate register with the Boolean results
  • Mask Application: Vector computations explicitly specify either vp0 or vp1 as the governing mask
  • Predicate Manipulation:
    • vpswap: Rapidly exchanges another predicate register into the active vp0 or vp1 slot
    • Logical operations (vpand, vpor, vpxor, vpnot) allow predicates to be combined for complex nested conditionals
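
A hedged sketch of predicated execution for the pattern "if (x[i] < y[i]) x[i] = x[i] - y[i]". Naming the mask as a trailing operand is an assumption about the assembly syntax:

    vplt    vp0, v0, v1        # vp0[i] = 1 where v0[i] < v1[i], else 0
    vsub.vv v0, v0, v1, vp0    # where vp0[i] = 1: v0[i] -= v1[i]; elsewhere v0[i] is left unchanged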

In addition to conditionally executing standard operations, specialized instructions manage dynamic loop lengths and internal element permutations.

Miscellaneous Vector Instructions

Advanced vector management relies on instructions that handle loop execution bounds and intra-register data movement.

  • Vector Length Management (setvl):
    • Configures the active vector length register (vl)
    • Automatically limits vector operations to valid array bounds, eliminating the need for separate edge-case loop logic
  • Element Permutation:
    • vselect: Gathers elements from a source vector using indices provided by a second vector
    • vmerge: Merges elements from two distinct source vectors based on the bit values of a predicate mask
    • vextract: Copies a subset of elements from a calculated starting point in one vector to the beginning of a destination vector
  • Recursive Halving (Reductions):
    • vextract facilitates highly efficient binary associative reductions (e.g., summing all elements in a vector)
    • The vector is iteratively split and added to itself until vl equals 1
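
A hedged sketch of the recursive-halving reduction: sum the n elements of v0 into v0[0]. The element count in a0 is assumed to be a power of two, and the vextract operand order is an assumption:

        li       t1, 1              # loop exit value: stop when one element remains
    halve:
        srli     a0, a0, 1          # halve the active element count
        setvl    t0, a0             # vl = a0: operate only on the lower half
        vextract v1, v0, a0         # copy the upper half of v0 (starting at index a0) into v1
        vadd.vv  v0, v0, v1         # add the upper half into the lower half, element by element
        bne      a0, t1, halve      # repeat until a single element, the total, is left in v0[0]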

Integrating these specialized instructions produces compact, efficient code that drastically outperforms equivalent code on alternative data-parallel architectures.

Architectural Comparisons: Vector vs. SIMD

The dynamic structural advantages of vector instructions result in significant performance and efficiency gains over incremental SIMD extensions (e.g., x86 AVX, ARM NEON, MIPS MSA).

  • Code Size and Density:
    • Vector code eliminates “strip-mining” bookkeeping and fringe-element handling
    • A vector loop handles arrays of any length (including zero) inherently via setvl
  • Instruction Bandwidth:
    • SIMD implementations require 10 to 20 times more dynamically executed instructions to process the same dataset due to short register lengths
    • Fewer instruction fetches and decodes in vector architecture directly translate to reduced energy consumption
  • ISA Stability:
    • SIMD mandates hundreds of new opcodes whenever hardware registers widen
    • Vector ISA remains static; execution automatically utilizes wider registers or expanded memory via dynamic scaling
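
As a closing sketch of the code-density claim, a DAXPY-style loop (y = a*x + y) built from the instructions above. The vmadd mnemonic, its .vsv suffix, and the vsetdcfg encoding are assumptions; the structure is the point: setvl absorbs the fringe elements that SIMD code must strip-mine by hand, and the same binary runs unchanged on wider hardware.

    # a0 = n, a1 = &x[0], a2 = &y[0], fa0 = a
        li        t0, 2<<25          # request two F64 vector registers (encoding assumed)
        vsetdcfg  t0                 # enable v0 and v1
    loop:
        setvl     t0, a0             # vl = t0 = min(mvl, remaining n); correct even for n = 0
        vld       v0, a1             # load vl elements of x
        vld       v1, a2             # load vl elements of y
        vmadd.vsv v1, v0, fa0, v1    # v1[i] = v0[i] * fa0 + v1[i]
        vst       v1, a2             # store vl results back to y
        slli      t1, t0, 3          # t1 = vl * 8 bytes
        add       a1, a1, t1         # advance the x pointer
        add       a2, a2, t1         # advance the y pointer
        sub       a0, a0, t0         # n -= vl
        bnez      a0, loop           # repeat until the whole array is processed
        ret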