Vector Architecture
Vector architectures grab sets of data elements scattered throughout memory, place them into large sequential register files, operate on them using deep pipelines, and disperse the results back to memory. A single vector instruction controls dozens of register-register operations on independent data elements. These large register files function as compiler-controlled buffers that hide memory latency and leverage memory bandwidth. By paying the long memory latency only once per vector load or store, vector architectures achieve high performance without the energy demands and design complexity of highly out-of-order superscalar processors.
To execute these operations efficiently, vector architectures rely on specialized hardware components.
Hardware Components
- Vector registers: Large sequential storage structures. The RV64V architecture implements 32 vector registers. The register file provides multiple read and write ports (e.g., 16 read and 8 write ports) connected to functional units via crossbar switches to allow simultaneous vector operations.
- Vector functional units: Fully pipelined execution units capable of starting a new operation on every clock cycle.
- Vector load/store unit: Moves data between vector registers and memory with high bandwidth, typically one word per clock cycle after an initial latency.
- Scalar registers: Standard general-purpose and floating-point registers that provide input data to vector functional units and compute addresses for the vector load/store unit.
- Maximum vector length (mvl): A hardware-specific parameter defining the maximum number of elements a vector operation can process, decoupled from the instruction opcode.
- Dynamic Register Typing: Vector registers are configured dynamically prior to execution rather than specifying sizes in individual instruction opcodes.
- An element width (SEW) field specifies the data size (e.g., 8, 16, 32, or 64 bits).
- A length multiplier (LMUL) field allows grouping multiple registers to form longer vectors, optimizing register file usage.
- Dynamic typing enables implicit type conversions during arithmetic operations.
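As a concrete sketch of how the element-width (SEW) and length-multiplier (LMUL) fields interact: in RVV-style designs, the number of elements a vector operand can hold is (register width × LMUL) / SEW. A minimal illustration in Python; the 512-bit register width is a hypothetical value, not mandated by any specification:

```python
def max_elements(vlen_bits, sew_bits, lmul):
    # Elements per vector operand when lmul registers of vlen_bits
    # each are grouped and divided into sew_bits-wide elements.
    return (vlen_bits * lmul) // sew_bits

print(max_elements(512, 64, 1))  # 8 double-precision elements
print(max_elements(512, 16, 4))  # 128 16-bit elements in a 4-register group
```

Narrower elements and larger register groups both raise the effective vector length without changing the instruction encoding.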
These specialized components work together to process complete loops of data through deep pipelines.
Vector Execution and Chaining
- Vector execution dramatically reduces dynamic instruction bandwidth. A single vector instruction replaces an entire scalar loop body and its associated loop-control overhead.
- Pipeline stalls occur only once per vector instruction rather than once per individual vector element.
- Chaining: A dependency resolution mechanism where the results from one vector functional unit are forwarded directly to the input of another functional unit.
- Flexible chaining: Permits a vector instruction to chain to any other active vector instruction, provided no structural hazards are generated.
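The benefit of chaining can be seen with a first-order timing model for two dependent vector instructions. The pipeline latencies below are illustrative, not taken from any particular machine:

```python
def unchained_cycles(n, lat1, lat2):
    # Without chaining, the second instruction waits for the entire
    # first result vector before it can start.
    return (lat1 + n) + (lat2 + n)

def chained_cycles(n, lat1, lat2):
    # With chaining, each result element is forwarded as soon as it
    # is produced, so only the pipeline latencies are serialized.
    return lat1 + lat2 + n

print(unchained_cycles(64, 6, 7))  # 141 cycles
print(chained_cycles(64, 6, 7))    # 77 cycles
```

For long vectors the vector length `n` appears once rather than twice, so chaining roughly halves the time for a dependent pair.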
Understanding how these chained operations flow through the pipeline requires a specific timing metric.
Vector Execution Time
- Convoy: A set of vector instructions that can safely execute together without structural hazards.
- Chime: The unit of time required to execute a single convoy.
- A sequence of vector instructions containing m convoys executes in exactly m chimes.
- For an initiation rate of one element per cycle and a vector length of n, the execution time of a sequence of m convoys is approximately m × n clock cycles.
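Under this model the cycle count is simply the product of the convoy count and the vector length. A sketch, ignoring start-up overhead:

```python
def convoy_cycles(m_convoys, n):
    # One chime per convoy; each chime processes all n elements at
    # one element per clock.
    return m_convoys * n

print(convoy_cycles(3, 64))  # 192 cycles for 3 convoys of 64 elements
```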
To improve upon this baseline execution time of one element per cycle, hardware can process multiple elements simultaneously.
Multiple Lanes
- Vector instruction semantics guarantee that element operations are independent, simplifying the design of highly parallel execution units.
- Vector lanes: Parallel execution pipelines added to functional units to process multiple vector elements per clock cycle.
- Each lane holds a distinct, interleaved portion of the vector register file and one execution pipeline from each functional unit.
- Calculations are restricted to the local lane, avoiding interlane communication overhead and reducing required register file ports.
- Doubling the number of lanes scales peak throughput and halves the number of clock cycles required for a chime.
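Since each lane retires one element per clock, a chime shrinks proportionally with the lane count. A sketch (lane counts are illustrative):

```python
import math

def chime_cycles(n, lanes):
    # ceil handles vector lengths that are not an exact multiple of
    # the lane count.
    return math.ceil(n / lanes)

print(chime_cycles(64, 1))  # 64 cycles
print(chime_cycles(64, 4))  # 16 cycles
```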
While parallel lanes accelerate operations on full vectors, programs frequently process datasets that do not perfectly align with hardware vector lengths.
Vector-Length Registers and Strip Mining
- Algorithms frequently require vector lengths that are unknown at compile time or differ from mvl.
- Vector-length register (vl): A dedicated register that controls the length of any active vector operation, strictly bound such that 0 ≤ vl ≤ mvl.
- Strip mining: A code generation technique used when the application vector length exceeds mvl.
- The vector loop is partitioned into blocks of at most mvl elements.
- One loop executes iterations of length exactly mvl, while a fixup loop processes the remainder (n mod mvl).
- The vsetvl instruction efficiently manages strip mining by automatically setting vl to the minimum of mvl and the number of remaining loop iterations.
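The strip-mining pattern can be sketched directly; the generator below mimics the vector length a vsetvl-style instruction would return on each iteration:

```python
def strip_mine(n, mvl):
    # Yield the vector length used by each strip-mined iteration:
    # min(mvl, elements remaining), until all n elements are done.
    done = 0
    while done < n:
        vl = min(mvl, n - done)
        yield vl
        done += vl

print(list(strip_mine(10, 4)))  # [4, 4, 2]
```

Note that every iteration except possibly the last runs at the full mvl, so the remainder handling costs at most one short pass.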
Beyond mismatched vector lengths, programs often contain conditional logic that disrupts straight-line vector execution.
Mask Registers
- Conditional statements (e.g., IF) inside loops create control dependences that inherently block standard vectorization.
- Vector-mask control: Utilizes Boolean vectors stored in predicate registers (or the least-significant bit of a designated vector register) to control execution per element.
- When a vector mask is active, subsequent vector instructions update destination elements only where the corresponding mask bit is 1; elements corresponding to 0 remain unaffected.
- IF-conversion: The compiler process of transforming an IF statement into straight-line execution using vector-mask control.
- Vector masking incurs execution overhead because the hardware pipeline must still consume time for elements where the mask bit is 0.
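Per-element predication can be modeled directly. This sketch applies a mask to a vector add; masked-off destination elements keep their old values:

```python
def masked_add(dst, a, b, mask):
    # Update dst[i] only where mask[i] is 1; elsewhere the old
    # destination value is preserved.
    return [x + y if m else d for d, x, y, m in zip(dst, a, b, mask)]

print(masked_add([0, 0, 0], [1, 2, 3], [10, 20, 30], [1, 0, 1]))
# [11, 0, 33]
```

Note that the hardware still steps through all elements, including the masked-off middle one, which is the overhead mentioned above.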
Successfully calculating conditionally masked elements still relies on the memory system’s ability to supply data at high speeds.
Memory Banks
- Vector load/store units act as prefetch units, requiring massive memory bandwidth to sustain initiation rates of one word per clock cycle.
- Start-up penalties for load/store units are high (frequently over 100 clock cycles) but are amortized across the entire vector block.
- Memory systems spread accesses across multiple independent memory banks to produce data at the required rate.
- Multiple independent banks are necessary to support non-sequential access patterns.
Independent memory banks become critical when algorithms access data structures in non-sequential patterns.
Stride
- Stride: The precise memory distance separating elements that are to be gathered into a single vector register.
- Multidimensional arrays (e.g., matrices) are linearized in memory, meaning operations along one dimension (like a column in a row-major array) require accessing non-adjacent memory locations.
- Vector architectures resolve this by providing explicit load/store instructions with a scalar stride parameter, gathering non-contiguous memory into dense vector registers.
- Memory bank conflicts: Nonunit strides can cause multiple elements to map to the same memory bank, resulting in a stall. A stall occurs if: Number of banks / GCD(stride, Number of banks) < bank busy time.
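The conflict condition follows from how many distinct banks a given stride actually touches. A sketch using the GCD form of the test; the bank counts and busy times are illustrative:

```python
from math import gcd

def bank_stall(stride, num_banks, bank_busy_cycles):
    # A stride of s touches num_banks // gcd(s, num_banks) distinct
    # banks; if the same bank is revisited sooner than its busy
    # time, accesses stall.
    distinct = num_banks // gcd(stride, num_banks)
    return distinct < bank_busy_cycles

print(bank_stall(1, 8, 6))   # False: unit stride cycles through all 8 banks
print(bank_stall(32, 8, 6))  # True: stride 32 hits the same bank every access
```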
While strided accesses handle regular geometric patterns, many algorithms utilize highly irregular data structures.
Gather-Scatter
- Sparse matrices utilize compacted memory representations where elements are accessed indirectly.
- Index vectors: Vector registers containing memory offsets mapping to non-zero elements.
- Gather (Load vector indexed): Hardware fetches a vector using a base address added to the offsets in an index vector, populating a dense vector register.
- Scatter (Store vector indexed): Hardware stores a dense vector register back into a sparse memory layout using an index vector.
- Gather/scatter instructions execute slower than unit-stride loads/stores due to complex individual address generation and high probability of memory bank conflicts.
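Gather and scatter can be expressed as indexed loads and stores over a flat memory array. A minimal sketch:

```python
def gather(memory, base, index_vector):
    # Load vector indexed: fetch memory[base + offset] for each
    # offset, producing a dense result vector.
    return [memory[base + i] for i in index_vector]

def scatter(memory, base, index_vector, values):
    # Store vector indexed: write the dense values back to the
    # sparse locations named by the index vector.
    for i, v in zip(index_vector, values):
        memory[base + i] = v

mem = list(range(16))
print(gather(mem, 0, [3, 7, 11]))  # [3, 7, 11]
```

Each element requires its own address computation, which is why hardware cannot issue these at the unit-stride rate.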
Leveraging these advanced memory and masking capabilities requires tight integration with compilation strategies.
Programming Vector Architectures
- Vector compilers provide explicit feedback to programmers at compile time, identifying exactly why a code block failed to vectorize (e.g., identifying loop-carried dependences).
- This feedback loop allows programmers to manually restructure algorithms or apply explicit directives.
- Programmer directives are heavily required for features like gather-scatter, where the compiler cannot statically guarantee that an index vector contains distinct, dependency-free values.