Graphics Processing Units (GPUs)

Graphics Processing Units (GPUs) are highly parallel, affordable computing devices consisting of thousands of floating-point units. Originally engineered for graphics generation, GPUs operate in a heterogeneous environment alongside a system processor, system memory, and dedicated graphics memory. Modern GPUs have evolved to accelerate broad data-level parallel (DLP) tasks, dominating both scientific computation and machine learning workloads.

To fully exploit this massive parallelism, developers require a structured programming model that bridges the gap between hardware capabilities and software execution.

The CUDA Programming Model (SIMT)

Compute Unified Device Architecture (CUDA) provides a C-like programming environment designed to manage heterogeneous execution and multifaceted parallelism. CUDA unifies multithreading, MIMD, SIMD, and instruction-level parallelism under a paradigm classified as Single Instruction, Multiple Thread (SIMT).

  • Execution separation: Functions designated with __device__ or __global__ execute on the GPU (device), while __host__ functions run on the system processor.
  • Thread Hierarchy:
    • CUDA Thread: The fundamental programming primitive, mapping to a single data element.
  • Thread Block: A group of CUDA Threads; blocks execute independently of one another and in any order. Threads within the same block can communicate through local (shared) memory and synchronize via barriers and atomic memory operations.
    • Grid: The top-level software structure representing the entire GPU computation, composed of multiple Thread Blocks.
  • Variable scope: Variables tagged with __device__ are mapped to GPU memory, making them accessible to all processors on the device.
  • Execution launch: The host invokes GPU functions using syntax that specifies the grid and block dimensions (e.g., name<<<dimGrid, dimBlock>>>(...)).
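
The hierarchy above can be illustrated with a minimal DAXPY-style kernel (a sketch; the kernel name, sizes, and launch parameters are illustrative, and error checking is omitted):

```cuda
#include <cuda_runtime.h>

// Each CUDA Thread handles one data element: y[i] = a*x[i] + y[i].
__global__ void daxpy(int n, double a, const double *x, double *y) {
    // Map this thread onto a unique element via its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: n may not divide evenly
        y[i] = a * x[i] + y[i];
}

int main(void) {
    int n = 1 << 20;
    double *x, *y;
    cudaMalloc(&x, n * sizeof(double));   // allocations in GPU memory
    cudaMalloc(&y, n * sizeof(double));
    // Grid of ceil(n/256) Thread Blocks, 256 CUDA Threads per block.
    int dimBlock = 256;
    int dimGrid  = (n + dimBlock - 1) / dimBlock;
    daxpy<<<dimGrid, dimBlock>>>(n, 2.0, x, y);
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(y);
    return 0;
}
```

The `if (i < n)` guard reflects the SIMT mapping: every thread runs the same code, and threads past the end of the data simply predicate themselves off.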

By structuring computation into grids and blocks, the programming model maps directly onto the hardware mechanisms responsible for distributing and executing these workloads across physical processors.

GPU Computational Structures and Scheduling

The underlying hardware executes the SIMT software model using an array of multithreaded SIMD Processors (Streaming Multiprocessors or SMs).

  • Two-Level Scheduling Hierarchy:
    • Thread Block Scheduler: Hardware that assigns entire Thread Blocks to available multithreaded SIMD Processors.
    • SIMD Thread Scheduler: Resides within each SIMD Processor, selecting which ready thread of SIMD instructions to dispatch to execution units on a given clock cycle.
  • SIMD Lanes: Parallel execution units inside the SIMD Processor. For example, a 32-wide SIMD thread executing on 16 physical SIMD Lanes requires two clock cycles to complete one instruction.
  • Hardware Multithreading: GPUs hide long memory latencies by maintaining many active threads and rapidly context-switching among them.
  • Register Allocation: Fast context switching is enabled by massive register files (up to 65,536 32-bit registers per SIMD Processor). Registers are dynamically allocated to Thread Blocks upon creation. Fewer registers per thread allows for more concurrent active threads, requiring careful balancing by the programmer.

To control these physical execution units and manage the dynamic register allocations, the GPU relies on a specialized instruction set architecture designed for parallel thread execution.

Instruction Set Architecture: PTX

NVIDIA targets an abstract instruction set architecture known as Parallel Thread Execution (PTX) rather than exposing the physical hardware instructions directly.

  • Abstraction and Translation: PTX provides compatibility across GPU generations. The translation from PTX to the internal hardware instruction format occurs in software at load time.
  • Instruction Format: Follows the structure opcode.type d, a, b, c, specifying the operation, data type (e.g., .b8, .f32, .u16), destination, and source operands.
  • Gather-Scatter Memory Access: Unlike traditional vector architectures, GPUs do not possess separate instructions for sequential, strided, or gather-scatter transfers; all memory accesses operate as gather-scatter.
  • Address Coalescing: To achieve unit-stride memory efficiency, an Address Coalescing Unit monitors the memory requests from the SIMD Lanes. When the lanes of a SIMD thread issue requests to adjacent addresses, the hardware dynamically coalesces them into a single, high-bandwidth block transfer.

While address coalescing handles uniform data access efficiently, data-parallel algorithms also frequently encounter non-uniform control flow, necessitating specialized mechanisms for conditional logic.

Conditional Branching in GPUs

GPUs process conditional statements using a combination of hardware mechanisms rather than relying purely on software-managed masks.

  • Branch Divergence: A branch diverges when some lanes within a SIMD thread follow one path while others follow a different path.
  • Hardware Synchronization Stack: Deeply nested control flow is managed by an internal stack. When a branch diverges, the GPU pushes a stack entry containing an identifier, target address, and thread-active mask. Instruction markers pop stack entries to converge the paths once the conditional execution completes.
  • Predication: For simple IF-THEN-ELSE statements, the PTX assembler generates predicated instructions using 1-bit per-lane predicate registers.
  • Execution Efficiency: All SIMD Lanes step through the operations of each path, but only lanes with an active predicate bit store their results. Divergent conditional execution therefore runs at reduced efficiency: at best 50% for an equal-length IF-THEN-ELSE, since every lane spends time on both paths. If all mask bits uniformly evaluate to 0 or 1, the GPU skips the unused execution path entirely.

The efficiency of both conditional execution and address coalescing is heavily influenced by how quickly the SIMD lanes can retrieve and store their data within the GPU’s memory hierarchy.

GPU Memory Structures

To support high-throughput parallel computation and mitigate the latency associated with off-chip access, GPUs utilize a heavily partitioned memory hierarchy.

  • Private Memory: Off-chip DRAM allocated individually to each SIMD Lane. It is utilized for stack frames, register spilling, and private variables. It is strictly unshared but cached in L1 and L2.
  • Local (Shared) Memory: On-chip SRAM assigned to each multithreaded SIMD Processor. It features low latency and high bandwidth (e.g., 128 bytes/clock). It is shared exclusively among SIMD Lanes within the same processor and is dynamically allocated to a Thread Block.
  • GPU (Global) Memory: Off-chip DRAM shared across the entire GPU, all Thread Blocks, and accessible by the system host.
  • Streaming Caches: While original GPUs relied strictly on multithreading to hide DRAM latency, modern architectures include caches to lower average latency and relieve register pressure.
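
The three memory spaces can be seen together in one kernel sketch (CUDA; the kernel and variable names are illustrative):

```cuda
__global__ void spaces_demo(const float *in, float *out, int n) {
    __shared__ float tile[256];      // Local (shared) memory: on-chip SRAM,
                                     // visible to the whole Thread Block
    float private_val;               // Private: per-lane registers, spilled
                                     // to off-chip private memory if needed
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];   // in/out live in GPU (global) memory
        __syncthreads();             // barrier across the Thread Block
        private_val = tile[threadIdx.x] * 2.0f;
        out[i] = private_val;
    }
}
```

Staging data from global memory into the shared tile is the standard way to trade one high-latency off-chip access for many low-latency on-chip ones.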

As memory bottlenecks were addressed, manufacturers scaled the underlying architecture to maximize throughput and adapt to newer workloads like deep learning.

Innovations in Recent GPU Architectures

Recent generations of GPUs have introduced aggressive microarchitectural enhancements to increase resource utilization, memory bandwidth, and system integration.

  • Dual SIMD Thread Schedulers: Advanced GPUs deploy multiple instruction dispatch units per SIMD Processor, allowing instructions from two independent threads to be issued simultaneously to different execution units (e.g., yielding 64 active lanes per cycle).
  • Enhanced Floating-Point Hardware: Dedicated support for fast single-, double-, and half-precision floating-point arithmetic.
  • High-Bandwidth Memory (HBM/HBM2): Integration of stacked memory utilizing ultra-wide buses (e.g., 4096 data wires) to provide peak bandwidths exceeding 700 GB/s.
  • NVLink Interconnect: A high-speed chip-to-chip interconnect providing up to 160 GB/s of bandwidth, mitigating PCIe bottlenecks and enabling seamless communication across multiple GPUs and host CPUs.
  • Unified Virtual Memory: Implementation of demand paging within a unified virtual address space that spans both CPUs and GPUs in the system.

Understanding these advanced structures clarifies how GPUs compare against traditional approaches to data-level parallelism, specifically vector and multimedia SIMD architectures.

Comparing GPUs with Vector and Multimedia SIMD Architectures

While GPUs share foundational goals with both Vector processors and Multimedia SIMD processors, they differ significantly in their execution paradigms and memory interactions.

GPUs vs. Vector Architectures

  • Processing Units: A GPU’s multithreaded SIMD Processor is functionally analogous to a vector processor, and SIMD Lanes act as vector lanes. However, GPUs utilize massive multithreading, which vector architectures generally omit.
  • Register Layout: Vector registers hold contiguous blocks of elements. In a GPU, a vector is distributed across the individual registers of the constituent SIMD Lanes.
  • Memory Access: Vector architectures utilize explicit unit-stride, non-unit stride, and gather-scatter instructions. GPUs implicitly rely on hardware Address Coalescing to optimize gather-scatter accesses into block transfers.
  • Control Flow: Vector architectures utilize a Control Processor to broadcast operations and compute addresses. GPUs distribute this responsibility between the Thread Block Scheduler and runtime address coalescing hardware.

GPUs vs. Multimedia SIMD

  • Similarities: Both leverage multiple SIMD lanes, utilize hardware multithreading to improve processor utilization, feature memory protection/demand paging, and utilize caching mechanisms.
  • Differences: Multimedia SIMD instructions execute on the host CPU in a tightly integrated manner. GPUs function as discrete multiprocessors across an I/O bus, possessing separate physical main memory, orders of magnitude more hardware threads, and native support for gather-scatter memory access.
