GPUs

Graphics Processing Units (GPUs) are highly parallel, affordable computing devices consisting of thousands of floating-point units. Originally engineered for graphics generation, GPUs operate in a heterogeneous environment alongside a system processor, system memory, and dedicated graphics memory.

To fully exploit this massive parallelism, developers require a structured programming model that bridges the gap between hardware capabilities and software execution.

CUDA Programming Model

Compute Unified Device Architecture (CUDA) provides a C-like programming environment designed to manage heterogeneous execution and multifaceted parallelism. CUDA unifies multithreading, MIMD, SIMD, and instruction-level parallelism under a paradigm classified as Single Instruction, Multiple Thread (SIMT).

  • Execution separation: Functions designated with __device__ or __global__ execute on the GPU (device), while __host__ functions run on the system processor.
  • Thread Hierarchy:
    • CUDA Thread: The fundamental programming primitive, mapping to a single data element.
    • Thread Block: A grouping of CUDA Threads; blocks can execute independently and in any order. Threads within a block can share data through on-chip local memory and synchronize at barriers.
    • Grid: The top-level software structure representing the entire GPU computation, composed of multiple Thread Blocks.
  • Variable scope: Variables tagged with __device__ are mapped to GPU memory, making them accessible to all processors on the device.
  • Execution launch: The host invokes GPU functions using syntax that specifies the grid and block dimensions (e.g., name<<<dimGrid, dimBlock>>>(... parameter list ...)); a minimal kernel sketch follows this list.
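
To make the hierarchy concrete, here is a minimal sketch of a SAXPY-style kernel and its launch; the array names, the scalar 2.0f, and the 256-thread block size are illustrative assumptions, not from the text above.

    // Kernel: runs on the device; each CUDA Thread handles one element.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard threads past the end
            y[i] = a * x[i] + y[i];
    }

    // Host-side launch, given int n and device pointers d_x, d_y (e.g., from cudaMalloc):
    // a Grid of enough 256-thread Blocks to cover n elements.
    int nblocks = (n + 255) / 256;
    saxpy<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);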

Hardware Structure

The underlying hardware executes the SIMT software model using an array of multithreaded SIMD Processors (Streaming Multiprocessors or SMs).

  • Two-Level Scheduling Hierarchy:
    • Thread Block Scheduler: Hardware that assigns entire Thread Blocks to available multithreaded SIMD Processors.
    • SIMD Thread Scheduler: Resides within each SIMD Processor, selecting which ready thread of SIMD instructions to dispatch to execution units on a given clock cycle.
  • SIMD Lanes: Parallel execution units inside the SIMD Processor. For example, a 32-wide SIMD thread executing on 16 physical SIMD Lanes requires two clock cycles to complete one instruction.
  • Hardware Multithreading: GPUs hide long memory latencies by maintaining many active threads and rapidly context-switching among them.
  • Register Allocation: Fast context switching is enabled by massive register files (up to 65,536 32-bit registers per SIMD Processor). Registers are dynamically allocated when a Thread Block is created. Using fewer registers per thread allows more threads to be active concurrently, so the programmer must balance register pressure against thread count (see the worked example after this list).
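
A worked example of that balance, using the register-file size above (the per-thread register counts are assumed for illustration; __launch_bounds__ is the standard CUDA hint for capping register use):

    // 65,536 32-bit registers per SIMD Processor (SM):
    //   64 registers/thread -> 65,536 / 64 = 1,024 resident threads (32 warps of 32)
    //   32 registers/thread -> 65,536 / 32 = 2,048 resident threads (64 warps of 32)
    // More resident threads give the SIMD Thread Scheduler more choices for hiding
    // memory latency, at the cost of fewer registers for each thread.
    __global__ void __launch_bounds__(256)   // cap: at most 256 threads per Thread Block
    occupancy_hint_kernel(float *data) { /* ... */ }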

Figure: A multithreaded SIMD Processor with 16 SIMD Lanes. The SIMD Thread Scheduler holds, say, 64 independent threads of SIMD instructions, which it schedules using a table of 64 program counters (PCs). Note that each SIMD Lane has 1024 32-bit registers.

PTX

NVIDIA targets an abstract instruction set architecture known as Parallel Thread Execution (PTX) rather than exposing the physical hardware instructions directly.

  • Abstraction and Translation: PTX provides compatibility across GPU generations. The translation from PTX to the internal hardware instruction format occurs in software at load time.
  • Instruction Format: Follows the structure opcode.type d, a, b, c, specifying the operation, data type (e.g., .b8, .f32, .u16), destination, and source operands; a short example appears after this list.
  • Gather-Scatter Memory Access: Unlike traditional vector architectures, GPUs do not possess separate instructions for sequential, strided, or gather-scatter transfers; all memory accesses operate as gather-scatter.
  • Address Coalescing: To recover unit-stride memory efficiency, an Address Coalescing Unit monitors the memory requests issued by the SIMD Lanes. When the lanes of a SIMD thread issue requests to adjacent addresses in the same cycle, the hardware dynamically coalesces them into a single, high-bandwidth block transfer.
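
For example, the inner body of a DAXPY-style loop might compile to PTX along these lines (a hedged sketch: the register assignments, with R8 holding the byte-scaled element index and RD4 the scalar a, are assumptions):

    ld.global.f64  RD0, [X+R8]    ; RD0 = X[i]
    ld.global.f64  RD2, [Y+R8]    ; RD2 = Y[i]
    mul.f64        RD0, RD0, RD4  ; RD0 = X[i] * a
    add.f64        RD0, RD0, RD2  ; RD0 = X[i] * a + Y[i]
    st.global.f64  [Y+R8], RD0    ; Y[i] = X[i] * a + Y[i]

Each instruction follows the opcode.type d, a, b, c pattern described above.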

Conditional Branching

GPUs process conditional statements using a combination of hardware mechanisms rather than relying purely on software-managed masks.

  • Branch Divergence: A branch diverges when some lanes within a SIMD thread follow one path while others follow a different path.
  • Hardware Synchronization Stack: Deeply nested control flow is managed by an internal stack. When a branch diverges, the GPU pushes a stack entry containing an identifier, target address, and thread-active mask. Instruction markers pop stack entries to converge the paths once the conditional execution completes.
  • Predication: For simple IF-THEN-ELSE statements, the PTX assembler generates predicated instructions using 1-bit per-lane predicate registers.
  • Execution Efficiency: All SIMD Lanes execute the operations for a given path, but only those with an active predicate mask store the result. Divergent conditional execution therefore operates at reduced efficiency: 50% for an equal-length IF-THEN-ELSE, since both paths issue but each lane uses only one. If all mask bits uniformly evaluate to 0 or 1, the GPU skips the unused path entirely (see the sketch after this list).
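
A minimal CUDA sketch of a divergent IF-THEN-ELSE (the kernel and array names are assumptions); with equal-length paths, both sides issue under partial masks, which is where the 50% figure comes from:

    __global__ void select_sub(float *x, const float *y, const float *z, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Lanes where x[i] != 0 execute the THEN path while the others idle;
            // then the mask flips and the remaining lanes execute the ELSE path.
            if (x[i] != 0.0f)
                x[i] = x[i] - y[i];   // THEN: stored only by active lanes
            else
                x[i] = z[i];          // ELSE: stored only by the complementary lanes
        }
    }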

Memory

To support high-throughput parallel computation and mitigate the latency associated with off-chip access, GPUs utilize a heavily partitioned memory hierarchy.

  • Private Memory: Off-chip DRAM allocated individually to each SIMD Lane. It is utilized for stack frames, register spilling, and private variables. It is strictly unshared but cached in L1 and L2.
  • Local (Shared) Memory: On-chip SRAM assigned to each multithreaded SIMD Processor. It features low latency and high bandwidth (e.g., 128 bytes/clock). It is shared exclusively among SIMD Lanes within the same processor and is dynamically allocated to a Thread Block.
  • GPU (Global) Memory: Off-chip DRAM shared across the entire GPU, all Thread Blocks, and accessible by the system host.
  • Streaming Caches: While original GPUs relied strictly on multithreading to hide DRAM latency, modern architectures include caches to lower average latency and relieve register pressure. (The sketch after this list shows where each memory space appears in CUDA source.)
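
A short sketch mapping each memory space to CUDA source (the kernel, its names, and the 256-element tile are assumptions; the tile size must match the block size at launch):

    __device__ float g_scale = 1.0f;   // GPU (global) memory: visible to all Thread Blocks

    __global__ void smooth(const float *in, float *out, int n) {
        __shared__ float tile[256 + 2];    // local (shared) memory: one copy per Thread Block
        int i  = blockIdx.x * blockDim.x + threadIdx.x;
        int lt = threadIdx.x + 1;
        tile[lt] = (i < n) ? in[i] : 0.0f; // plain locals live in registers/private memory
        if (threadIdx.x == 0)
            tile[0] = (i > 0) ? in[i - 1] : 0.0f;             // left halo element
        if (threadIdx.x == blockDim.x - 1)
            tile[lt + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;    // right halo element
        __syncthreads();                   // barrier: the whole Thread Block sees the tile
        if (i < n)
            out[i] = g_scale * (tile[lt - 1] + tile[lt] + tile[lt + 1]) / 3.0f;
    }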

All Grids (vectorized loops) share GPU memory; local memory is shared by all threads of SIMD instructions within a Thread Block (the body of a vectorized loop); and private memory is private to a single CUDA Thread. Pascal allows preemption of a Grid, which requires that all local and private memory can be saved to and restored from global memory. For completeness, the GPU can also access CPU memory via the PCIe bus. This path is commonly used for a final result whose address is in host memory, since it eliminates a final copy from GPU memory to host memory.

Recent Innovations

Recent generations of GPUs have introduced aggressive microarchitectural enhancements to increase resource utilization, memory bandwidth, and system integration.

  • Dual SIMD Thread Schedulers: Advanced GPUs deploy multiple instruction dispatch units per SIMD Processor, allowing instructions from two independent threads to be issued simultaneously to different execution units (e.g., yielding 64 active lanes per cycle).
  • Enhanced Floating-Point Hardware: Dedicated support for fast single-, double-, and half-precision floating-point arithmetic.
  • High-Bandwidth Memory (HBM/HBM2): Integration of stacked memory utilizing ultra-wide buses (e.g., 4096 data wires) to provide peak bandwidths exceeding 700 GB/s.
  • NVLink Interconnect: A high-speed chip-to-chip interconnect providing up to 160 GB/s of bandwidth, mitigating PCIe bottlenecks and enabling seamless communication across multiple GPUs and host CPUs.
  • Unified Virtual Memory: Implementation of demand paging within a unified virtual address space that spans both the CPUs and GPUs in the system (see the sketch after this list).
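
A sketch of what demand paging across that unified address space permits, using the standard CUDA managed-allocation call (init_on_host is a hypothetical helper; saxpy is the kernel sketched earlier):

    int n = 1 << 20;                            // example size
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));   // one pointer, valid on host and device
    init_on_host(x, n);                         // CPU touches pages of x first...
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, x);  // ...GPU page-faults them over on demand
    cudaDeviceSynchronize();                    // after this, the CPU may touch x again
    cudaFree(x);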

Comparison with Vector and SIMD

vs. Vector Architectures

  • Processing Units: A GPU’s multithreaded SIMD Processor is functionally analogous to a vector processor, and SIMD Lanes act as vector lanes. However, GPUs utilize massive multithreading, which vector architectures generally omit.
  • Register Layout: Vector registers hold contiguous blocks of elements. In a GPU, a vector is distributed across the individual registers of the constituent SIMD Lanes.
  • Memory Access: Vector architectures utilize explicit unit-stride, non-unit stride, and gather-scatter instructions. GPUs implicitly rely on hardware Address Coalescing to optimize gather-scatter accesses into block transfers.
  • Control Flow: Vector architectures utilize a Control Processor to broadcast operations and compute addresses. GPUs distribute this responsibility between the Thread Block Scheduler and runtime address coalescing hardware.

vs. Multimedia SIMD

  • Similarities: Both leverage multiple SIMD lanes, utilize hardware multithreading to improve processor utilization, feature memory protection/demand paging, and utilize caching mechanisms.
  • Differences: Multimedia SIMD instructions execute on the host CPU in a tightly integrated manner. GPUs function as discrete multiprocessors across an I/O bus, possessing separate physical main memory, orders of magnitude more hardware threads, and native support for gather-scatter memory access.

Energy Efficiency

  • Data-level parallel architectures hold a fundamental power advantage rooted in the dynamic power equation: dynamic power scales with switched capacitance, clock frequency, and the square of supply voltage.
  • High performance and power efficiency are achieved by assuming, and then exploiting, ample data-level parallelism (DLP).
  • Parallel execution models therefore favor processing wide data paths at slower clock rates over narrow, high-frequency execution, optimizing the energy cost per operation (see the sketch below).
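
A back-of-the-envelope form of that argument, using the conventional CMOS dynamic power relation (the symbols are standard, not from the text above):

    P_dynamic ≈ a · C · V² · f    (a: activity factor, C: switched capacitance,
                                   V: supply voltage, f: clock frequency)

Throughput scales as lanes × f, so doubling the lanes while halving f holds throughput constant; the lower f typically permits a lower V, and the V² term then cuts the energy per operation.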

High Bandwidth Memory

  • Substantial memory bandwidth is a strict prerequisite for vector architectures to process diverse and complex memory access patterns.
    • Required access support includes unit stride, nonunit stride, and gather-scatter memory operations.
  • Maximum memory performance is achieved through the use of stacked DRAM technologies rather than conventional cache-based architectures.
    • These memory standards are classified as High Bandwidth Memory (HBM), encompassing iterations such as HBM, HBM2, HBM3, and HBM4.
    • Vertically stacked memory chips are integrated directly into the system packaging to supply extreme bandwidth.
  • HBM is the dominant memory architecture for top-end enterprise hardware, including discrete GPUs from AMD and NVIDIA, as well as the Intel Xeon Phi.
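
As a rough consistency check on the figures cited earlier (the per-wire rate is an assumption): 4096 data wires × ~1.4 Gbit/s per wire ÷ 8 bits/byte ≈ 717 GB/s, in line with the >700 GB/s HBM2 peak noted under Recent Innovations.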

Strided Access and TLBs

  • Strided memory accesses create disruptive performance interactions with the Translation Lookaside Buffer (TLB) during virtual memory resolution.
    • This translation bottleneck affects traditional vector architectures as well as modern GPUs, which natively utilize TLBs for memory mapping operations.
  • Specific alignments of TLB microarchitecture and target array sizes can induce worst-case translation thrashing.
    • In the most severe cases, the system will suffer exactly one TLB miss for every single element accessed within the target array.
  • Non-sequential strided accesses generate similar collision patterns within standard hardware caches, though the resulting performance degradation is generally less severe than the latency penalty incurred by continuous TLB misses.
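
A worked example of the worst case (all parameters assumed): with 4 KiB pages, an array walked with a 4096-byte stride touches a new page on every element; once the touched pages outnumber the TLB entries (say, a 64-entry TLB against several hundred pages), every element access incurs a TLB miss, the one-miss-per-element case described above.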

Tesla Versus Core i7

  • Processor Architecture and Capabilities:
    • Intel Core i7-960: Manufactured on a 45 nm process, containing 4 cores and 700 million transistors, running at 3.2 GHz with a 130W power envelope.
    • NVIDIA Tesla GTX 280: Manufactured on a 65 nm process, containing 30 Streaming Multiprocessors (SMs) and 1400 million transistors, running at 1.3 GHz with an identical 130W power envelope.
  • Peak Computational Throughput:
    • Single-Precision (SP) Floating-Point (FP): The GTX 280 peaks at 624 GFLOP/s, significantly outpacing the Core i7’s 85.33 GFLOP/s.
    • Double-Precision (DP) FP: The GTX 280 achieves 78 GFLOP/s, while the Core i7 reaches 42.66 GFLOP/s.
  • Memory Bandwidth:
    • The GTX 280 delivers 127 GB/s of measured Stream bandwidth, roughly 7.7 times the Core i7’s 16.4 GB/s.
  • Roofline Ridge Points:
    • The ridge point dictates the arithmetic intensity required to transition from memory-bound to compute-bound execution.
    • The GTX 280 DP ridge point sits at roughly 0.6 FLOP/byte, while the Core i7 DP ridge point sits at roughly 2.6 FLOP/byte (derivation after this list).
    • The lower ridge point of the GTX 280 ensures that peak computational performance can be achieved at substantially lower arithmetic intensities.
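
Both ridge points follow directly from the figures above, since the ridge point is simply peak throughput divided by peak bandwidth:

    Core i7 (DP): 42.66 GFLOP/s ÷ 16.4 GB/s ≈ 2.6 FLOP/byte
    GTX 280 (DP): 78 GFLOP/s ÷ 127 GB/s ≈ 0.6 FLOP/byte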

Workload Performance

  • Memory Bandwidth Constraints:
    • Applications with working sets spanning hundreds of megabytes (e.g., LBM and SAXPY) exceed Core i7 cache capacities, resulting in substantial speedups on the GTX 280 due to its raw bandwidth advantage.
    • Workloads processing large sparse matrices (SpMV) are constrained by DP FP limits rather than memory, restricting the GTX 280 to a modest advantage.
  • Compute Bandwidth Constraints:
    • Strictly compute-bound kernels (SGEMM, Conv, FFT, MC, Bilat) scale directly with the raw FLOP/s capabilities defined by the roofline, yielding proportional speedups on the GTX 280.
    • The Bilat kernel relies heavily on transcendental functions; the Core i7 spends 66% of its cycles on transcendentals, whereas the GTX 280 provides direct hardware support for these operations, resulting in a sizable speedup.
  • Cache Utilization and Blocking:
    • Aggressive cache blocking on the Core i7 prevents data-intensive kernels from hitting the memory bandwidth roofline.
    • Because of cache blocking, ray casting (RC) is only modestly faster on the GTX 280, and SGEMM, FFT, and SpMV are shifted into compute-bound territory on the Core i7.
    • The Sort kernel executes slower on the GTX 280 because the 1-bit split primitive requires significantly more instructions than a scalar sort operating entirely within the Core i7 cache.

Memory Addressing and Synchronization

  • Gather-Scatter Operations:
    • Multimedia SIMD extensions on the Core i7 require data to be aligned on 16-byte boundaries, severely penalizing scattered data layouts.
    • The GTX 280 implements native gather-scatter addressing, executing non-sequential memory accesses directly.
    • The GTX 280 Address Coalescing Unit and memory controller dynamically batch concurrent thread requests to identical DRAM lines or pages, minimizing gather-scatter latency.
    • The GJK kernel, which is highly dependent on scattered object data, achieves a large speedup on the GTX 280 specifically because of this native gather-scatter hardware.
  • Thread Synchronization and Atomics:
    • Throughput on synchronization-bound kernels (Hist) depends entirely on atomic memory updates (see the sketch after this list).
    • The Core i7 utilizes a dedicated hardware fetch-and-increment instruction, holding the GTX 280 to a narrow speedup.
    • Kernels requiring the resolution of independent constraint batches followed by barrier synchronization (Solv) heavily favor the Core i7, which executes them faster than the GTX 280.
    • The Core i7 relies on its strict memory consistency model and atomic instructions to maintain order, whereas the GTX 280 lacks this memory consistency model, forcing it to launch synchronization batches inefficiently from the system processor.
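
A minimal sketch of the atomic-update pattern behind a histogram-style (Hist) kernel (the kernel name and 8-bit key layout are assumptions):

    __global__ void hist(const unsigned char *keys, unsigned int *bins, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[keys[i]], 1u);   // hardware read-modify-write in GPU memory
    }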

Generational Evolution

  • CPU Enhancements (Intel Xeon Platinum 8180):
    • Resolved the lack of non-sequential memory access by integrating hardware gather instructions (AVX2) and scatter instructions (AVX-512) directly into the SIMD execution units.
    • Achieves a substantial aggregate performance improvement over the legacy Core i7-960.
  • GPU Enhancements (NVIDIA P100):
    • Resolved synchronization and caching deficits by adding unified cache hierarchies and fast atomic operations.
    • Improved the DP FP rate from 1/8 the speed of SP FP (on the GTX 280, per the 78 versus 624 GFLOP/s figures above) to 1/2 the speed of SP FP.
    • Achieves a many-fold performance improvement over the legacy GTX 280.
  • Comparative Scaling:
    • Despite the CPU enhancements, the modern P100 GPU maintains a consistent throughput advantage over the modern Xeon 8180 across the core throughput workloads.