Cross-Cutting Issues in Data-Level Parallelism
Energy and Data-Level Parallelism: Slow and Wide Versus Fast and Narrow
- Data-level parallel architectures have a fundamental energy advantage that follows directly from the basic equations for dynamic power in CMOS: energy per operation scales with the square of the supply voltage, and a lower clock rate permits a lower voltage.
- High performance and power efficiency are achieved by assuming and exploiting ample data-level parallelism (DLP).
- Given ample DLP, executing many operations in parallel on a wide datapath at a slow clock therefore costs less energy per operation than executing them serially on a narrow datapath at a fast clock.
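The bullets above can be made concrete with a minimal sketch. The voltage and capacitance values below are illustrative assumptions, not measured figures; the point is only that the quadratic voltage term dominates:

```python
# Hypothetical sketch: dynamic energy per operation for two designs with
# equal throughput. Uses E_op = C * V^2 (dynamic CMOS switching energy);
# all numeric values are illustrative assumptions.

def energy_per_op(capacitance, voltage):
    """Dynamic switching energy per operation: E = C * V^2 (arbitrary units)."""
    return capacitance * voltage ** 2

# Fast-and-narrow: 1 lane at 2 GHz, which we assume needs a 1.2 V supply.
fast_narrow = energy_per_op(capacitance=1.0, voltage=1.2)

# Slow-and-wide: 2 lanes at 1 GHz deliver the same ops/sec, and the lower
# clock lets the supply voltage drop (assumed 0.9 V here).
slow_wide = energy_per_op(capacitance=1.0, voltage=0.9)

print(f"fast/narrow E_op: {fast_narrow:.2f}")
print(f"slow/wide   E_op: {slow_wide:.2f}")
print(f"energy saved per op: {1 - slow_wide / fast_narrow:.0%}")
```

Even though the wide design doubles the hardware, each operation completes at the lower voltage, so total energy for the same work falls.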
Sustaining the energy-efficient computational throughput of wide DLP hardware requires commensurate data delivery capabilities, directly driving physical innovations in memory design.
Banked Memory and High Bandwidth Memory
- Substantial memory bandwidth is a strict prerequisite for vector architectures to process diverse and complex memory access patterns.
- Required access support includes unit stride, nonunit stride, and gather-scatter memory operations.
- Peak memory bandwidth is delivered by stacked DRAM rather than by relying on conventional caches, which serve nonunit-stride and gather-scatter patterns poorly.
- These memory standards are classified as High Bandwidth Memory (HBM), encompassing iterations such as HBM, HBM2, HBM3, and HBM4.
- Memory chips are stacked vertically and integrated in the same package as the processor to supply extreme bandwidth.
- HBM is the dominant memory architecture for top-end enterprise hardware, including discrete GPUs from AMD and NVIDIA, as well as the Intel Xeon Phi.
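Why bandwidth depends on the access pattern can be shown with a minimal sketch of word-interleaved banking, a common organization in vector memory systems. The bank count and mapping rule (`bank = word address mod number of banks`) are assumptions for illustration:

```python
# Hypothetical sketch: which memory banks a strided vector access touches,
# assuming word-interleaved banks (bank = word_address % num_banks).
# The 8-bank configuration is an illustrative assumption.

def banks_touched(start, stride, count, num_banks=8):
    """Return the bank index hit by each element of a strided access."""
    return [(start + i * stride) % num_banks for i in range(count)]

# Unit stride spreads across all 8 banks -> full parallel bandwidth.
print(banks_touched(0, 1, 8))   # [0, 1, 2, 3, 4, 5, 6, 7]

# A stride equal to the bank count hits a single bank -> fully serialized.
print(banks_touched(0, 8, 8))   # [0, 0, 0, 0, 0, 0, 0, 0]

# Gather-scatter: the banks hit depend entirely on the index vector.
indices = [3, 17, 4, 25, 9]
print([i % 8 for i in indices])  # [3, 1, 4, 1, 1]
```

Unit stride keeps every bank busy, while an unlucky stride concentrates all traffic on one bank, which is why vector memory systems need many banks and why gather-scatter performance is data dependent.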
While stacked memory provides the raw bandwidth necessary for non-sequential data loads, these scattered and strided access patterns introduce severe bottlenecks in the hardware address translation layer.
Strided Accesses and TLB Misses
- Strided memory accesses create disruptive performance interactions with the Translation Lookaside Buffer (TLB) during virtual memory resolution.
- This translation bottleneck affects traditional vector architectures as well as modern GPUs, which natively utilize TLBs for memory mapping operations.
- Depending on the TLB's size and organization and on the array's size and stride, a strided access pattern can thrash the TLB.
- In the most severe cases, the system will suffer exactly one TLB miss for every single element accessed within the target array.
- Strided accesses generate similar conflict patterns in hardware caches, but the resulting degradation is generally milder, since a cache miss costs far less than a TLB miss and the page walk it triggers.
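The worst case described above can be demonstrated with a minimal sketch: a small fully associative TLB with LRU replacement (the 4-entry capacity and 4 KiB page size are illustrative assumptions). When the stride equals the page size and the array spans more pages than the TLB holds, every element access misses:

```python
from collections import OrderedDict

# Hypothetical sketch: counting TLB misses for an access pattern, assuming
# a small fully associative TLB with LRU replacement (sizes illustrative).

def tlb_misses(addresses, entries=4, page_size=4096):
    """Return the number of TLB misses incurred by the given byte addresses."""
    tlb = OrderedDict()                     # page number -> present, in LRU order
    misses = 0
    for addr in addresses:
        page = addr // page_size
        if page in tlb:
            tlb.move_to_end(page)           # hit: refresh LRU position
        else:
            misses += 1                     # miss: install the translation
            tlb[page] = True
            if len(tlb) > entries:
                tlb.popitem(last=False)     # evict the least recently used
    return misses

PAGE = 4096
# Unit-stride sweep of 64 eight-byte words: all fit in one page -> 1 miss.
unit = [i * 8 for i in range(64)]
# Page-sized stride across more pages than TLB entries: 1 miss per element.
strided = [i * PAGE for i in range(64)]

print(tlb_misses(unit))     # 1
print(tlb_misses(strided))  # 64
```

The second pattern is exactly the worst case in the bullets above: each element lands on a fresh page, the TLB never retains a useful translation, and every access pays a full miss.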