Multiprocessing
Multiprocessing has become the primary mechanism to scale computational performance due to physical limitations in scaling single-core architectures.
- Key architectural drivers:
- Inefficiency of ILP: There are severe diminishing returns in silicon area and energy efficiency when attempting to extract further Instruction-Level Parallelism (ILP).
- Power constraints: The end of Dennard scaling imposes strict thermal and power limits, forcing a shift from fast, complex uniprocessors to multiple efficient cores.
- Workload shifts: The computing landscape is increasingly dominated by cloud computing, software-as-a-service, and data-intensive applications operating on massive datasets.
- Design leverage: Multiprocessors offer high cost-performance by replicating commodity processor cores rather than engineering unique monolithic designs.
- Thread-Level Parallelism (TLP):
- TLP relies on the existence of multiple program counters, allowing several independent instruction streams to execute concurrently.
- It is implemented primarily through Multiple Instruction, Multiple Data (MIMD) architectures.
To exploit TLP effectively, independent processor cores must be organized into cohesive systems that share data and memory structures.
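As an illustrative sketch (the task and names are hypothetical, not from the text), each thread below is a separate instruction stream with its own program counter operating on its own data, which is the essence of the MIMD model. Note that CPython's global interpreter lock serializes these threads; realizing true TLP requires the streams to run on separate cores.

```python
import threading

# Each worker is an independent instruction stream (its own program
# counter) operating on its own slice of data: MIMD-style execution.
def worker(data, results, idx):
    results[idx] = sum(x * x for x in data)

chunks = [range(0, 100), range(100, 200), range(200, 300)]
results = [0] * len(chunks)

threads = [threading.Thread(target=worker, args=(chunk, results, i))
           for i, chunk in enumerate(chunks)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Combine the independent partial results into the final answer.
total = sum(results)
```

The same structure, with processes or native threads pinned to distinct cores, is how a tightly coupled parallel-processing workload maps onto a shared-memory multiprocessor.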
Memory Organization
Shared-memory multiprocessors are computers comprising tightly coupled processors that are managed by a single operating system and communicate through a unified shared address space.
- TLP Execution Models:
- Parallel processing: A tightly coupled set of threads collaborating on a single unified task.
- Request-level parallelism: Multiple independent processes or applications running concurrently, often driven by separate user queries (multiprogramming).
- Grain Size:
- Grain size defines the amount of computation assigned to a single thread.
- TLP threads execute thousands to billions of instructions, operating at a much coarser granularity than ILP.
- While threads can be used to exploit fine-grained data-level parallelism, the management overhead is prohibitively expensive compared to using dedicated SIMD processors.
- Shared-Memory Topologies:
- Uniform Memory Access (UMA): All processors experience identical memory access latency. Modern UMA architectures replace legacy shared buses with a shared Last Level Cache (LLC) connected to private caches via an on-chip interconnection network.
Modern multicores use two levels of private cache and a shared (sometimes non-inclusive) L3, sliced into multiple banks each associated with one or two cores. The legacy shared bus is replaced by an interconnection network, often multistage and indirect for larger designs, requiring multiple hops. Memory and I/O are accessed through the same network. Even UMA designs may exhibit nonuniform cache access (NUCA), since the time to reach a given LLC bank varies with core location.

- Nonuniform Memory Access (NUMA): Utilizes Distributed Shared Memory (DSM) spanning multiple chips or nodes. Access latency is highly dependent on the physical distance between the requesting core and the target physical memory. NUMA systems are often combined with Nonuniform Cache Access (NUCA) to distribute the LLC.
A distributed-memory multiprocessor node consists of a multicore chip with a shared LLC, local memory, I/O, and an interface to the interconnection network that links all nodes. Every core can address the full memory space, but local memory is significantly faster than remote memory. Most such designs also exhibit NUCA.
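A back-of-the-envelope model shows why data placement matters so much on NUMA hardware: average memory latency is a weighted mix of local and remote access times. The latency figures below are illustrative assumptions, not values from the text.

```python
def avg_memory_latency_ns(local_fraction, local_ns=100.0, remote_ns=300.0):
    """Average latency when local_fraction of accesses hit local memory.

    The 100 ns local / 300 ns remote figures are illustrative assumptions.
    """
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns

mostly_local = avg_memory_latency_ns(0.95)   # good placement -> 110.0 ns
mostly_remote = avg_memory_latency_ns(0.50)  # poor placement -> 200.0 ns
```

Under these assumed latencies, improving locality from 50% to 95% nearly halves the average access time, which is why NUMA-aware allocation and thread pinning pay off.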

Performance Challenges
Extracting performance from interacting parallel threads requires overcoming strict limits in available parallelism and physical communication delays.
- Limited Parallelism (Amdahl’s Law):
- The performance gain of a multiprocessor is strictly limited by the fraction of the computation that must execute sequentially.
- Theoretical speedup is governed by the equation:

  $$\text{Speedup} = \frac{1}{\dfrac{\text{Fraction}_{\text{parallel}}}{N} + \left(1 - \text{Fraction}_{\text{parallel}}\right)}$$

  where $N$ is the number of processors.
- Achieving high speedup at scale requires near-zero sequential execution. For example, achieving a speedup of 80 on 100 processors dictates that 99.75% of the execution must be completely parallel, leaving only 0.25% for the serial portion.
- Communication Latency:
- Remote memory accesses incur severe delays due to physical distance and interconnect routing overhead.
- Communication between distinct cores on the same chip costs roughly 35 to 50 clock cycles, whereas communication across separate chips can require 100 to 500 clock cycles or more.
- These latencies devastate pipeline efficiency. If a processor with a base CPI of 0.5 and a 4 GHz clock (0.25 ns cycle) encounters a 100 ns remote memory delay (400 penalty cycles), an instruction stream with just 0.2% remote accesses will see its effective CPI degrade from 0.5 to 0.5 + 0.2% × 400 = 1.3.
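Amdahl's law is easy to check numerically. Below is a minimal sketch (the function names are mine, not from the text) that evaluates the speedup formula and inverts it to find the parallel fraction a target speedup requires; the 80×-speedup-on-100-processors figures follow the classic textbook example.

```python
def speedup(frac_parallel, n):
    """Amdahl's law: the serial part runs at 1x; the parallel part splits over n cores."""
    return 1.0 / (frac_parallel / n + (1.0 - frac_parallel))

def required_parallel_fraction(target_speedup, n):
    """Invert Amdahl's law: target = 1 / (f/n + (1 - f))  =>  f = (1 - 1/target) / (1 - 1/n)."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n)

f = required_parallel_fraction(80, 100)  # ~0.9975: 99.75% must be parallel
```

Even shrinking the serial portion from 0.25% to 1% drops the achievable speedup on 100 processors from 80 to about 50, which is why the serial fraction dominates design effort at scale.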
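The CPI impact of remote accesses can be modeled the same way. The figures below (base CPI 0.5, 4 GHz clock, 100 ns remote delay, 0.2% remote-access rate) follow the classic textbook example and should be read as assumptions for the sketch, not measurements.

```python
def effective_cpi(base_cpi, remote_fraction, remote_penalty_cycles):
    """Effective CPI when remote_fraction of instructions stall for a remote access."""
    return base_cpi + remote_fraction * remote_penalty_cycles

# A 100 ns remote delay on a 4 GHz clock (0.25 ns/cycle) costs 400 cycles.
penalty_cycles = 100e-9 / 0.25e-9

cpi = effective_cpi(0.5, 0.002, penalty_cycles)  # ~1.3: a 2.6x slowdown
```

The striking part is the leverage: only 1 instruction in 500 touches remote memory, yet the machine runs at less than half its base throughput.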
System designers employ a mix of architectural features and software optimizations to reduce the impact of remote communication latency and insufficient parallelism.
- Software Interventions:
- Design and implement new algorithms that offer superior parallel scaling.
- Restructure data layouts to maximize local memory accesses and minimize the frequency of remote communication.
- Hardware Interventions:
- Caching: Store shared data in local caches to dramatically reduce the frequency of remote memory requests.
- Multithreading: Rapidly interleave the execution of multiple threads on a single core to tolerate and hide communication latency.
- Prefetching: Fetch data into local caches prior to explicit demand to mask the latency of data retrieval.
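To make the data-layout intervention concrete, here is a hypothetical sketch: traversing a 2-D array in the order it is laid out in memory turns strided, cache-hostile accesses into sequential, cache-friendly ones. Both loops compute the same sum; only the access pattern, and hence locality, differs. Plain Python lists only approximate a contiguous row-major layout (in C or NumPy the row-order loop is dramatically faster), so treat this as an illustration of the restructuring rather than a benchmark.

```python
ROWS, COLS = 256, 256
# Row-major layout: matrix[r][c] sits next to matrix[r][c + 1] in memory.
matrix = [[r * COLS + c for c in range(COLS)] for r in range(ROWS)]

def sum_column_order(m):
    """Strided traversal: consecutive accesses jump an entire row apart."""
    return sum(m[r][c] for c in range(COLS) for r in range(ROWS))

def sum_row_order(m):
    """Sequential traversal: consecutive accesses are adjacent in memory."""
    return sum(m[r][c] for r in range(ROWS) for c in range(COLS))
```

Swapping the loop nest is the simplest instance of restructuring data access to maximize local (cached) references; tiling and per-node data partitioning extend the same idea to NUMA systems.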