Detecting and Enhancing Loop-Level Parallelism

1. Loop-Level Parallelism and Dependences

Loop-level parallelism targets concurrency across multiple loop iterations and is primarily analyzed at the source-code level. Exploiting this parallelism requires identifying and categorizing the memory access relationships between instructions.

  • Loop-Carried Dependence: Occurs when data accessed in a later iteration depends on a value produced in an earlier iteration. This relationship typically forces successive iterations to execute sequentially.
  • Intra-Loop Dependence: Occurs when a statement depends on a value computed by a preceding statement within the exact same iteration. This does not inherently prevent the parallel execution of distinct iterations.
  • Induction Variables: Loop-carried dependences explicitly tied to loop index variables (e.g., an index update such as i = i + 1 performed each iteration). These can be recognized and eliminated mechanically during loop unrolling.
  • Circular Dependence: A cyclic relationship (e.g., statement S1 depends on S1, or S1 depends on S2 which in turn depends on S1) that strictly prohibits parallel execution.
  • Acyclic Loop-Carried Dependence: A dependence where statement S1 depends on S2 from a previous iteration, but S2 does not depend on S1.
    • Because the dependence lacks a cycle, the statements form a partial ordering.
    • Parallelism can be exposed by extracting the initial and final overlapping elements outside the loop and interchanging the statements inside the loop body to eliminate the inter-iteration delay.
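As a minimal C sketch of this transformation (the arrays A, B, C, D and the function names are illustrative, not from the source), the classic two-statement loop can be rewritten by peeling the first read and the last write out of the loop and interchanging the statements:

```c
#include <stddef.h>

/* Original loop: S1 reads B[i], which S2 wrote in the PREVIOUS
 * iteration -- an acyclic loop-carried dependence of S1 on S2. */
void original(double *A, double *B, const double *C, const double *D, size_t n) {
    for (size_t i = 0; i < n; i++) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop: extract the initial S1 and final S2 outside the
 * loop and interchange the statements; each remaining iteration now
 * uses only values produced within that same iteration. */
void transformed(double *A, double *B, const double *C, const double *D, size_t n) {
    if (n == 0) return;
    A[0] = A[0] + B[0];
    for (size_t i = 0; i < n - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[n] = C[n - 1] + D[n - 1];
}
```

The loop body of `transformed` carries no inter-iteration dependence, so its iterations can execute in parallel.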

Identifying these distinct dependency structures requires rigorous mathematical analysis of the memory addresses generated across the entire loop execution.

2. Finding Dependences

Dependence analysis evaluates memory references to determine if multiple accesses map to identical addresses, which is complicated by arrays, pointers, and pass-by-reference parameters. To make this tractable, compilers assume array indices are affine functions of the loop index i.

  • Affine Indices: A one-dimensional array index is affine if it takes the form a × i + b, where a and b are constants.
  • Dependence Conditions: A dependence exists between a store instruction at index a × j + b and a load instruction at index c × k + d if:
    • Both iteration indices j and k fall within the defined loop bounds (m ≤ j ≤ n and m ≤ k ≤ n).
    • The effective addresses collide: a × j + b = c × k + d.
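These two conditions can be checked exactly by brute force when the loop bounds are known. The following C sketch (the function name exact_dependence is a hypothetical choice) enumerates all iteration pairs; it is exact but quadratic in the trip count, which is why compilers prefer cheaper conservative tests:

```c
#include <stdbool.h>

/* Exhaustively test whether a store to x[a*j + b] and a load from
 * x[c*k + d] can touch the same element for any iteration indices
 * j, k within the loop bounds m..n (inclusive). Exact but O(n^2). */
bool exact_dependence(long a, long b, long c, long d, long m, long n) {
    for (long j = m; j <= n; j++)
        for (long k = m; k <= n; k++)
            if (a * j + b == c * k + d)
                return true;
    return false;
}
```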

While exact dependence determination across all scenarios is NP-complete, common constant-coefficient instances allow for highly efficient compile-time testing.

  • The Greatest Common Divisor (GCD) Test:
    • A rapid heuristic used to determine if a loop-carried dependence is mathematically possible.
    • Evaluates the condition that GCD(c, a) must divide the difference (d − b); if it does not, the addresses can never collide.
    • Example: For the access pattern x[2*i + 3] = x[2*i] * 5.0, the parameters are a = 2, b = 3, c = 2, d = 0. GCD(2, 2) = 2, which does not divide d − b = −3, proving no dependence can exist.
    • Limitations: A failed divisibility check is sufficient to guarantee the absence of a dependence, but a successful check does not guarantee a dependence actually exists, because the test ignores the loop bounds.
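The test itself is a few lines of C. In this sketch, the function name gcd_test_may_depend and the convention of returning true when a dependence cannot be ruled out are illustrative assumptions:

```c
#include <stdbool.h>

/* Euclid's algorithm on absolute values. */
static long gcd(long x, long y) {
    if (x < 0) x = -x;
    if (y < 0) y = -y;
    while (y != 0) { long t = x % y; x = y; y = t; }
    return x;
}

/* GCD test for a store to x[a*i + b] and a load from x[c*i + d]:
 * a loop-carried dependence is possible only if GCD(c, a) divides
 * (d - b). Returns true when a dependence CANNOT be ruled out.
 * Assumes a and c are not both zero. */
bool gcd_test_may_depend(long a, long b, long c, long d) {
    return (d - b) % gcd(c, a) == 0;
}
```

Note that a true result is conservative: the loop bounds might still make the collision impossible.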

Once mathematical overlaps are identified, the resulting dependences must be classified to determine if they represent true data flow constraints or merely superficial naming collisions.

3. Classifying and Eliminating Dependences

Memory address collisions between instructions dictate the execution order, but not all collisions restrict parallelization. Dependences fall into three categories:

  • True Dependence (Read-After-Write, RAW): An instruction requires a value computed by a preceding instruction. This represents fundamental data flow and must be preserved.
  • Antidependence (Write-After-Read, WAR): An instruction overwrites a variable after it has been read by a preceding instruction.
  • Output Dependence (Write-After-Write, WAW): An instruction overwrites a variable that was already written by a preceding instruction.

Antidependences and output dependences are pseudo-dependences. They do not transmit data values; instead, they arise purely from the reuse of identical variable names or memory locations.

  • Renaming:
    • These pseudo-dependences can be safely eliminated by assigning new, unique variable names to the subsequent writes.
    • This transformation can be executed via compiler-level name substitution or dynamic register renaming in hardware.
    • By resolving name conflicts, renaming removes artificial sequential constraints and unlocks the overlapping execution of formerly blocked instructions.
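A small C sketch makes the effect of renaming concrete (the function names and the temporary t are illustrative):

```c
/* Before renaming: the second write to t creates an antidependence
 * (WAR) with the earlier read and an output dependence (WAW) with
 * the earlier write, forcing the two statement pairs to stay
 * ordered even though no data flows between them. */
void before_renaming(double *out1, double *out2,
                     double a, double b, double c, double d) {
    double t;
    t = a + b;        /* S1: write t               */
    *out1 = t * 2.0;  /* S2: read t (RAW on S1)    */
    t = c + d;        /* S3: write t (WAR with S2, WAW with S1) */
    *out2 = t * 3.0;  /* S4: read t (RAW on S3)    */
}

/* After renaming: the second computation gets its own name t2, so
 * only the true (RAW) dependences remain; S1/S2 and S3/S4 may now
 * execute in either order or in parallel. */
void after_renaming(double *out1, double *out2,
                    double a, double b, double c, double d) {
    double t1 = a + b;
    *out1 = t1 * 2.0;
    double t2 = c + d;
    *out2 = t2 * 3.0;
}
```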

While renaming efficiently resolves pseudo-dependences, true loop-carried dependences require fundamental structural transformations to safely extract parallel performance.

4. Eliminating Dependent Computations

True loop-carried data flow often manifests as recurrences, where a variable is defined based on its own value from a preceding iteration.

  • Reductions: A specific, highly common recurrence structure that aggregates a vector into a single scalar value (e.g., a dot product where sum = sum + x[i] * y[i]). Reductions frequently employ operators such as addition, max, or min.
  • Scalar Expansion:
    • A transformation that upgrades the scalar accumulator into a vector array, effectively severing the loop-carried dependence.
    • Example transformation: sum[i] = x[i] * y[i] allows the core computation to execute in total parallel across all elements.
  • Parallel Reduction:
    • Following scalar expansion, the expanded vector must be aggregated into the final scalar result.
    • The array is partitioned across processors. Each processor computes a partial reduction over its subset independently, followed by a brief sequential loop to aggregate the final values. Vector and SIMD architectures frequently feature dedicated hardware to accelerate this specific step.
  • Associativity Constraints:
    • Parallel reduction fundamentally relies on the mathematical associativity of the reduction operator.
    • However, machine-level floating-point arithmetic (due to limited precision) and integer arithmetic (due to bounded range and overflow) are not strictly associative.
    • Consequently, compilers will generally refuse to automatically apply restructuring techniques to recurrences unless explicitly authorized by the programmer to rely on associativity.
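The full pipeline for a dot product can be sketched in C. Here a fixed number of partial accumulators stands in for processors; the constant P and the function names are assumptions for illustration, and because floating-point addition is not associative, the parallel version may differ from the sequential one in the last bits for general inputs:

```c
#include <stddef.h>

/* Sequential dot product: sum = sum + x[i]*y[i] is a loop-carried
 * recurrence on sum. */
double dot_sequential(const double *x, const double *y, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum = sum + x[i] * y[i];
    return sum;
}

#define P 4  /* assumed processor count for this sketch */

/* Scalar expansion + parallel reduction: each "processor" p reduces
 * its own stride-P subset into partial[p] independently (phase 1 is
 * parallelizable); a brief sequential loop then aggregates the P
 * partial sums (phase 2). */
double dot_parallel_reduction(const double *x, const double *y, size_t n) {
    double partial[P] = {0.0};
    for (size_t p = 0; p < P; p++)          /* phase 1: independent */
        for (size_t i = p; i < n; i += P)
            partial[p] += x[i] * y[i];
    double sum = 0.0;
    for (size_t p = 0; p < P; p++)          /* phase 2: sequential  */
        sum += partial[p];
    return sum;
}
```

Reordering the additions like this is exactly the associativity assumption the compiler will not make on its own, which is why such transformations typically require explicit programmer authorization (e.g., a relaxed floating-point mode).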