Loop-Level Parallelism
Loop-level parallelism targets concurrency across multiple loop iterations. Exploiting it requires identifying whether data accessed in a later iteration depends on a value produced in an earlier one — a loop-carried dependence. Dependences within a single iteration do not prevent parallelism across iterations.
Consider:
for (i=0; i<100; i=i+1) {
A[i+1] = A[i] + C[i]; // S1
B[i+1] = B[i] + A[i+1]; // S2
}
Two dependences exist. S1 reads A[i], which S1 itself wrote in the previous iteration; this dependence is loop-carried and forces iterations to execute in series. S2 reads A[i+1], written by S1 in the same iteration; this is intra-iteration and does not prevent parallel execution of different iterations (as long as S1 runs before S2 within each iteration).
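By contrast, a loop whose dependences are all intra-iteration is fully parallel across iterations. The following sketch is illustrative, not part of the original example:
for (i=0; i<100; i=i+1) {
A[i] = A[i] + B[i]; // S1: no iteration touches another iteration's elements
C[i] = 2 * A[i]; // S2: depends on S1, but only within the same iteration
}
Every iteration reads and writes only its own elements, so all 100 iterations could execute simultaneously.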
A loop-carried dependence does not always prevent parallelism. If the loop's dependences do not form a cycle, the loop can still be parallelized after restructuring:
for (i=0; i<100; i=i+1) {
A[i] = A[i] + B[i]; // S1
B[i+1] = C[i] + D[i]; // S2
}
S1 depends on S2 from the previous iteration, but S2 does not depend on S1, so there is no cycle. The loop can be rewritten to expose parallelism by peeling the first execution of S1 and the last execution of S2 out of the loop and interchanging the two statements:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
B[i+1] = C[i] + D[i];
A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];
The dependence is no longer loop-carried; iterations can overlap provided statement order within each iteration is preserved.
Finding Dependences
Compilers assume array indices are affine, that is, expressible as a × i + b for constants a and b, where i is the loop index. A dependence exists between a store to element a × j + b and a later load from element c × k + d if two iteration indices j and k both lie within the loop bounds and a × j + b = c × k + d.
Exact dependence determination is NP-complete in general, but the GCD test handles common constant cases cheaply: a loop-carried dependence is possible only if GCD(c, a) divides d − b. For X[2*i+3] = X[2*i] * 5.0, we have a = 2, b = 3, c = 2, and d = 0; GCD(a, c) = 2 does not divide d − b = −3, so no dependence exists. When the GCD test fails, absence of a dependence is guaranteed; a passing test does not guarantee a dependence exists, because it ignores the loop bounds.
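The test is mechanical enough to state in a few lines of C. This is a minimal sketch under the assumptions above (store index a × j + b, load index c × k + d, with a and c nonzero); gcd_test is an illustrative helper, not an actual compiler interface:
int gcd(int m, int n) {
while (n != 0) { int t = m % n; m = n; n = t; }
return m;
}
/* Returns 1 if a loop-carried dependence may exist, 0 if none is
possible. A result of 1 is not a guarantee: the test ignores
the loop bounds. */
int gcd_test(int a, int b, int c, int d) {
int g = gcd(a < 0 ? -a : a, c < 0 ? -c : c); /* assumes a, c nonzero */
return (d - b) % g == 0;
}
For the example above, gcd_test(2, 3, 2, 0) returns 0: no dependence.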
Eliminating Dependences
Not all dependences represent true data flow. Three types exist:
- True dependence (RAW): A value written by one instruction is read by a later one. Must be preserved.
- Antidependence (WAR): A later instruction writes a location that an earlier one reads.
- Output dependence (WAW): Two instructions write the same location.
Antidependences and output dependences are false — they arise from name reuse, not actual data flow, and can be eliminated by renaming. Example:
for (i=0; i<100; i=i+1) {
Y[i] = X[i] / c; // S1
X[i] = X[i] + c; // S2 — antidependence on X with S1
Z[i] = Y[i] + c; // S3 — antidependence on Y with S4
Y[i] = c - Y[i]; // S4 — output dependence on Y with S1
}
After renaming the Y written by S1 to T and the X written by S2 to X1:
for (i=0; i<100; i=i+1) {
T[i] = X[i] / c;
X1[i] = X[i] + c;
Z[i] = T[i] + c;
Y[i] = c - T[i];
}
The false dependences are gone; only the true dependence from S1 (the value T[i] read by S3 and S4) remains, and it is intra-iteration, so the statements can be scheduled freely and the iterations run in parallel.
Recurrences and Reductions
A recurrence is a loop-carried true dependence where a variable is defined from its own value in a prior iteration:
for (i=1; i<100; i=i+1)
Y[i] = Y[i-1] + Y[i];
A common and important recurrence is a reduction, which aggregates a vector into a scalar:
for (i=9999; i>=0; i=i-1)
sum = sum + x[i] * y[i];
This has a loop-carried dependence on sum. It can be parallelized through scalar expansion — promoting sum to a vector, making the first loop fully parallel, then reducing:
for (i=9999; i>=0; i=i-1)
sum[i] = x[i] * y[i]; // fully parallel
for (i=9999; i>=0; i=i-1)
finalsum = finalsum + sum[i];
The second loop has a specific reduction structure that hardware (vector units, SIMD) can accelerate. In practice, multiple processors each reduce a partition independently, then a short scalar loop combines the partial sums.
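A minimal sketch of that partitioned approach, written as plain sequential C for clarity; NPROC and partial are illustrative names, the sketch assumes NPROC divides 10000 evenly, and in a real system each outer-loop iteration would run on its own processor:
double partial[NPROC]; // one partial sum per processor
for (p = 0; p < NPROC; p = p + 1) {
partial[p] = 0.0;
for (i = p * (10000 / NPROC); i < (p + 1) * (10000 / NPROC); i = i + 1)
partial[p] = partial[p] + x[i] * y[i]; // each partition reduced independently
}
finalsum = 0.0;
for (p = 0; p < NPROC; p = p + 1) // short scalar loop combines the partial sums
finalsum = finalsum + partial[p];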
This transformation relies on associativity of addition. Floating-point arithmetic is not strictly associative, so compilers require the programmer to explicitly enable such restructuring.
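OpenMP's reduction clause is one standard way to grant that permission explicitly; a sketch, assuming the x and y arrays from the reduction above:
sum = 0.0;
#pragma omp parallel for reduction(+: sum)
for (i = 0; i < 10000; i = i + 1)
sum = sum + x[i] * y[i];
// Each thread accumulates a private copy of sum; OpenMP combines the
// copies at the end, implicitly reordering the floating-point additions.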