Loop-Level Parallelism
Loop-level parallelism targets concurrency across multiple loop iterations. Exploiting it requires identifying whether data accessed in a later iteration depends on a value produced in an earlier one — a loop-carried dependence. Dependences within a single iteration do not prevent parallelism across iterations.
Consider:
for (i=0; i<100; i=i+1) {
A[i+1] = A[i] + C[i]; // S1
B[i+1] = B[i] + A[i+1]; // S2
}
Two dependences exist. S1 reads A[i], which S1 itself wrote in the previous iteration; this dependence is loop-carried and forces iterations to execute in series. S2 reads A[i+1], written by S1 in the same iteration; this is intra-iteration and does not prevent parallel execution of different iterations (as long as S1 runs before S2 within each iteration).
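By contrast, a loop whose dependences are all intra-iteration is fully parallel across iterations. The following sketch is illustrative, not part of the original example:
for (i=0; i<100; i=i+1) {
A[i] = A[i] + B[i]; // S1: no iteration touches another iteration's elements
C[i] = 2 * A[i]; // S2: depends on S1, but only within the same iteration
}
Every iteration reads and writes only its own elements, so all 100 iterations could execute simultaneously.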
A loop-carried dependence does not always prevent parallelism. If the loop's dependences do not form a cycle, the loop can still be parallelized after restructuring:
for (i=0; i<100; i=i+1) {
A[i] = A[i] + B[i]; // S1
B[i+1] = C[i] + D[i]; // S2
}
S1 depends on S2 from the previous iteration, but S2 does not depend on S1, so there is no cycle. The loop can be rewritten to expose parallelism by peeling the first execution of S1 and the last execution of S2 out of the loop and interchanging the two statements:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
B[i+1] = C[i] + D[i];
A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];
The dependence is no longer loop-carried; iterations can overlap provided statement order within each iteration is preserved.
Finding Dependences
Compilers assume array indices are affine, that is, expressible as a × i + b for constants a and b, where i is the loop index. A dependence exists between a store to element a × j + b and a later load from element c × k + d if two iteration indices j and k both lie within the loop bounds and a × j + b = c × k + d.
Exact dependence determination is NP-complete in general, but the GCD test handles common constant cases cheaply: a loop-carried dependence is possible only if GCD(c, a) divides d − b. For X[2*i+3] = X[2*i] * 5.0, we have a = 2, b = 3, c = 2, and d = 0; GCD(a, c) = 2 does not divide d − b = −3, so no dependence exists. When the GCD test fails, absence of a dependence is guaranteed; a passing test does not guarantee a dependence exists, because it ignores the loop bounds.
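The test is mechanical enough to state in a few lines of C. This is a minimal sketch under the assumptions above (store index a × j + b, load index c × k + d, with a and c nonzero); gcd_test is an illustrative helper, not an actual compiler interface:
int gcd(int m, int n) {
while (n != 0) { int t = m % n; m = n; n = t; }
return m;
}
/* Returns 1 if a loop-carried dependence may exist, 0 if none is
possible. A result of 1 is not a guarantee: the test ignores
the loop bounds. */
int gcd_test(int a, int b, int c, int d) {
int g = gcd(a < 0 ? -a : a, c < 0 ? -c : c); /* assumes a, c nonzero */
return (d - b) % g == 0;
}
For the example above, gcd_test(2, 3, 2, 0) returns 0: no dependence.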
Eliminating Dependences
Not all dependences represent true data flow. Three types exist:
- True dependence (RAW): A value written by one instruction is read by a later one. Must be preserved.
- Antidependence (WAR): A later instruction writes a location that an earlier one reads.
- Output dependence (WAW): Two instructions write the same location.
Antidependences and output dependences are false — they arise from name reuse, not actual data flow, and can be eliminated by renaming. Example:
for (i=0; i<100; i=i+1) {
Y[i] = X[i] / c; // S1
X[i] = X[i] + c; // S2 — antidependence on X with S1
Z[i] = Y[i] + c; // S3 — antidependence on Y with S4
Y[i] = c - Y[i]; // S4 — output dependence on Y with S1
}
After renaming the Y written by S1 to T and the X written by S2 to X1:
for (i=0; i<100; i=i+1) {
T[i] = X[i] / c;
X1[i] = X[i] + c;
Z[i] = T[i] + c;
Y[i] = c - T[i];
}
The false dependences are gone; only the true dependence from S1 (the value T[i] read by S3 and S4) remains, and it is intra-iteration, so the statements can be scheduled freely and the iterations run in parallel.
Recurrences and Reductions
A recurrence is a loop-carried true dependence where a variable is defined from its own value in a prior iteration:
for (i=1; i<100; i=i+1)
Y[i] = Y[i-1] + Y[i];
A common and important recurrence is a reduction, which aggregates a vector into a scalar:
for (i=9999; i>=0; i=i-1)
sum = sum + x[i] * y[i];
This has a loop-carried dependence on sum. It can be parallelized through scalar expansion — promoting sum to a vector, making the first loop fully parallel, then reducing:
for (i=9999; i>=0; i=i-1)
sum[i] = x[i] * y[i]; // fully parallel
for (i=9999; i>=0; i=i-1)
finalsum = finalsum + sum[i];
The second loop has a specific reduction structure that hardware (vector units, SIMD) can accelerate. In practice, multiple processors each reduce a partition independently, then a short scalar loop combines the partial sums.
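A minimal sketch of that partitioned approach, written as plain sequential C for clarity; NPROC and partial are illustrative names, the sketch assumes NPROC divides 10000 evenly, and in a real system each outer-loop iteration would run on its own processor:
double partial[NPROC]; // one partial sum per processor
for (p = 0; p < NPROC; p = p + 1) {
partial[p] = 0.0;
for (i = p * (10000 / NPROC); i < (p + 1) * (10000 / NPROC); i = i + 1)
partial[p] = partial[p] + x[i] * y[i]; // each partition reduced independently
}
finalsum = 0.0;
for (p = 0; p < NPROC; p = p + 1) // short scalar loop combines the partial sums
finalsum = finalsum + partial[p];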
This transformation relies on associativity of addition. Floating-point arithmetic is not strictly associative, so compilers require the programmer to explicitly enable such restructuring.
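OpenMP's reduction clause is one standard way to grant that permission explicitly; a sketch, assuming the x and y arrays from the reduction above:
sum = 0.0;
#pragma omp parallel for reduction(+: sum)
for (i = 0; i < 10000; i = i + 1)
sum = sum + x[i] * y[i];
// Each thread accumulates a private copy of sum; OpenMP combines the
// copies at the end, implicitly reordering the floating-point additions.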