Loop Unrolling

Loop unrolling replicates the loop body multiple times, adjusting the termination code, to expose more ILP across a larger straight-line code region. It raises the proportion of useful data operations relative to branch and loop-maintenance overhead.
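As a concrete sketch, here is the array loop analyzed under Pipeline Scheduling below, unrolled by a factor of 4 at the C level. This assumes the trip count (1000) is a multiple of the unroll factor; otherwise the compiler must emit cleanup code for the leftover iterations:

for (i=999; i>=0; i=i-4) {
    x[i]   = x[i]   + s;    /* four copies of the body,     */
    x[i-1] = x[i-1] + s;    /* but only one loop test and   */
    x[i-2] = x[i-2] + s;    /* one induction-variable update */
    x[i-3] = x[i-3] + s;    /* per four elements            */
}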

To schedule an unrolled loop effectively, the compiler must:

  • Prove iteration independence: Establish that iterations are independent aside from the loop-maintenance code, so copies of the body may be overlapped.
  • Rename registers: Reusing the same registers across unrolled iterations creates antidependences and output dependences that force sequential execution; assigning distinct registers to each copy removes these constraints (see the sketch after this list).
  • Symbolic substitution: Fold constants and compute per-iteration address offsets statically, collapsing multiple pointer updates into one.
  • Memory disambiguation: Verify that loads and stores from different iterations do not alias.
  • Code scheduling: Interleave independent loads and stores to fill latency slots.
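To see why renaming matters, consider unrolling the RISC-V loop from the next section while reusing the same registers in every copy (a sketch; the register names mirror the example below):

Loop: fld    f0, 0(x1)
      fadd.d f4, f0, f2
      fsd    f4, 0(x1)
      fld    f0, -8(x1)     // antidependence: the fadd.d above still needs f0
      fadd.d f4, f0, f2     // antidependence on f4 (the fsd reads it), plus an
      fsd    f4, -8(x1)     //   output dependence on the first fadd.d
      // (two more copies, then the addi and bne)

Renaming the second copy to write, say, f6 and f8 removes these constraints, so its fld can issue while the first copy is still in flight.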

Applying all of these to a 4x unrolling of the loop analyzed under Pipeline Scheduling below eliminates three branches and three pointer decrements and lets the loads from all four iterations be hoisted above the adds. The scheduled result, sketched below, runs in 14 cycles with zero stalls — 3.5 cycles per element, down from 6.5 for the unrolled but unscheduled version (14 instruction issues plus 12 stall cycles = 26 cycles for four elements) and 7 for the original scheduled loop. The gain from scheduling is larger on the unrolled loop because unrolling exposes more independent instructions.
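A reconstruction of that transformed loop, consistent with the counts above (the particular renamed registers f6–f16 are illustrative choices):

Loop: fld    f0, 0(x1)
      fld    f6, -8(x1)
      fld    f10, -16(x1)
      fld    f14, -24(x1)
      fadd.d f4, f0, f2
      fadd.d f8, f6, f2
      fadd.d f12, f10, f2
      fadd.d f16, f14, f2
      fsd    f4, 0(x1)
      fsd    f8, -8(x1)
      fsd    f12, -16(x1)
      addi   x1, x1, -32    // one pointer update for four elements
      fsd    f16, 8(x1)     // 8 - 32 = -24: offset adjusted for the addi
      bne    x1, x2, Loop

Each fadd.d issues at least two cycles after its fld, and each fsd at least three cycles after its fadd.d, so the 14 instructions issue in 14 consecutive cycles.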

Limits to further unrolling:

  • Diminishing returns: The incremental benefit shrinks as the unroll factor grows — for the loop below, unrolling 4x cuts the two overhead instructions (addi, bne) from 2 cycles per element to 0.5, while going to 8x saves only another 0.25.
  • Code size: Aggressive unrolling expands the program footprint, potentially causing instruction cache misses.
  • Register pressure: More live values compete for registers; spilling to memory can erase the gains.

Pipeline Scheduling

To keep a pipeline full, a compiler must find sequences of independent instructions that can be overlapped during execution.

  • To avoid a stall, a dependent instruction must be separated from its source by a distance in clock cycles equal to the pipeline latency of that source.
  • The compiler’s ability to schedule depends on the ILP available in the program and the latencies of the pipeline’s functional units.
  • Within a basic block, available ILP is often insufficient to fully hide latencies, requiring analysis across branch boundaries.

Assume fully pipelined functional units with no structural hazards. Latencies (stall cycles needed between producer and consumer):

Producer                  Consumer            Stall cycles
FP ALU op                 Another FP ALU op   3
FP ALU op                 Store double        2
Load double               FP ALU op           1
Load double               Store double        0
Integer ALU op / branch   Any                 0

Consider adding a scalar s to every element of an array x of double-precision values:

for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;

The straightforward RISC-V translation (x1 = array pointer, x2 = loop bound, f2 = s) takes 8 cycles per element unscheduled due to stalls after the load and the add:

Loop: fld    f0, 0(x1)     // cycle 1
      fadd.d f4, f0, f2    // cycle 3 (1 stall: load latency)
      fsd    f4, 0(x1)     // cycle 6 (2 stalls: FP add-to-store latency)
      addi   x1, x1, -8    // cycle 7
      bne    x1, x2, Loop  // cycle 8

Moving the addi up to fill the stall cycle after the load, and adjusting the store offset to compensate, reduces this to 7 cycles with 2 stalls remaining:

Loop: fld    f0, 0(x1)     // cycle 1
      addi   x1, x1, -8    // cycle 2: covers the load latency
      fadd.d f4, f0, f2    // cycle 3
      fsd    f4, 8(x1)     // cycle 6 (2 stalls); offset adjusted for the addi
      bne    x1, x2, Loop  // cycle 7

Only 3 of those 7 cycles (the fld, fadd.d, and fsd) do useful work on the data; the other four are loop overhead (addi, bne) and stalls — precisely the overhead that loop unrolling, discussed above, attacks.