Loop Unrolling
Loop unrolling replicates the loop body multiple times to expose more ILP across a larger straight-line code region. It increases data operations relative to branch and loop maintenance overhead.
To schedule an unrolled loop effectively, the compiler must:
- Prove iteration independence: Confirm loop iterations are independent aside from loop maintenance code.
- Rename registers: Reusing the same registers across unrolled iterations creates antidependences and output dependences that force sequential execution. Assigning distinct registers to each copy removes these constraints.
- Symbolic substitution: Fold constants and compute per-iteration address offsets statically, collapsing multiple pointer updates into one.
- Memory disambiguation: Verify that loads and stores from different iterations do not alias.
- Code scheduling: Interleave independent loads and stores to fill latency slots.
Applying all of these to the 4x unrolled loop eliminates three branches and three pointer decrements and allows loads from different iterations to be hoisted above the adds. The scheduled, unrolled loop runs in 14 cycles with zero stalls, or 3.5 cycles per element, down from 6.75 per element for the unrolled but unscheduled loop and 7 for the original scheduled loop. The gain from scheduling is larger on the unrolled loop because more independent instructions are exposed.
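The transformed loop itself is not listed above; a C-level sketch of what the compiler produces might look like the following (the function name and temporaries are illustrative). Each copy of the body gets its own temporary, mimicking the register renaming that removes anti- and output dependences, and the four address offsets are folded into constants so only one pointer update per pass remains:

```c
#include <assert.h>

/* 4x-unrolled form of "for (i=999; i>=0; i=i-1) x[i] = x[i] + s;".
   Distinct temporaries t0..t3 play the role of renamed registers;
   the four independent loads can be scheduled ahead of the adds. */
static void add_scalar_unrolled(double *x, double s) {
    for (int i = 999; i >= 0; i -= 4) {
        double t0 = x[i];        /* four independent loads */
        double t1 = x[i - 1];
        double t2 = x[i - 2];
        double t3 = x[i - 3];
        t0 += s;                 /* four independent adds */
        t1 += s;
        t2 += s;
        t3 += s;
        x[i]     = t0;           /* four independent stores */
        x[i - 1] = t1;
        x[i - 2] = t2;
        x[i - 3] = t3;
    }   /* one decrement and one branch per four elements */
}
```

Note the transformation is only legal here because 1000 is a multiple of 4; in general the compiler must also emit a cleanup loop for leftover iterations.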
Limits to further unrolling:
- Diminishing returns: The incremental benefit of eliminating loop overhead shrinks as the unroll factor grows.
- Code size: Aggressive unrolling expands the program footprint, potentially causing instruction cache misses.
- Register pressure: More live values compete for registers; spilling to memory can erase the gains.
Pipeline Scheduling
To keep a pipeline full, a compiler must find sequences of independent instructions that can be overlapped during execution.
- To avoid a stall, a dependent instruction must be separated from its source by a distance in clock cycles equal to the pipeline latency of that source.
- The compiler’s ability to schedule depends on the ILP available in the program and the latencies of the pipeline’s functional units.
- Within a basic block, available ILP is often insufficient to fully hide latencies, requiring analysis across branch boundaries.
Assume fully pipelined functional units with no structural hazards. The table gives the stall cycles incurred when the consumer immediately follows the producer (equivalently, the number of intervening instructions needed to avoid a stall):
| Producer | Consumer | Stalls |
|---|---|---|
| FP ALU op | FP ALU op | 3 |
| FP ALU op | Store | 2 |
| Load | FP ALU op | 1 |
| Load | Store | 0 |
| Integer ALU op | Dependent branch | 1 |
| Integer op | Integer op | 0 |
Consider adding a scalar to each element of an array:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
The straightforward RISC-V translation (x1 = array pointer, x2 = loop-termination address, f2 = s) takes 9 cycles per element unscheduled, due to stalls after the load, the add, and the addi:
Loop: fld    f0, 0(x1)
      fadd.d f4, f0, f2    // stall: load latency
      fsd    f4, 0(x1)     // stall, stall: FP add latency
      addi   x1, x1, -8
      bne    x1, x2, Loop  // stall: branch depends on the addi
Moving the addi into the load delay slot, which also separates the bne from the addi, and adjusting the store offset to compensate reduces this to 7 cycles, with 2 stalls remaining:
Loop: fld    f0, 0(x1)
      addi   x1, x1, -8    // fills load delay slot
      fadd.d f4, f0, f2
      fsd    f4, 8(x1)     // stall, stall: FP add latency; offset adjusted for addi
      bne    x1, x2, Loop
Only 3 of those 7 cycles (the load, the add, and the store) do useful work on the array; the other 4 are loop overhead and stalls.