Compiler Techniques for Exposing ILP
Basic Pipeline Scheduling
To keep a pipeline full, a compiler must find sequences of independent instructions that can be overlapped during execution.
- To avoid a pipeline stall, a dependent instruction must be separated from its source instruction by a distance in clock cycles equal to the pipeline latency of that source.
- The compiler’s ability to schedule these instructions depends on the amount of instruction-level parallelism (ILP) available in the program and the specific latencies of the pipeline’s functional units.
- Within a standard basic block, the amount of available ILP is often insufficient to fully hide operation latencies, necessitating analysis across branch boundaries.
To systematically find more independent instructions for scheduling, compilers look beyond single basic blocks and extract parallelism from iterative structures.
Loop Unrolling
Loop unrolling increases the size of straight-line code fragments to expose greater amounts of ILP.
- Unrolling replicates the loop body multiple times and adjusts the loop termination code.
- It effectively increases the number of data operations relative to the branch and loop maintenance overhead instructions.
- When the upper bound of the loop (call it n) is unknown at compile time, the compiler cannot assume n is a multiple of the unroll factor (call it k) and must handle the remainder iterations.
- The compiler handles unknown bounds by generating two consecutive loops:
- A standard, non-unrolled loop that executes n mod k times.
- An unrolled loop containing k copies of the body, surrounded by an outer loop that iterates n/k times.
- This technique is functionally similar to strip mining.
Replicating the loop body creates structural duplicates, which introduces artificial name dependences among registers that must be removed to achieve true parallelism.
Eliminating Name Dependences
Unrolling a loop without altering the registers used in the body creates name dependences, specifically antidependences and output dependences.
- Using the exact same architectural registers for multiple unrolled iterations forces sequential execution despite the iterations being logically independent.
- The compiler performs static register renaming by assigning distinct physical registers to the computations of each unrolled iteration.
- Removing these name dependences prevents unnecessary hardware constraints and allows instructions from different iterations to be freely interleaved during scheduling.
With artificial register dependences removed, the compiler must still simplify the address calculations and perform memory disambiguation before it can safely reorder the memory operations.
Transformations and Memory Disambiguation
Transforming an unrolled loop into a stall-free schedule requires a methodical series of analytical and structural adjustments.
- Iteration Independence: The compiler must first prove that the loop iterations are independent, aside from the standard loop maintenance code.
- Symbolic Substitution and Simplification: Expressions are rearranged and constants are folded to simplify address calculations.
- Base memory addresses are statically compensated with specific offsets (e.g., -8, -16) for each unrolled iteration.
- This compensation allows the compiler to collapse multiple independent pointer decrement or increment operations into a single pointer update.
- Memory Disambiguation: The compiler must analyze memory addresses to guarantee that loads and stores from different iterations do not alias or refer to the same address.
- Code Scheduling: Independent loads and stores are interchanged, and operations are scheduled to preserve necessary data dependences while eliminating pipeline stalls.
While these transformations effectively eliminate stalls and overhead, aggressive unrolling imposes secondary physical and architectural constraints on the target processor.
Limits to Loop Unrolling
The theoretical performance gains of loop unrolling are strictly bounded by three primary architectural and program effects.
- Diminishing Amortization Returns: As the unroll factor increases, the incremental performance benefit of eliminating loop overhead decreases.
- Code Size Limitations: Aggressive unrolling significantly expands the program footprint, which can exceed the instruction cache capacity and cause performance-degrading cache misses.
- Register Pressure: Scheduling code to maximize ILP drastically increases the number of concurrent live values.
- If the number of live values exceeds the number of available architectural registers, the compiler is forced to spill values to memory and reload them later.
- The memory access latency introduced by register spilling can entirely negate the performance advantages gained from the unrolled schedule.