Fundamentals of Instruction-Level Parallelism (ILP)

Instruction-level parallelism (ILP) is the potential overlap in the execution of instructions within a program. Maximizing ILP is the primary method for improving uniprocessor performance. There are two largely separable approaches to discovering and exploiting ILP:

  • Dynamic (Hardware-based): Relies on hardware mechanisms to discover and exploit parallelism at runtime. This approach dominates desktop, server, and mobile processor designs.
  • Static (Software-based): Relies on compiler technology to find parallelism statically at compile time. This approach is primarily successful in domain-specific architectures or well-structured scientific applications.

Understanding the mathematical components of execution time highlights exactly where these dynamic and static approaches focus their optimization efforts.

The Pipeline Performance Framework

The performance of a pipelined processor is governed by its cycles per instruction (CPI), the sum of the ideal pipeline CPI and the per-instruction contributions from each class of pipeline stall:

  Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

  • Ideal pipeline CPI: A measure of the maximum performance attainable by the hardware implementation.
  • Structural stalls: Delays caused by hardware resource limitations preventing concurrent instruction execution.
  • Data hazard stalls: Delays required to maintain correct execution order when instructions depend on the results of previous instructions.
  • Control stalls: Delays caused by branch instructions determining the flow of the program.
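The decomposition above can be sketched numerically. The following is a minimal Python model; the stall figures and function names are hypothetical, chosen only for illustration:

```python
# Sketch of the CPI decomposition; all stall figures below are hypothetical.

def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Total CPI = ideal pipeline CPI plus the average stall cycles per
    instruction contributed by each hazard class."""
    return ideal_cpi + structural + data_hazard + control

def execution_time_ns(instruction_count, cpi, clock_period_ns):
    """Classic performance equation: IC x CPI x clock period."""
    return instruction_count * cpi * clock_period_ns

cpi = pipeline_cpi(ideal_cpi=1.0, structural=0.05, data_hazard=0.30, control=0.20)
print(cpi, execution_time_ns(1_000_000, cpi, 0.5))
```

Reducing any stall term lowers total CPI directly, which is why both dynamic and static techniques target the stall components rather than the ideal CPI.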

Minimizing these pipeline stalls requires uncovering sufficient independent instructions, a task inherently constrained by the sequential structure of basic program code.

Exposing Parallelism Across Basic Blocks

The amount of parallelism available within a single basic block—a straight-line code sequence with no branches in except at the entry and no branches out except at the exit—is severely limited.

  • Branch Frequency: In typical RISC programs, average dynamic branch frequency is between 15% and 25%, meaning only three to six instructions execute between a pair of branches.
  • Loop-Level Parallelism: To obtain substantial performance enhancements, ILP must be exploited across multiple basic blocks, most commonly by overlapping independent iterations of a loop. This is achieved by unrolling the loop either statically via the compiler or dynamically via hardware.
  • Data-Level Parallelism (DLP): An alternative to loop unrolling is using Single Instruction, Multiple Data (SIMD) or vector architectures, which apply a single parallel instruction to a collection of data items, replacing hundreds of sequential instructions with a few deeply pipelined operations.
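What unrolling exposes can be sketched in a small Python model of the transformation (illustrative only; a real compiler would emit machine code, not Python):

```python
# Sketch: loop unrolling by 4 with independent partial sums.

def sum_rolled(a):
    total = 0.0
    for x in a:
        total += x          # each add depends on the previous one
    return total

def sum_unrolled_by_4(a):
    # Four partial sums carry four independent dependence chains, so a
    # pipelined or multiple-issue machine can overlap the additions.
    s0 = s1 = s2 = s3 = 0.0
    n = len(a) - len(a) % 4
    for i in range(0, n, 4):
        s0 += a[i]; s1 += a[i+1]; s2 += a[i+2]; s3 += a[i+3]
    for i in range(n, len(a)):   # cleanup loop for leftover elements
        s0 += a[i]
    return s0 + s1 + s2 + s3
```

The rolled version serializes every addition on one accumulator; the unrolled version trades a longer body for independent work in each iteration.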

To safely exploit parallelism across these block and loop boundaries, the system must rigidly respect the computational relationships between data elements.

Data Dependences

Determining how one instruction depends on another is critical for establishing which instructions can execute in parallel. An instruction j is data dependent on a preceding instruction i if either of the following conditions holds:

  1. Instruction i produces a result that may be used by instruction j.
  2. Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (forming a dependence chain).
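These two conditions can be sketched on a toy instruction encoding. The representation below (each instruction as a destination register plus source registers) and all register names are illustrative assumptions:

```python
# Sketch: the two data-dependence conditions on (dest, sources) instructions.

def directly_dependent(producer, consumer):
    """Condition 1: producer writes a register that consumer reads."""
    dest, _ = producer
    _, srcs = consumer
    return dest in srcs

def data_dependent(i, j, program):
    """Condition 2: j depends on i directly, or through a chain of
    intermediate instructions between them."""
    if directly_dependent(program[i], program[j]):
        return True
    return any(directly_dependent(program[i], program[k]) and
               data_dependent(k, j, program)
               for k in range(i + 1, j))

# r1 = r2 + r3 ; r4 = r1 + r5 ; r6 = r4 + r7  (a chain through r1 and r4)
prog = [("r1", ("r2", "r3")), ("r4", ("r1", "r5")), ("r6", ("r4", "r7"))]
```

Here the third instruction never reads r1 directly, yet it is data dependent on the first instruction through the chain, so the two cannot execute in parallel.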

A data dependence conveys three specific properties about a program:

  • Hazard Potential: The possibility that an actual hazard will occur in the pipeline.
  • Execution Order: The strict sequence in which results must be calculated to preserve correctness.
  • Parallelism Limits: An absolute upper bound on how much parallelism can possibly be exploited.

Dependences are a property of the programs themselves, whereas hazards and stalls are properties of the pipeline organization. Overcoming data limitations requires either scheduling the code to avoid the resulting hazards or transforming the code to eliminate the dependence entirely.

Not all instruction relationships involve the actual computational transfer of values; some arise strictly from the reuse of hardware storage locations.

Name Dependences

A name dependence occurs when two instructions use the same register or memory location, but there is no actual flow of data between the instructions. There are two types of name dependences:

  • Antidependence (WAR - Write After Read): Occurs when instruction j writes to a register or memory location that instruction i reads. The original order must be preserved to ensure instruction i reads the correct, older value.
  • Output dependence (WAW - Write After Write): Occurs when instruction i and instruction j write to the same register or memory location. The ordering must be preserved to ensure the location retains the final value written by instruction j.

Because name dependences are not true data dependences, they can be completely eliminated by register renaming. This involves allocating a new, distinct register for the reused name, which can be done statically by a compiler or dynamically by the hardware.
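Renaming can be sketched in a few lines. This is an illustrative Python model of the idea, not a real renaming unit; the physical register names p0, p1, ... are assumptions:

```python
# Sketch: eliminating WAR/WAW dependences by giving every write a fresh name.

def rename(program):
    mapping = {}   # architectural name -> current physical name
    fresh = (f"p{n}" for n in range(1000))
    renamed = []
    for dest, srcs in program:
        # Reads use the current mapping, preserving true (RAW) dependences.
        srcs = tuple(mapping.get(s, s) for s in srcs)
        # Each write gets a distinct physical register, so later writes to
        # the same architectural name no longer conflict (no WAR/WAW).
        mapping[dest] = next(fresh)
        renamed.append((mapping[dest], srcs))
    return renamed

# r1 = r2 + r3 ; r2 = r4 + r5 (WAR on r2) ; r1 = r6 + r7 (WAW on r1)
prog = [("r1", ("r2", "r3")), ("r2", ("r4", "r5")), ("r1", ("r6", "r7"))]
```

After renaming, every destination is unique, so the second and third instructions can execute in any order relative to the first without corrupting its operands or its result.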

While data and name dependences govern the flow and storage of variables, another class of dependence dictates whether an instruction should be executed at all.

Control Dependences

A control dependence determines the ordering of an instruction with respect to a branch, ensuring that the instruction executes only when the program flow dictates it should. For example, the execution of statements within the “then” portion of an if statement is control dependent on the evaluation of the branch condition.

Preserving strict program order inherently preserves control dependences. Control dependence itself, however, is not the critical property: an optimizer may violate it, provided two properties essential to correctness remain intact:

  • Exception Behavior: Reordering instructions must not cause an exception that would not have occurred in the strict, original program order.
  • Data Flow: The actual flow of data values from producers to consumers must be maintained. Ignoring a control dependence could result in an instruction receiving a value from the wrong execution path.
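The exception-behavior constraint is easiest to see in a small sketch (Python here purely for illustration): hoisting an instruction above the branch that guards it can raise an exception the original program never produces.

```python
# Sketch: why reordering must not introduce new exceptions.

def original(a, b):
    if b != 0:           # the divide is control dependent on this branch
        return a // b
    return 0

def unsafely_hoisted(a, b):
    q = a // b           # speculatively moved above its guarding branch
    if b != 0:
        return q
    return 0

print(original(10, 0))   # prints 0; the divide never executes
# unsafely_hoisted(10, 0) raises ZeroDivisionError, which the original
# program order could never produce.
```

Hardware and compiler speculation schemes therefore need a way to defer or suppress exceptions from instructions that turn out to lie on the wrong path.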

Strictly enforcing these control boundaries severely limits instruction overlap, necessitating mechanisms that can safely look ahead past unresolved branches.

Speculation

Speculation is a technique that allows the processor or compiler to overcome control dependence limitations by guessing the outcome of a branch and executing instructions before the branch is fully resolved.

  • Hardware Speculation: The processor makes dynamic bets on branch outcomes based on execution history.
  • Software Speculation: The compiler statically reorders instructions across branches, betting on the most likely execution path.

If the speculation is correct, the pipeline avoids a costly control stall. If the bet is incorrect, the hardware or software must include a recovery mechanism to cancel the speculatively executed instructions, preventing illegal changes to the data flow and isolating any premature exceptions.
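The commit-or-squash discipline can be sketched as a toy model (the state layout and function name below are illustrative assumptions, not a real microarchitecture): speculative work lands in a buffer and reaches architectural state only if the prediction was correct.

```python
# Sketch: speculate past a branch, then commit on a correct prediction
# or squash (discard the buffered work) on a misprediction.

def run_with_speculation(predicted_taken, actually_taken, state):
    buffer = dict(state)                 # speculative copy of register state
    if predicted_taken:
        buffer["r1"] = buffer["r1"] + 1  # work from the predicted path
    if predicted_taken == actually_taken:
        state.update(buffer)             # commit: the bet paid off
        return state, "committed"
    return state, "squashed"             # recovery: architectural state untouched
```

Because the wrong-path update never leaves the buffer, a misprediction costs only the discarded work, and the architectural state (and thus the data flow) is never corrupted.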