Advanced Branch Prediction

Deep pipelines and multiple-issue processors require highly accurate branch prediction to mitigate the severe performance limitations imposed by control dependences. Effective dynamic branch prediction requires determining both the branch direction (taken or not taken) and the target address.

To improve prediction accuracy beyond the capabilities of simple local history trackers, modern architectures capture the structural relationships between distinct branches.

Correlating Branch Predictors

Branch outcomes are frequently correlated; the execution path of a specific branch is often determined by the outcomes of recently executed neighboring branches.

  • Predictor Framework: An (m,n) predictor uses the behavior of the last m branches to choose from 2^m prediction tables, where each table entry is an n-bit saturating counter.
  • Hardware Implementation:
    • Maintains an m-bit shift register to record global branch history, logging each recent branch as taken or not taken.
    • The prediction buffer is indexed by combining (via concatenation or hashing) the lower-order bits of the branch address with the m-bit global history.
  • gshare Predictor: A widely used correlating predictor that forms its index by combining the branch address and global branch history using an exclusive-OR (XOR) hash.
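The gshare indexing scheme described above can be sketched in a few lines. This is an illustrative software model, not a cycle-accurate design; the table size, the PC shift, and the counter initialization are all assumptions chosen for brevity.

```python
# Minimal gshare sketch: the table index is the XOR of low-order PC bits with
# the global history register (GHR); each entry is a 2-bit saturating counter.

class GsharePredictor:
    def __init__(self, history_bits=10):
        self.mask = (1 << history_bits) - 1
        self.ghr = 0                              # global history shift register
        self.table = [1] * (1 << history_bits)    # 2-bit counters, weakly not-taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask # XOR hash of PC and history

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # True -> predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the outcome into the global history register.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

Note that because the history register participates in the index, the same static branch trains different counters depending on the path taken to reach it, which is precisely how correlation is captured.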

While global history captures inter-branch dependencies, certain branches display highly repetitive local behaviors, requiring a system that can dynamically select the most appropriate prediction strategy.

Tournament Predictors

Tournament predictors adaptively combine multiple prediction mechanisms—typically a global predictor and a local predictor—and dynamically select the most effective one for a given branch.

  • Selector Mechanism: Utilizes an array of 2-bit saturating counters indexed by the branch address. The counter tracks recent prediction accuracy, requiring two consecutive mispredictions to switch the active predictor from local to global, or vice versa.
  • Local Predictor Architecture: Often implemented as a two-level structure to capture specific branch patterns:
    • Level 1: A local history table where each entry records the exact sequence of the most recent outcomes for a specific branch (e.g., a 10-bit entry for the last 10 outcomes).
    • Level 2: A table of n-bit saturating counters indexed by the local history.
    • This dual-level local component can perfectly predict short loops whose terminating pattern fits within the recorded history, behavior a purely global predictor may miss.
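The selector mechanism above can be modeled as an array of 2-bit "chooser" counters that migrate toward whichever component predictor has been correct recently. This is a hedged sketch: the component predictors are stubbed out as caller-supplied values, and the table size is an assumption.

```python
# Tournament-selector sketch: one 2-bit saturating chooser counter per entry
# decides whether the global or the local prediction is used; two consecutive
# wins by the other side flip the choice, matching the two-misprediction rule.

class TournamentSelector:
    def __init__(self, entries=1024):
        self.entries = entries
        self.chooser = [2] * entries      # >=2 -> prefer global, <2 -> prefer local

    def predict(self, pc, global_pred, local_pred):
        use_global = self.chooser[pc % self.entries] >= 2
        return global_pred if use_global else local_pred

    def update(self, pc, taken, global_pred, local_pred):
        i = pc % self.entries
        # Move the chooser toward the component that was correct,
        # but only when the two components disagreed.
        if global_pred == taken and local_pred != taken:
            self.chooser[i] = min(3, self.chooser[i] + 1)
        elif local_pred == taken and global_pred != taken:
            self.chooser[i] = max(0, self.chooser[i] - 1)
    
```

Updating only on disagreement is the key design choice: when both components agree, there is no information about which one to prefer, so the chooser is left untouched.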

Despite the adaptability of tournament predictors, selecting a single optimal global history length remains challenging; short histories miss complex patterns, while long histories slow down training and increase destructive aliasing collisions.

Tagged Hybrid Predictors (TAGE)

Tagged hybrid predictors resolve the history length dilemma by simultaneously employing multiple global predictors indexed with different history lengths and utilizing tags to prevent destructive aliasing.

  • Structure:
    • Features a base predictor P(0) (typically a simple 2-bit predictor indexed by the branch address) and a series of independent tagged prediction tables P(1) through P(n).
    • Tables P(1) through P(n) are accessed using a hash of the Program Counter (PC) and varying, increasing lengths of global branch history.
    • Each entry contains an n-bit prediction counter, a tag (typically 4–8 bits) to verify a hash match, and a 2-bit “use” field to track recent utility.
  • Selection Logic: The active prediction is sourced from the table utilizing the longest history length that results in a successful tag match. If no tags match across P(1) through P(n), the base predictor P(0) provides the default prediction.
  • Initialization: To minimize the training time required to populate large prediction structures, entries can be initialized using static ISA hints or predefined heuristics (e.g., backward-going branches initialized as taken, forward-going as not taken).
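The longest-matching-history selection rule can be sketched as a simple probe loop. This is a heavily simplified model of tagged selection, not the published TAGE algorithm: the index and tag hash functions, the table size, and the 3-bit counter threshold are all placeholder assumptions, and the allocation/"use"-field update logic is omitted.

```python
# TAGE-style selection sketch: probe tables from the longest history length
# down; the first successful tag match supplies the prediction, otherwise the
# base predictor answers.

def tage_predict(pc, ghr, tables, history_lengths, base_predict):
    # tables[i] is a dict mapping index -> (tag, counter); history_lengths[i]
    # gives how many global-history bits feed table i's hashes.
    for i in reversed(range(len(tables))):            # longest history first
        hist = ghr & ((1 << history_lengths[i]) - 1)  # truncate global history
        index = (pc ^ hist) & 0x3FF                   # placeholder index hash
        tag = ((pc >> 10) ^ hist) & 0xFF              # placeholder 8-bit tag
        entry = tables[i].get(index)
        if entry is not None and entry[0] == tag:     # tag verifies ownership
            return entry[1] >= 4                      # 3-bit counter: MSB = taken
    return base_predict(pc)                           # no match: base prediction
```

The tag check is what prevents destructive aliasing: an index collision between two different branches simply fails the tag comparison and falls through to a shorter-history table instead of returning another branch's counter.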

Accurately predicting the branch direction is insufficient for zero-stall execution; the processor must also immediately know where the branch transfers control.

Branch-Target Buffers (BTB)

A Branch-Target Buffer (BTB), or branch-target cache, reduces or eliminates the branch penalty by caching the predicted target address of a branch, allowing instruction fetch to be redirected to the target before the branch is even decoded.

  • Operation:
    • The BTB is accessed in parallel with the branch direction predictors using the PC of the currently fetched instruction.
    • Entries are tagged with the branch PC to prevent utilizing a target address belonging to a different branch in the event of an index conflict.
    • If the PC matches a BTB tag and the branch is predicted “taken,” instruction fetch is immediately redirected to the cached target address, resulting in zero penalty cycles.
    • If the branch is predicted “taken” but misses in the BTB, the pipeline must stall fetching until the instruction is decoded and its target address is computed.
  • Branch Folding: An optimization where the BTB stores one or more actual target instructions rather than just the target address. This allows the hardware to substitute the unconditional branch instruction with the target instruction directly, enabling 0-cycle unconditional branches.
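The tagged lookup behavior above can be sketched as a small direct-mapped table. This is an illustrative model under assumed parameters (entry count, direct-mapped organization); real BTBs may be set-associative and store additional metadata.

```python
# Direct-mapped BTB sketch: each slot holds (branch PC tag, target address).
# The full-PC tag is what prevents an index collision from returning another
# branch's target.

class BranchTargetBuffer:
    def __init__(self, entries=512):
        self.entries = entries
        self.table = [None] * entries        # each slot: (tag_pc, target) or None

    def lookup(self, pc):
        slot = self.table[pc % self.entries]
        if slot is not None and slot[0] == pc:  # tag match: entry belongs to this PC
            return slot[1]                      # redirect fetch here if predicted taken
        return None                             # miss: stall until decode computes target

    def update(self, pc, target):
        # Install or overwrite the entry on a resolved taken branch.
        self.table[pc % self.entries] = (pc, target)
```

A miss returning `None` corresponds to the stall case described above: fetch cannot be redirected until the target is computed later in the pipeline.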

While the BTB efficiently resolves static target addresses, indirect control transfers require highly specialized architectural structures.

Specialized Predictors for Control Flow

  • Return Address Stack (RAS): Procedure returns are a highly frequent class of indirect jumps whose targets vary dynamically based on the call site. The RAS mitigates this with a small hardware buffer operated as a stack: the return address is pushed on a procedure call and popped on a procedure return, yielding near-perfect prediction if the stack accommodates the call depth.
  • Indirect Jumps: Targets for structures like switch statements, virtual functions, and function pointers vary at runtime and require dedicated indirect branch predictors or reliance on the BTB.
  • Loop Branches: Older architectures (like the Intel Core i7 920) utilized dedicated loop exit predictors equipped with counters to predict exact iteration counts. In modern designs, large tagged hybrid predictors effectively subsume this functionality, rendering dedicated loop predictors unnecessary.
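The return-address-stack behavior can be sketched directly; the fixed depth here is an assumption, and the overflow policy (dropping the oldest entry) is one common choice among several.

```python
# RAS sketch: a small fixed-depth stack. Calls push the return PC; returns pop
# the predicted target. On overflow the oldest entry is discarded, which is why
# accuracy degrades once call depth exceeds the stack size.

class ReturnAddressStack:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # overflow: drop the oldest entry
        self.stack.append(return_pc)

    def on_return(self):
        # Empty stack -> no prediction; fall back to the BTB or stall.
        return self.stack.pop() if self.stack else None
```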

To orchestrate these sophisticated prediction mechanisms simultaneously with high-bandwidth instruction retrieval, modern processors isolate them into a dedicated front-end architectural unit.

Integrated Instruction Fetch Units

To satisfy the extreme bandwidth demands of multiple-issue processors, the instruction fetch stage is implemented as an autonomous, decoupled front-end unit that feeds instructions to the execution pipeline.

  • Integrated Branch Prediction: The branch predictor operates internally within the fetch unit, utilizing the BTB and prediction tables to predict multiple branches per fetch bundle autonomously.
  • Instruction Prefetch: The unit manages hardware prefetching independently, utilizing branch prediction data to hide memory access latency.
  • Instruction Memory Buffering: The fetch unit encapsulates the complexity of crossing cache blocks during wide fetches, acting as an on-demand buffer that supplies instructions to the issue stage in the required quantities.
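The decoupling between fetch and issue can be illustrated with a bounded queue: the front end fills it ahead of demand, and the issue stage drains it in whatever quantity it can consume each cycle. This is a toy model under assumed names and sizes, not a real microarchitecture.

```python
from collections import deque

# Decoupled fetch-buffer sketch: fill() models the fetch unit depositing a
# fetch bundle; issue() models the issue stage consuming up to its width.

class FetchBuffer:
    def __init__(self, capacity=32):
        self.capacity = capacity
        self.buf = deque()

    def fill(self, instructions):
        # Accept only as many instructions as fit; the fetch unit
        # would retry the remainder next cycle.
        free = self.capacity - len(self.buf)
        accepted = instructions[:free]
        self.buf.extend(accepted)
        return len(accepted)

    def issue(self, width):
        # Supply up to `width` instructions to the issue stage on demand.
        out = []
        while self.buf and len(out) < width:
            out.append(self.buf.popleft())
        return out
```

Because fill and drain rates are independent, the buffer absorbs fetch hiccups (cache-block crossings, brief miss latency) without starving the issue stage, which is the point of decoupling the front end.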