Overcoming Memory Dependences with Dynamic Disambiguation
The Challenge of Memory Dependences
Data dependences through memory locations present a greater scheduling challenge than register dependences.
- Identification latency: Register names are fully encoded in instructions, allowing early dependence checks. Memory addresses must be calculated by an address generation unit first.
- Calculation stalls: Address calculation itself may stall due to register dependences on older arithmetic or load instructions that manipulate pointer values.
- In-order limitation: A naive solution forces all load and store instructions to execute in program order. This stalls cache access for loads in subsequent loop iterations until current-iteration stores commit, severely limiting Instruction-Level Parallelism (ILP).
To overcome the performance penalty of strict in-order memory execution, processors implement mechanisms to safely reorder memory operations as soon as data dependences allow.
Load Bypassing and Store Forwarding
Processors dynamically schedule memory operations by calculating effective addresses as early as possible while strictly holding store commits until instruction retirement.
- Load Bypassing: A load instruction evaluates the effective addresses of all older, pending stores in the store buffer.
  - If the load address does not match any older store address, the load bypasses the stores and accesses the cache immediately.
  - This hides cache miss latency and resolves data dependences for subsequent arithmetic instructions sooner.
- Store Forwarding: If a load address matches the address of an older pending store, a data dependence exists.
  - The load retrieves its data directly from the input physical register of the youngest matching older store.
  - This bypasses the memory hierarchy entirely, accelerating execution; it is especially valuable on architectures with few architectural registers, such as x86, which frequently spill and restore variables through memory.
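The two cases above amount to an associative search of the store buffer. The following is a behavioral sketch, not hardware: the `StoreBufferEntry` class, the list-based buffer, and the dict-based cache are illustrative stand-ins, and it assumes all older store addresses are already known.

```python
class StoreBufferEntry:
    """One older, pending (uncommitted) store with a known address."""
    def __init__(self, addr, data):
        self.addr = addr   # effective address
        self.data = data   # value to be written to memory at commit

def execute_load(addr, store_buffer, cache):
    # Scan older stores from youngest to oldest so the youngest match wins.
    for entry in reversed(store_buffer):
        if entry.addr == addr:
            return entry.data          # store forwarding: read the store's data
    return cache[addr]                 # load bypassing: no match, access the cache

cache = {0x100: 1, 0x200: 2}
pending = [StoreBufferEntry(0x100, 99)]      # older store to 0x100, not yet committed
print(execute_load(0x100, pending, cache))   # 99: forwarded from the store buffer
print(execute_load(0x200, pending, cache))   # 2: bypasses the pending store
```

Real hardware additionally handles partial address overlap and mismatched access sizes, which this sketch ignores.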
Both load bypassing and store forwarding rely on exact knowledge of all older store addresses. Waiting for these addresses to be calculated reintroduces pipeline stalls, which motivates predictive techniques.
Speculative Memory Disambiguation
To prevent loads from stalling behind older stores with uncalculated addresses, processors use speculative memory disambiguation to guess dependency statuses.
- Naive Speculation: A load is allowed to bypass older stores with pending addresses by speculating that no data dependence exists.
- Validation: Once the pending store calculates its address, the hardware checks for a match against the bypassed load.
- Mispeculation Recovery: If a dependence is discovered after the fact, the speculation was incorrect. The processor must flush the pipeline from the offending load onward and re-execute the load and all subsequent instructions.
Because the penalty for flushing the pipeline is extremely high, naive speculation often performs worse than no speculation. The processor must selectively predict which specific loads and stores are safe to reorder.
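A minimal sketch of the validation step: when a store's address finally resolves, any younger load that already bypassed it must be checked for an address match, and any conflicting load squashed and replayed. The dict-based load records and field names here are illustrative assumptions.

```python
def validate_store(store_addr, bypassed_loads):
    """Return the speculatively executed younger loads that conflict with a
    store whose address just resolved; these must be squashed and replayed."""
    return [load for load in bypassed_loads if load["addr"] == store_addr]

# Two loads bypassed an older store whose address was still unknown.
bypassed = [{"pc": 0x40, "addr": 0x100}, {"pc": 0x44, "addr": 0x200}]
conflicts = validate_store(0x100, bypassed)      # store address resolves to 0x100
print([hex(load["pc"]) for load in conflicts])   # ['0x40']: flush and re-execute
```

The cost asymmetry is visible even in this sketch: a correct guess saves a few cycles of waiting, while a wrong guess discards every instruction younger than the conflicting load.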
Selective Speculation via Store Sets
Modern out-of-order processors learn memory dependence patterns to apply speculative memory disambiguation selectively. A primary implementation of this is the Store Sets scheme, which tracks instructions recently involved in memory dependence mispeculations.
- Hardware Structures:
  - Store Set ID Table (SSIT): Indexed by bits of the program counter (PC) of load and store instructions. Maps an instruction to a short integer Store Set ID (e.g., 0–63). All entries initialize to invalid.
  - Last Fetched Store Table (LFST): Indexed by the Store Set ID. Tracks the identifier (e.g., the reorder buffer tag) of the last fetched store instruction belonging to each store set.
- Operation Steps:
  - On Mispeculation: The hardware allocates a Store Set ID and inserts it into the SSIT entries of both the load and the store that conflicted. If either already belongs to a set, that ID is reused. If both belong to different sets, the sets are merged.
  - On Store Dispatch: The store checks its SSIT entry. If valid, it records its identifier in the corresponding LFST entry. Once the store calculates its effective address, the LFST entry is cleared.
  - On Load Dispatch: The load checks its SSIT entry and the corresponding LFST entry.
    - Stall Condition: If the load's SSIT entry is valid and points to an LFST entry containing a pending store without a calculated address, the load is stalled.
    - Speculate Condition: If the SSIT entry is invalid (no recent mispeculation history) or the LFST entry is empty (all relevant older stores have known addresses), the load safely bypasses the older stores.
While the Store Sets scheme effectively identifies problematic dependencies, its history-based tracking introduces structural and scaling limitations over time.
Limitations and Extensions of Memory Disambiguation
The hardware structures governing speculative disambiguation require maintenance to remain accurate and performant.
- Store Set Limitations:
  - Aliasing: Different loads and stores may map to the same SSIT entry, causing independent instructions to be falsely merged into the same store set.
  - Stale History: Store sets only grow over time, leading to overly conservative scheduling as infrequent dependences permanently restrict load bypassing.
  - Resolution: Processors periodically clear the SSIT and LFST (e.g., every few million instructions), forcing the hardware to rediscover only the currently critical dependences.
- Multiprocessor Constraints: Multiprocessor systems enforce memory consistency models that introduce additional ordering constraints on loads and stores, even after physical addresses are known. The hardware must learn when reordering risks consistency violations and restrict speculation accordingly.
- Value Prediction Relation: Speculative memory disambiguation is a highly restricted form of value prediction (guessing the actual data output of an instruction). While disambiguation only predicts the absence of a memory address match, generalized value prediction attempts to guess exact values to break dataflow restrictions—a concept that remains largely impractical for commercial processors due to accuracy and hardware costs.