Multithreading
Cache misses to off-chip memory are unlikely to be hidden by ILP alone. When the processor stalls on a miss, functional unit utilization drops dramatically. A natural alternative is to exploit parallelism already present in applications: transaction processing has concurrency across queries, scientific workloads model inherently parallel physical structures, and desktop systems expose parallelism through multiple active applications.
A thread is an execution stream with its own PC and register state that shares the address space of a parent process. Multithreading is a hardware technique whereby multiple threads share a processor pipeline without an intervening OS process switch, using rapid switching or instruction interleaving to hide pipeline and memory latencies.
A multiprocessor runs independent threads in parallel across multiple full pipelines. Multithreading instead shares most of the core, duplicating only private state (register file and PC) while sharing memory via virtual memory. Thread switches must be far cheaper than OS process switches (hundreds to thousands of cycles) or user-level library switches (tens to hundreds of cycles).
Many modern processors combine both: multiple cores on a chip, each with multithreading support.
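As a rough mental model (not any particular machine's design), the per-thread state a multithreaded core replicates is small compared with what the threads share. A minimal sketch in C, with hypothetical names and a hypothetical 4-thread core:

```c
#include <stdint.h>

#define NUM_ARCH_REGS  32
#define NUM_HW_THREADS  4   /* assumption: a hypothetical 4-way multithreaded core */

/* Private per-thread state: small and cheap to replicate on chip. */
typedef struct {
    uint64_t pc;                      /* program counter               */
    uint64_t regs[NUM_ARCH_REGS];     /* architectural register file   */
    uint64_t asid;                    /* address-space identifier      */
} hw_thread_ctx;

/* The rest of the core (caches, TLBs, functional units, branch
 * predictors) is shared by all threads, which see a common address
 * space through the ordinary virtual-memory mechanism.               */
typedef struct {
    hw_thread_ctx thread[NUM_HW_THREADS];
    /* shared structures elided */
} mt_core;
```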
Hardware Approaches
Fine-Grained Multithreading
- Switching Mechanism: Interleaves threads on every clock cycle, typically in round-robin, bypassing stalled threads.
- Latency Hiding: Masks both short pipeline stalls and long-latency memory operations.
- Throughput vs. Latency: Increases core throughput but degrades individual thread latency, since a ready thread is delayed by others.
- Microarchitectural Impact: Pipeline control and forwarding logic must track thread identifiers alongside register addresses.
- Examples: First used in the Denelcor HEP and Tera MTA supercomputers. SPARC T1–T5 (Sun/Oracle/Fujitsu) use fine-grained MT, targeting transaction processing and web services. The T1 had 8 cores with 4 threads each; the T5 had 16 cores with 8 threads each (128 threads total). NVIDIA GPUs also use fine-grained multithreading.
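A minimal sketch of the round-robin, skip-if-stalled selection described above; thread_ready and the 4-thread configuration are hypothetical stand-ins for real pipeline state:

```c
#include <stdbool.h>

#define NUM_HW_THREADS 4            /* hypothetical 4-way multithreaded core */

/* Hypothetical pipeline query: can this thread issue this cycle,
 * or is it stalled (e.g., waiting on a cache miss)?                 */
bool thread_ready(int tid);

/* Fine-grained multithreading: choose a thread every clock cycle,
 * round-robin, skipping any thread that is currently stalled.
 * Returns -1 when every thread is stalled (an idle cycle).          */
int select_thread_fine_grained(int last_tid)
{
    for (int i = 1; i <= NUM_HW_THREADS; i++) {
        int tid = (last_tid + i) % NUM_HW_THREADS;
        if (thread_ready(tid))
            return tid;
    }
    return -1;
}
```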
Coarse-Grained Multithreading
- Switching Mechanism: Runs one thread until it hits a long-latency stall (e.g., an L2/L3 miss to off-chip memory), then switches. Other threads may be prefetched but are not executed until the stall occurs.
- Single-Thread Performance: Minimizes interference since the pipeline executes a single thread at any point in time.
- Pipeline Startup Overhead: Every context switch introduces a pipeline bubble as the new thread fetches from cold.
- Limitation: Ineffective at hiding short stalls due to the startup penalty. No major current processors use coarse-grained multithreading.
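For contrast, a sketch of the switch-on-long-stall policy; again the query functions are hypothetical:

```c
#include <stdbool.h>

#define NUM_HW_THREADS 4                 /* hypothetical 4-way multithreaded core */

bool thread_ready(int tid);              /* hypothetical pipeline queries          */
bool long_latency_stall(int tid);        /* e.g., L2/L3 miss to off-chip memory    */

/* Coarse-grained multithreading: keep issuing from the current thread
 * until it hits a costly stall, then switch, paying a pipeline-refill
 * bubble while the new thread fetches from cold.                       */
int select_thread_coarse_grained(int current_tid)
{
    if (!long_latency_stall(current_tid))
        return current_tid;                          /* keep running       */

    for (int i = 1; i < NUM_HW_THREADS; i++) {
        int tid = (current_tid + i) % NUM_HW_THREADS;
        if (thread_ready(tid))
            return tid;                              /* switch threads     */
    }
    return current_tid;                              /* nothing else ready */
}
```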
Simultaneous Multithreading (SMT)
A dynamically scheduled superscalar already provides most of the hardware mechanisms that multithreading needs, including a large set of virtual registers supplied by register renaming. SMT is what results from implementing fine-grained multithreading on top of such a processor. Fine-grained MT suits single-issue designs, where each cycle is simply occupied or idle; SMT targets wide-issue designs, where a single thread rarely has enough ILP to fill every issue slot. By drawing instructions from multiple threads, SMT uses TLP to fill slots that single-thread ILP leaves empty, keeping the pipeline as fully utilized as possible.
- Execution Decoupling: Instructions are fetched and issued from one thread at a time, but the dynamic scheduling hardware executes operations from multiple threads in the same clock cycle.
- Dependence Resolution: Register renaming allows instructions from independent threads to be processed simultaneously without false dependences.
- Resource Allocation:
- Static Partitioning: Dedicates pipeline resources per thread. Consistent but caps single-thread peak performance.
- Dynamic Sharing: Distributes pipeline entries based on per-thread ILP demand, maximizing total throughput.
- Critical Structure Replication: Small high-impact structures like the RAS are duplicated per thread.
- Fetch Prioritization: The front end fetches from the thread with the fewest pending instructions, naturally rebalancing fetch when one thread runs ahead (sketched after this list).
- Hardware Requirements: Building SMT on a superscalar requires per-thread renaming tables, separate PCs per thread, and the ability for instructions from multiple threads to commit.
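The fetch-prioritization policy above resembles the ICOUNT heuristic; a minimal sketch, assuming a hypothetical per-thread count of in-flight front-end instructions:

```c
#define NUM_HW_THREADS 2            /* assumption: 2-way SMT, as on the Core i7 */

/* Hypothetical per-thread count of instructions fetched but not yet
 * executed (sitting in decode/rename/issue queues).                    */
extern int pending_insts[NUM_HW_THREADS];

/* ICOUNT-style fetch choice: fetch from the thread with the fewest
 * pending instructions, so a stalled thread that is piling up work
 * cannot monopolize the front end.                                     */
int select_fetch_thread(void)
{
    int best = 0;
    for (int tid = 1; tid < NUM_HW_THREADS; tid++)
        if (pending_insts[tid] < pending_insts[best])
            best = tid;
    return best;
}
```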

(Figure not reproduced: issue-slot diagrams in which empty slots are white and shaded slots belong to different threads.) Sun T1/T2 (Niagara) use fine-grained multithreading; Intel Core i7 and IBM Power10 use SMT, with the T2 and Power10 supporting 8 threads per core and the i7 supporting 2. The T2 issues two instructions per cycle, but always from different threads, which avoids complex dynamic scheduling and relies purely on thread count to hide latency. In all existing SMT implementations, instructions issue from one thread at a time, but execution is decoupled, so operations from multiple threads can execute in the same cycle.
Without multithreading, issue slots are wasted due to ILP limits and long memory stalls. Coarse-grained MT reduces fully idle cycles by switching on costly stalls, but startup bubbles mean some idle cycles remain. Fine-grained MT eliminates fully empty slots by interleaving every cycle. SMT goes further: on a wide-issue processor it can fill slots across both the horizontal (issue width) and vertical (cycle) dimensions simultaneously.
SMT Performance and Energy
- Throughput: On a dual-thread core (Intel Core i7), SMT achieves 1.28× speedup for multithreaded Java and 1.31× for PARSEC.
- Energy: SMT increases dynamic power by keeping FUs more utilized. Since the static overhead of SMT structures is fixed, the speedup outpaces the power increase, yielding net energy savings on parallel workloads (7% reduction for PARSEC). Workloads with limited parallelism see minimal speedup and reduced energy efficiency.
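As a sanity check on those figures: since energy = power × time, the 1.31× PARSEC speedup combined with a 7% energy reduction implies average power rose by roughly 1.31 × 0.93 ≈ 1.22×. A tiny worked example, assuming only that relationship:

```c
#include <stdio.h>

int main(void)
{
    /* Figures reported above for PARSEC on the 2-thread SMT Core i7. */
    double speedup      = 1.31;   /* time_base / time_smt                */
    double energy_ratio = 0.93;   /* energy_smt / energy_base (7% lower) */

    /* energy = power x time, so power_smt / power_base
       = energy_ratio * speedup.                                         */
    double power_ratio = energy_ratio * speedup;

    printf("implied power increase: %.2fx\n", power_ratio);  /* ~1.22x */
    return 0;
}
```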
Microarchitecture Side-Channel Attacks
Fundamentals of Transient Execution
- Definition: Vulnerabilities that exploit speculation and multithreading to leak secret data are called microarchitecture side-channel attacks or transient execution attacks.
- Mechanism: Similar to prime-probe cache attacks, but the spy controls instruction sequences executed between the prime and probe phases.
- Bypass Capabilities: Speculation can access restricted data by bypassing software bounds checks, virtual memory protection, VM isolation, and hardware enclaves.
- Footprint Generation: Speculatively executed instructions modify hardware state (e.g., cache lines) before the processor cancels them.
Impact of SMT
- Pipeline Sharing: SMT lets a spy process share the pipeline directly with a victim.
- High-Bandwidth Side Channels: Shared structures (L1 caches, BTBs) give the spy a high-bandwidth observation channel.
- Rapid Probing: Parallel execution lets the spy probe the side channel before cache refills overwrite the leaked state.
Meltdown
Meltdown reads arbitrary kernel memory by bypassing virtual memory protection.
- Prime: Allocate user_memory and flush it from the L1 cache.
- Speculate and Leak: Trigger an exception (e.g., divide by zero). Before it resolves, speculatively load from a forbidden kernel address, mask a bit, and use it as an index into user_memory. That cache line is loaded before the processor cancels the instructions at the ROB head.
- Probe: Time accesses to user_memory. A cache hit at index 0 means the secret bit was 0; a hit at index 256 means it was 1.
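A hedged sketch of the prime and probe phases in C, using x86 timing and flush intrinsics. The speculate-and-leak step is deliberately elided (it requires suppressing the fault on the kernel load, e.g., via a signal handler or TSX, and is blocked by current mitigations), so as written this only demonstrates the timing channel, not an actual leak:

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

#define STRIDE 256       /* index spacing used in the description above */
static uint8_t user_memory[2 * STRIDE];

/* Prime: flush both candidate lines so that a later fast access can
 * only be explained by the transient (speculative) load.             */
static void prime(void)
{
    _mm_clflush(&user_memory[0]);
    _mm_clflush(&user_memory[STRIDE]);
    _mm_mfence();
}

/* Probe: time one load; a short latency means the line was brought
 * into the cache by the cancelled speculative instructions.          */
static uint64_t time_access(volatile uint8_t *p)
{
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);
    (void)*p;
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

int main(void)
{
    prime();

    /* Speculate-and-leak step elided: the faulting kernel load, bit
     * mask, and dependent access into user_memory would go here.     */

    uint64_t t0 = time_access(&user_memory[0]);
    uint64_t t1 = time_access(&user_memory[STRIDE]);

    /* With the leak step missing, both accesses miss and the result is
     * meaningless; with it, the faster index reveals the secret bit.  */
    printf("t[0]=%llu t[256]=%llu -> bit %d\n",
           (unsigned long long)t0, (unsigned long long)t1, t0 < t1 ? 0 : 1);
    return 0;
}
```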
General Structure
- A spy uses speculation to execute but not commit instructions that access secret data.
- The secret acts as an index or operand that modifies a shared hardware structure (cache, BTB, memory disambiguation logic).
- Attacks like Spectre invoke the victim via standard interfaces with crafted inputs to force speculative leakage.
Defenses
- Software and Compiler:
- Fence Instructions: Insert fences that block speculation until conditions resolve.
- Speculative Load Hardening: Introduce a data dependency between the speculated condition and the potentially leaking load (this and the fence approach are sketched after this list).
- Drawback: Requires recompilation and incurs a steep performance penalty.
- Hardware and Microarchitectural:
- Resource Isolation: Disable SMT to prevent spy and victim from sharing structures.
- Cache Partitioning: Isolate cache space per process to block prime-probe tracking.
- Pipeline Modifications: Delay use of sensitive data under speculation, or reverse the impact of mis-speculated instructions on hardware state.
- No Complete Defense: There is currently no set of defenses guaranteed to work against all possible attacks. Each mitigation addresses known attack patterns but leaves room for new variants.
- The ISA Gap: Current ISAs define behavior in a timing-independent manner, leaving timing side-effects unspecified. Future architectures need new abstractions to formally model how timing interacts with security.
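To make the fence and speculative-load-hardening mitigations concrete, a hedged sketch of the classic bounds-check gadget with both fixes applied by hand; array names, sizes, and the 64-byte stride are illustrative, and real SLH is inserted by the compiler rather than written manually:

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* _mm_lfence */

#define N 256
static uint8_t array1[N];
static uint8_t array2[N * 64];   /* 64-byte stride: one cache line per value */

/* Fence variant: the lfence stops the dependent loads from executing
 * speculatively before the bounds check has resolved.                */
uint8_t read_with_fence(size_t i)
{
    if (i < N) {
        _mm_lfence();
        return array2[array1[i] * 64];
    }
    return 0;
}

/* Hardening variant: derive a branchless mask from the comparison so
 * that, even if the branch is mis-speculated, the index collapses to 0
 * instead of an attacker-chosen value. Simplified: assumes i < 2^63
 * and an arithmetic right shift; compiler-inserted SLH handles the
 * corner cases this sketch ignores.                                   */
uint8_t read_with_masking(size_t i)
{
    if (i < N) {
        size_t mask = (size_t)((intptr_t)(i - N) >> 63);  /* all ones iff i < N */
        return array2[array1[i & mask] * 64];
    }
    return 0;
}
```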
ARM Cortex-A53
Overview
- High-efficiency core used in SoC designs, deployed extensively in big.LITTLE mobile configurations.
- Dual-issue, statically scheduled superscalar with dynamic issue detection.
Pipeline
- 8-stage integer pipeline: F1, F2 (Fetch), D1, D2, D3/ISS (Decode/Issue), EX1, EX2 (Execute), WB (Writeback).
- 10-stage FP pipeline: 5 fetch/decode stages, 5 execution stages.
- Strictly in-order: an instruction issues only when its operands will be available and all earlier instructions have been issued.
- Scoreboard-based issue logic tracks operand availability.
Branch Prediction
- 4-cycle fetch (F1–F4) with an address generation unit (AGU) that produces the next PC either by incrementing the current one or from a predicted branch target.
- Four-level branch prediction hierarchy:
- Branch Target Cache: Single-entry, checked in F1. 0-cycle penalty on correct prediction.
- Hybrid Predictor: 3072-entry, checked in F3. 2-cycle penalty on correct prediction.
- Indirect Branch Predictor: 256-entry, checked in F4. 3-cycle penalty on correct prediction.
- Return Stack: 8-deep, checked in F4. 3-cycle penalty on correct prediction.
- Branch decisions evaluated in ALU pipe 0. Misprediction incurs an 8-cycle flush penalty.
Hazards
- Ideal CPI of 0.5 under perfect dual-issue conditions.
- Functional Hazards: Adjacent instructions selected for dual issue that need the same functional unit serialize at entry to the execution units; the compiler must schedule code to minimize these conflicts.
- Data Hazards: Dependencies cause hardware interlocks and stalls until data resolves.
- Control Hazards: Branch mispredictions flush the pipeline with an 8-cycle penalty.
- Memory stalls: I-TLB/I-cache misses starve the instruction queue; D-TLB/D-cache misses stall execution directly.
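A back-of-the-envelope sketch of how these hazards erode the ideal CPI of 0.5. Only the 8-cycle misprediction penalty comes from the text; every rate below is a made-up placeholder:

```c
#include <stdio.h>

int main(void)
{
    double cpi = 0.5;                /* ideal CPI with perfect dual issue (from the text) */

    /* Placeholder per-instruction event rates -- NOT measured values.  */
    double branch_frac      = 0.15;  /* fraction of instructions that are branches */
    double mispredict_rate  = 0.04;  /* mispredictions per branch                  */
    double mispredict_cost  = 8.0;   /* cycles, from the text                      */
    double dmiss_per_inst   = 0.02;  /* data-cache misses per instruction          */
    double dmiss_cost       = 12.0;  /* cycles, placeholder                        */

    /* Each hazard class adds stall cycles per instruction to the base. */
    cpi += branch_frac * mispredict_rate * mispredict_cost;
    cpi += dmiss_per_inst * dmiss_cost;

    printf("effective CPI ~= %.2f\n", cpi);  /* ~0.79 with these placeholders */
    return 0;
}
```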
Power Efficiency
- Shallow pipeline and aggressive branch prediction constrain pipeline losses.
- Consumes approximately 1/200th the power of contemporary high-performance quad-core processors, making it the efficiency (LITTLE) core in big.LITTLE configurations.
Intel Golden Cove
Overview
- Aggressive OOO speculative microarchitecture targeting maximum ILP for single-threaded execution.
- Supports SMT with up to 2 threads per core.
- Deployed in Alder Lake P-cores (client) and Sapphire Rapids (server).

Front-End
- Fetches 32 bytes per cycle from the L1 instruction cache through 6 parallel decoders into a 144-entry micro-op queue.
- Microcode engine handles complex x86 instructions at up to 4 micro-ops per cycle.
- Micro-op Cache: 4K entries, delivers up to 8 micro-ops per cycle directly to the queue, bypassing instruction cache and decoders.
- Loop Stream Detector: Detects loops fitting in the micro-op queue and disables upstream components to save power. In SMT mode, reallocates idle front-end resources to the second thread.
- ITLB: 256 entries for 4 KB pages, 32 entries for 2/4 MB pages.
- Branch Prediction: L2 BTB scales to 12K entries.
Execution Engine
- 512-entry ROB supports deep OOO execution windows.
- Renames and issues up to 6 micro-ops per cycle. Zero and move idiom elimination resolves register clears and data moves without consuming scheduler resources.
- Unified arithmetic scheduler: 97 entries shared across integer and FP/vector operations. Issues up to 5 micro-ops/cycle to integer units and 3 micro-ops/cycle to FP/vector units.
- Supports AVX-512, BF16, and INT8. AMX defines an 8-register 2D file (16×64-byte rows); matrix multiply peaks at 2K INT8 ops per cycle.
- Branch Order Buffer: 128 entries capture renaming snapshots at branches. On misprediction, hardware flushes younger instructions and restores the snapshot instantly — older instructions continue executing uninterrupted without waiting for the mispredicted branch to reach the ROB head.
Memory Subsystem
- Throughput: 3×32-byte loads, 2×64-byte loads, and 2×64-byte stores per cycle.
- Load buffer: 192 entries fed by a 70-entry scheduler through 3 AGUs. Store buffer: 114 entries fed by a 38-entry scheduler through 2 AGUs and 2 store data units.
- Speculative memory disambiguation. Store forwarding supports partial-byte reads from pending stores.
- L1 Data Cache: 48 KB, 12-way, 5-cycle load-to-use latency, 16 outstanding miss fill buffers.
- L2 Cache: 1.25 MB (client) or 2 MB (server), 64 B/cycle, 15-cycle latency, 48 outstanding misses/prefetches.
- TLBs: DTLB has 96 entries for 4 KB pages, 32 for 2/4 MB, 8 for 1 GB. STLB has 2K entries with a 4-parallel-walk PMH.
Performance
- Single-thread: up to 5.5 GHz (client) or 4.2 GHz (server). CPI ranges from 0.32 to 0.89 (1.1–3.1 instructions per cycle).
- SMT yields 5–46% higher throughput than a single-threaded baseline, with the largest gains on workloads with unpredictable branches or random memory access patterns that underutilize the wide OOO core.
- Enabling SMT across all cores increases L1/L2/L3 miss rates due to cache sharing, but parallel execution of the second thread still yields a net effective throughput gain.