VLIW

Very Long Instruction Word (VLIW) architectures abandon dynamic hardware scheduling in favor of static, compiler-driven ILP extraction.

  • Instruction Formatting: Multiple independent operations are packaged into a single, fixed-width instruction or instruction packet.
    • A typical VLIW instruction might encode five distinct operations (e.g., one integer/branch, two floating-point, two memory references) spanning 80 to 120 bits.
  • Static Scheduling: The compiler guarantees that all operations within a VLIW instruction are independent and structurally compatible.
    • Local Scheduling: Applies to straight-line code generated by aggressively unrolling loops to fill available operation slots.
    • Global Scheduling: Moves code across branch boundaries (e.g., trace scheduling) when basic blocks lack sufficient ILP, at the cost of substantially more complex correctness and optimization trade-offs.
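The packaging step above can be sketched in a few lines. This is a minimal toy model, assuming the five-slot format described earlier (one integer/branch slot, two floating-point, two memory); the slot names and the `format_packet` helper are illustrative, not a real ISA encoding:

```python
# Toy VLIW packet formatter: the "compiler" has already chosen a set of
# independent operations for one cycle; this step only places them into
# fixed functional-unit slots, filling unused slots with explicit NOPs.

SLOTS = ["int/branch", "fp", "fp", "mem", "mem"]   # assumed slot layout

def format_packet(cycle_ops):
    """Place one cycle's independent (kind, text) operations into a
    fixed-width packet; every unused slot becomes an explicit NOP."""
    packet = ["nop"] * len(SLOTS)
    for kind, text in cycle_ops:
        for i, slot in enumerate(SLOTS):
            if packet[i] == "nop" and kind in slot:
                packet[i] = text
                break
        else:
            raise ValueError(f"no free {kind} slot in this packet")
    return packet

# Two loads issue together; three of the five slots are wasted as NOPs.
print(format_packet([("mem", "ld f0,0(r1)"), ("mem", "ld f2,8(r1)")]))
```

The NOP-filled slots in the printed packet preview the code-density problem discussed in the next section: every slot is encoded whether or not it does useful work.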

Limitations

Relying entirely on the compiler to format and schedule parallel execution exposes severe limitations in code density, pipeline synchronization, register usage, and binary compatibility across hardware generations.

  • Code Bloat: VLIW binaries suffer from massive code size expansion compared to equivalent dynamic superscalar binaries.
    • Aggressive loop unrolling physically replicates instruction sequences to expose ILP.
    • Unused functional units mandate the insertion of explicit NOPs (no-operations) within the wide instruction word, resulting in wasted encoding bits.
    • Mitigations include sharing one large immediate field across all functional units, or compressing instructions in main memory and expanding them at cache fill or decode time.
  • Lockstep Execution: Early VLIW designs lacked hardware hazard detection, forcing functional units to execute in strict lockstep.
    • A stall in any single functional unit (e.g., due to a data cache miss) forced the entire processor pipeline to halt to maintain synchronization.
    • Modern implementations decouple functional units and rely on hardware checks to permit unsynchronized execution after the initial issue stage.
  • Register Pressure: VLIW’s aggressive unrolling and static scheduling keeps far more values live simultaneously than a superscalar would. An equivalent loop body can require roughly twice as many registers in a VLIW schedule as in a dynamically scheduled implementation.
  • Loss of Binary Compatibility: VLIW code is tightly bound to the precise physical parameters of the target hardware.
    • Instructions are scheduled against specific functional unit latencies and execution port counts.
    • Modifying hardware parameters in newer processor generations breaks the static schedule, requiring complete recompilation of the software stack to maintain performance or functional correctness.
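The register-pressure point can be made concrete with a toy calculation. This is an illustrative sketch (the register names and helper are assumed, not taken from any real compiler): to let unrolled iterations overlap, software renaming gives each copy of the loop body its own registers, so the number of simultaneously live values grows with the unroll factor.

```python
# Toy model of software renaming during unrolling: each unrolled copy
# of the loop body gets fresh register names so the copies can be
# scheduled to overlap, multiplying the live-value count.

def rename_for_unroll(live_in_body, factor):
    """Return the register set an unrolled, software-renamed loop needs."""
    return {f"{reg}.{copy}" for copy in range(factor) for reg in live_in_body}

body = ["f0", "f2", "f4"]                 # values live in one iteration (toy body)
print(len(rename_for_unroll(body, 1)))    # rolled loop: 3 registers
print(len(rename_for_unroll(body, 4)))    # unrolled by 4: 12 registers
```

A dynamically scheduled superscalar hides much of this pressure in physical rename registers invisible to the ISA; a VLIW must expose every renamed value as an architectural register.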

EPIC

The Explicitly Parallel Instruction Computing (EPIC) approach, of which IA-64 is the primary example, attempts to address the core VLIW problems. It adds hardware support for more aggressive software speculation, relaxes the strict lockstep constraint, and introduces mechanisms to preserve binary compatibility across implementations with different issue widths and latencies.
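One of those hardware-supported speculation mechanisms can be sketched as a Python analogy. This is not real IA-64 semantics, only an illustration of the idea behind IA-64's speculative load and check: the compiler hoists a load above the branch that guards it, any fault is deferred into a NaT-style poison bit, and a later check instruction diverts to recovery code only if the speculation actually went wrong.

```python
# Python analogy of compiler-controlled (control) speculation in the
# IA-64 style: ld_s stands in for a speculative load (ld.s) that never
# faults immediately, and chk_s stands in for the later check (chk.s).

class SpecValue:
    def __init__(self, value=None, poisoned=False):
        self.value = value
        self.poisoned = poisoned        # stands in for IA-64's NaT bit

def ld_s(memory, addr):
    """Speculative load: on a would-be fault, poison the result
    instead of trapping, so the load can be hoisted above its branch."""
    if addr not in memory:
        return SpecValue(poisoned=True)
    return SpecValue(memory[addr])

def chk_s(spec, recovery):
    """Check: fall through with the value on success, invoke the
    compiler-generated recovery code if the speculation went bad."""
    return recovery() if spec.poisoned else spec.value

memory = {0x100: 7}
print(chk_s(ld_s(memory, 0x100), lambda: -1))   # speculation paid off
print(chk_s(ld_s(memory, 0x200), lambda: -1))   # recovery path taken
```

Because the fault is deferred rather than suppressed, the compiler can schedule loads aggressively early without changing program semantics when the guarded path is not taken.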