Cross-Cutting Issues: The Design of Memory Hierarchies

Autonomous Instruction Fetch Units

  • Deep pipelines and out-of-order execution rely on decoupled instruction fetch and decode units.
  • Fetch units access the instruction cache to retrieve entire blocks prior to decoding individual instructions.
  • Block-level fetching satisfies the bandwidth demands of multiple-issue processors and accommodates variable-length instructions.
  • Hardware prefetching directly impacts memory hierarchy metrics:
    • Fetch units proactively prefetch blocks into the L1 instruction cache.
    • Data prefetching mechanisms operate similarly for the data cache.
    • Prefetching raises the measured cache miss rate but lowers the total effective miss penalty, since prefetched blocks overlap memory latency with useful work.
    • Miss rates of processors with autonomous fetching cannot be directly compared to processors that fetch strictly per-instruction.
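The trade-off in the last two bullets can be shown with a small numeric sketch. All rates, access counts, and the fraction of latency hidden are illustrative assumptions, not measured values:

```python
# Toy comparison: prefetching raises the miss rate but lowers effective
# stall time. All numbers below are illustrative assumptions.

MISS_PENALTY = 100  # assumed cycles to fetch a block from the next level

def avg_miss_cycles(accesses, misses, hidden_fraction=0.0):
    """Return (miss rate, average stall cycles per access).
    Prefetching hides part of each miss's latency."""
    miss_rate = misses / accesses
    effective_penalty = MISS_PENALTY * (1 - hidden_fraction)
    return miss_rate, miss_rate * effective_penalty

# Without prefetching: 1000 demand accesses, 20 misses, nothing hidden.
rate_plain, stall_plain = avg_miss_cycles(1000, 20)

# With prefetching: 100 extra prefetch accesses add 10 extra misses,
# but overlap hides 70% of each miss's latency (assumed).
rate_pf, stall_pf = avg_miss_cycles(1100, 30, hidden_fraction=0.7)

assert rate_pf > rate_plain    # measured miss rate goes up...
assert stall_pf < stall_plain  # ...yet effective stall time goes down
```

This is why the last bullet holds: the two processors count different access streams, so raw miss rates are not comparable.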

Special Instruction Caches

  • Superscalar processors face critical bottlenecks in supplying instruction bandwidth.
  • Processors that translate instructions into microoperations (e.g., Arm, Intel i7) utilize a dedicated microoperation cache.
  • Caching recently translated microoperations significantly reduces both instruction fetch latency and branch misprediction penalties.
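The idea behind a microoperation cache can be sketched as a lookup table keyed by fetch address; the decoder and its cycle cost here are hypothetical, not any real pipeline's design:

```python
# Minimal sketch of a microoperation (uop) cache: decoded uops are keyed
# by fetch address, so a hit skips the translation step entirely.
# DECODE_COST and the 2-uop split are assumptions for illustration.

DECODE_COST = 3  # assumed cycles to translate one instruction into uops

class UopCache:
    def __init__(self):
        self.entries = {}       # fetch address -> list of uops
        self.decode_cycles = 0  # total cycles spent in the decoder

    def fetch(self, addr, instruction):
        if addr in self.entries:         # hit: reuse translated uops
            return self.entries[addr]
        uops = self.decode(instruction)  # miss: pay the decode latency
        self.entries[addr] = uops
        return uops

    def decode(self, instruction):
        self.decode_cycles += DECODE_COST
        return [f"{instruction}.uop{i}" for i in range(2)]  # fake 2-uop split

cache = UopCache()
for _ in range(10):               # a hot loop re-fetches the same addresses
    cache.fetch(0x400, "add")
    cache.fetch(0x404, "load")

# Only the first iteration pays decode cost: 2 instructions * 3 cycles.
assert cache.decode_cycles == 2 * DECODE_COST
```

The same reuse is what shortens the misprediction penalty: after a squash, the correct path is often already translated.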

Speculation and Memory Access

  • Speculative execution relies on branch prediction to execute instructions before verifying their necessity.
  • Protection Hazards: Speculative memory accesses can trigger protection exceptions. Hardware must suppress these exceptions unless the speculative instruction is confirmed to graduate and commit.
  • Performance Trade-offs:
    • Speculative accesses to both instruction and data caches increase the baseline cache miss rate.
    • Despite the higher miss rate, speculation often lowers the total overall cache miss penalty by initiating necessary memory fetches early.
    • Comparing miss rates between speculative and non-speculative processors is fundamentally misleading due to these extraneous accesses.
  • Security Vulnerabilities: Speculative memory accesses leave observable traces in cache state even when the instructions are squashed, opening side channels exploited by attacks such as Spectre.
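The protection-hazard bullet above (suppressing exceptions until commit) can be sketched as follows; the structures and names here are hypothetical, not any real pipeline's mechanism:

```python
# Sketch of deferring a protection fault until the speculative load
# reaches commit. A wrong-path load must never raise the exception.

class SpecLoad:
    """A speculative load; a protection violation is recorded, not raised."""
    def __init__(self, addr, protected):
        self.addr = addr
        self.faulted = protected  # latch the violation for commit time

def commit(load, mispredicted):
    """Raise the exception only if the load was on the correct path."""
    if mispredicted:
        return "squashed"         # wrong-path load: fault is discarded
    if load.faulted:
        raise PermissionError(f"protection fault at {hex(load.addr)}")
    return "committed"

# A wrong-path access to a protected page must not trap:
assert commit(SpecLoad(0xFFFF0000, protected=True), mispredicted=True) == "squashed"

# The same access on the correct path traps only at commit:
try:
    commit(SpecLoad(0xFFFF0000, protected=True), mispredicted=False)
    raise AssertionError("expected a protection fault")
except PermissionError:
    pass
```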

Coherency of Cached Data

  • Memory inconsistencies occur when multiple processors or I/O devices interact with shared cached data.
  • Multiprocessor Coherency: Shared-memory multiprocessors must maintain consistent copies of the same data across multiple distinct caches.
  • I/O Coherency: Interactions between I/O devices and memory require strict management to prevent reading or writing stale data.
    • I/O via Cache: Routing I/O directly through the cache guarantees consistency but stalls the processor and evicts valuable cache blocks.
    • I/O via Main Memory: Routing I/O to main memory (acting as an I/O buffer) avoids processor interference but requires mechanisms to handle stale cache data.
      • Write Policies: Write-through caches natively keep main memory updated, but modern memory hierarchies typically pair write-through L1 caches with write-back L2 caches.
      • Write Merging Limitations: Memory-mapped I/O registers demand precise single-address accesses and fail if writes are merged. Hardware handles this by marking specific I/O pages as requiring nonmerging write-through.
      • Software Invalidation (Input): Operating systems prevent stale data by marking I/O input pages as noncacheable or by explicitly flushing targeted buffer addresses before input occurs.
      • Hardware Invalidation (Input/Output): Hardware dynamically checks I/O addresses against cache tags, invalidating matching cache entries to guarantee memory consistency.
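The hardware-invalidation bullet can be sketched with a toy direct-mapped cache; the geometry (64-byte blocks, 256 sets) and method names are assumptions for illustration:

```python
# Sketch of hardware invalidation: on an I/O write to memory, the cache
# tags are checked and a matching block is invalidated so the processor
# cannot read stale data. Direct-mapped geometry is an assumption.

BLOCK = 64   # assumed block size in bytes
SETS = 256   # assumed number of sets

class Cache:
    def __init__(self):
        self.tags = [None] * SETS          # one tag per set; None = invalid

    def _index_tag(self, addr):
        block = addr // BLOCK
        return block % SETS, block // SETS

    def load(self, addr):
        """Return True on hit; fill the set on a miss."""
        idx, tag = self._index_tag(addr)
        hit = self.tags[idx] == tag
        self.tags[idx] = tag
        return hit

    def io_write_snoop(self, addr):
        """Check an I/O write's address against the tags; invalidate on match."""
        idx, tag = self._index_tag(addr)
        if self.tags[idx] == tag:
            self.tags[idx] = None

cache = Cache()
cache.load(0x1000)                 # CPU caches the block
cache.io_write_snoop(0x1000)       # device writes the same block to memory
assert cache.load(0x1000) is False # next CPU read misses -> fetches fresh data
```

The snoop step is the key design choice: consistency is enforced at I/O time, so ordinary loads need no extra checks.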

Protection via Virtual Machines (VMs)

  • Virtual machines enforce protection by running independent, isolated operating systems concurrently on shared hardware.
  • VMs rely on shadow page tables for address translation, compounding the complexity of memory management.
  • Shadow page tables significantly increase the cost of Translation Lookaside Buffer (TLB) misses by requiring more complex address mapping lookups.
  • Modern architectures integrate dedicated hardware mechanisms to accelerate the complex steps required during a VM-induced TLB miss.
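The cost of a virtualized TLB miss can be sketched by counting page-table memory accesses. With the hardware-assisted (nested, two-dimensional) walk that modern architectures use in place of shadow tables, every step of the guest walk itself needs a host-level walk. The 4-level depths below are assumptions in the style of x86-64 page tables:

```python
# Sketch of why a TLB miss costs more under virtualization: each guest
# page-table access needs its own guest-physical -> host-physical walk.
# Walk depths are assumed (4 levels each, x86-64-style).

GUEST_LEVELS = 4   # guest page-table walk depth (assumed)
HOST_LEVELS = 4    # host (nested) page-table walk depth (assumed)

def native_walk_accesses():
    # Unvirtualized: one memory access per page-table level.
    return GUEST_LEVELS

def nested_walk_accesses():
    # Each of the guest's G walk steps, plus translating the final
    # guest-physical address, requires a full H-level host walk:
    # (G + 1) * H host accesses plus the G guest accesses themselves.
    return (GUEST_LEVELS + 1) * HOST_LEVELS + GUEST_LEVELS

assert native_walk_accesses() == 4
assert nested_walk_accesses() == 24   # 6x the native cost
```

This multiplication is exactly what the dedicated hardware (nested-walk caches, large second-level TLBs) is built to hide.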