Foundations of Virtualization
- Definitions:
- Virtual Machine (VM): An efficient, isolated duplicate of a physical machine.
- Hypervisor / Virtual Machine Monitor (VMM): The software layer that abstracts and manages physical hardware resources (the host) and maps them to virtual resources for the VMs (the guests).
- Resource Management Techniques:
- Time sharing: Allocating CPU core cycles across multiple VMs.
- Partitioning: Dividing physical memory and CPU cores among VMs.
- Emulation: Mediating I/O calls through the hypervisor to interact with hardware on behalf of the VM.
- Utility Computing Enablers:
- Aggregation: Multiplexing multiple distinct software stacks onto physical servers with a high total cost of ownership (TCO) to maximize utilization.
- Fault tolerance and maintenance: Encapsulating a VM’s state into a bitstream allows for snapshots, recovery, and live migration between physical hosts without downtime.
- Scheduling efficiency: Enables packing VMs into fewer hosts, freeing machines for low-power idle states or large exclusive workloads.
- Isolation: Contains software crashes, bugs, and security incidents within a single VM boundary.
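The "scheduling efficiency" point above amounts to a bin-packing problem: fit VM resource demands onto as few hosts as possible. A minimal sketch using the first-fit-decreasing heuristic follows; the VM sizes, host capacity, and function name are illustrative, not from any real scheduler.

```python
# Sketch: first-fit-decreasing packing of VM memory demands onto hosts.
# Demands and capacities are in GiB; all values here are invented.

def pack_vms(demands_gib, host_capacity_gib):
    """Return a list of hosts, each a list of VM demands placed on it."""
    hosts = []  # each entry: [remaining capacity, [vm demands]]
    for demand in sorted(demands_gib, reverse=True):
        for host in hosts:
            if host[0] >= demand:          # first host with room wins
                host[0] -= demand
                host[1].append(demand)
                break
        else:                              # no host fits: power one on
            hosts.append([host_capacity_gib - demand, [demand]])
    return [h[1] for h in hosts]

placement = pack_vms([8, 2, 4, 4, 6, 8], host_capacity_gib=16)
print(len(placement))  # 2 -- six VMs consolidated onto two hosts
```

Packing six VMs onto two fully used hosts leaves the remaining machines free for low-power idle states, as the notes describe.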
This foundational software abstraction requires hardware to be inherently capable of strictly enforcing virtualization boundaries.
Architecture Virtualizability
- Hypervisor Conditions: A VMM must guarantee Safety (total isolation and resource control), Equivalence (identical behavior to native hardware), and Efficiency (performance close to native hardware).
- Popek and Goldberg Theorem: Defines an Instruction Set Architecture (ISA) as virtualizable if:
- The ISA defines at least two operational modes (e.g., user and system).
- The set of sensitive instructions (those that observe or alter system state) is a strict subset of privileged instructions (those executable only in system mode).
- Legacy x86 Challenges: The original x86 ISA failed the Popek-Goldberg requirements by containing 17 sensitive but non-privileged instructions (e.g., using `push` to read the `%cs` register, allowing a guest OS to improperly observe its execution ring).
- Early Software Mitigations:
- Dynamic Binary Translation (DBT): Hypervisors (e.g., VMware Workstation) dynamically scan and replace problematic instructions with safe, emulated sequences.
- Paravirtualization: The guest OS is modified to remove sensitive instructions and interact directly with the hypervisor (e.g., Xen).
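The Popek-Goldberg condition above reduces to a one-line set check. A minimal sketch (the instruction names are illustrative):

```python
# Toy check of the Popek-Goldberg condition: an ISA is classically
# virtualizable iff every sensitive instruction is also privileged
# (i.e., traps when executed outside system mode).

def virtualizable(sensitive, privileged):
    return set(sensitive) <= set(privileged)

# A well-behaved ISA: all sensitive instructions trap in user mode.
print(virtualizable({"mov_cr3", "hlt"}, {"mov_cr3", "hlt", "invlpg"}))  # True

# Legacy x86: push %cs observes privileged state but never traps.
print(virtualizable({"mov_cr3", "push_cs"}, {"mov_cr3", "hlt"}))  # False
```

The second call is exactly the legacy-x86 failure: a sensitive instruction that executes silently in user mode, so the hypervisor never gets a chance to intervene.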
Because software mitigations like DBT and paravirtualization introduce severe performance penalties, hardware extensions were introduced to manage execution modes natively.
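As a rough illustration of the DBT idea, here is a toy translator that rewrites sensitive-but-unprivileged instructions into explicit hypervisor traps. Real DBT operates on machine code at runtime and caches translated blocks; the string-based instruction names here are invented.

```python
# Toy dynamic binary translator in the spirit of early VMware: scan a
# basic block and replace problematic instructions with safe sequences
# that trap into the VMM for emulation.

SENSITIVE_UNPRIVILEGED = {"push_cs", "sgdt", "popf"}  # illustrative subset

def translate_block(block):
    out = []
    for insn in block:
        if insn in SENSITIVE_UNPRIVILEGED:
            out.append(f"trap_to_vmm({insn})")  # emulated safely in the VMM
        else:
            out.append(insn)  # innocuous instructions run unmodified
    return out

print(translate_block(["mov", "push_cs", "add"]))
# ['mov', 'trap_to_vmm(push_cs)', 'add']
```

The per-block scan and the trap-and-emulate detour on every sensitive instruction are where the performance penalty noted above comes from.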
Architectural Support for Virtualization
- Motivations: Eliminate the need for DBT and paravirtualization, avoid mode/ring compression (which forces the guest OS and user applications into the same privilege ring, reducing isolation), and minimize transition overhead.
- x86 VT-x Extensions:
- Orthogonal Modes: Introduces root mode (for the hypervisor) and non-root mode (for guest VMs), preserving standard rings (0 for guest OS, 3 for guest apps) within the non-root mode.
- Virtual Machine Control Structure (VMCS): A physical memory structure that encapsulates all configuration registers and state for a specific VM.
- Transitions:
- `vmlaunch` and `vmresume` load the VMCS and shift the processor into non-root mode.
- `vmexit` events: The hardware automatically traps back to root mode if the VM attempts to access sensitive root-mode state, receives an external interrupt, or explicitly invokes the hypervisor using `vmcall`.
- RISC-V and ARM Extensions:
- RISC-V (H extension): Defines an H (hypervisor) mode that operates orthogonally to standard System (S) and User (U) modes.
- ARM: Implements EL2 as a strictly more privileged virtualization mode, and EL3 (TrustZone) as an orthogonal secure monitor mode.
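The root/non-root transitions can be pictured as a dispatch loop: the hypervisor resumes the guest, the hardware hands back an exit event, and the hypervisor handles it in root mode. A conceptual model follows; the VMCS is a plain dict and the exit reasons are simplified, not real VT-x field names or exit codes.

```python
# Conceptual model of VT-x transitions: resume the guest, handle the
# resulting vmexit in root mode, and resume again until the guest idles.

def run_guest(vmcs, pending_exits):
    # Stand-in for vmlaunch/vmresume: enter non-root mode, then return
    # the next vmexit event the hardware would deliver.
    vmcs["mode"] = "non-root"
    return pending_exits.pop(0)

def dispatch_loop(vmcs, pending_exits):
    handled = []
    while pending_exits:
        exit_reason = run_guest(vmcs, pending_exits)
        vmcs["mode"] = "root"  # hardware traps back to root mode
        if exit_reason == "vmcall":
            handled.append("hypercall")         # explicit guest request
        elif exit_reason == "external_interrupt":
            handled.append("interrupt")         # host-side event
        else:
            handled.append("emulate:" + exit_reason)  # sensitive access
    return handled

vmcs = {"mode": "root", "guest_rip": 0x1000}
print(dispatch_loop(vmcs, ["external_interrupt", "cr_access", "vmcall"]))
# ['interrupt', 'emulate:cr_access', 'hypercall']
```

The key property the hardware provides is that every entry in `handled` was produced without any binary translation or guest modification: the trap back to root mode is automatic.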
Securing processor execution modes is only the first step; fully isolating virtual machines requires dedicated hardware to translate and map distinct memory spaces.
Memory Virtualization
- Address Space Hierarchy: Virtualization creates three distinct memory layers: Guest Virtual Memory (GVM), Guest Physical Memory (GPM), and Host Physical Memory (HPM).
- Shadow Page Tables: A legacy software approach where the hypervisor intercepts all OS memory management events to maintain a direct GVM-to-HPM mapping.
- Hardware Extended Page Tables (EPT):
- The MMU and TLB natively process two sets of mappings: the guest-managed GVM-to-GPM tables, and the hypervisor-managed GPM-to-HPM tables.
- A hardware page table walker traverses both table structures without invoking the hypervisor or guest OS.
- Translation Overhead:
- Nested translations multiplicatively increase the number of memory accesses on a TLB miss, because each step of the guest page walk itself requires a full EPT walk.
- For a 64-bit architecture using 4-level page tables, an EPT TLB miss requires up to 24 memory accesses: each of the 4 guest page-table reads needs a 4-step EPT walk plus the read itself (4 × 5 = 20), and translating the final guest physical address adds one more 4-step EPT walk, versus 4 accesses natively.
- Mitigations: Large pages (e.g., 2 MiB) and dedicated MMU caches are essential to increase TLB coverage and reduce the depth of the hardware page walk.
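The cost of a two-dimensional page walk can be captured in a short formula: every guest page-table pointer is a guest-physical address that needs its own EPT walk before the entry can even be read. A minimal sketch (the level counts are parameters, not tied to any real MMU):

```python
# Memory-access cost of a nested (two-dimensional) page walk on a TLB
# miss, with n guest page-table levels and m EPT levels.

GUEST_LEVELS = 4   # GVM -> GPM page-table depth
EPT_LEVELS = 4     # GPM -> HPM page-table depth

def nested_walk_accesses(guest_levels, ept_levels):
    # Each of the guest_levels table reads first needs a full EPT walk
    # (ept_levels accesses) to translate its pointer, plus the read
    # itself; the final guest-physical address needs one last EPT walk.
    return guest_levels * (ept_levels + 1) + ept_levels

print(nested_walk_accesses(GUEST_LEVELS, EPT_LEVELS))  # 24
print(nested_walk_accesses(GUEST_LEVELS, 0))           # 4: native, no EPT
```

Shrinking either factor is exactly what the mitigations do: large pages cut the guest walk depth, and MMU paging-structure caches short-circuit repeated EPT levels.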
With memory accesses safely translated in hardware via EPTs, I/O devices must also be securely mapped to allow direct hardware interaction from guest virtual machines.
I/O Virtualization
- Baseline Interception: Historically, hypervisors intercepted all Programmed I/O (PIO), Memory-Mapped I/O (MMIO), and interrupts. While this enables live migration, it causes severe latency and throughput bottlenecks for high-speed network and NVMe devices.
- Direct Access Targets: The goal is to allow the guest OS to program I/O via MMIO, enable the device to copy data directly via DMA, and allow the device to interrupt the VM directly.
- IOMMU (e.g., Intel VT-d):
- DMA Remapping (DMAR): Translates DMA memory accesses from GPM to HPM. The IOMMU uses the PCIe device identity (8-bit bus, 5-bit device, 3-bit function) to look up a context identifier, which points to the correct translation page tables. Translations are cached in a dedicated IOMMU TLB.
- Interrupt Remapping (IR): Uses the device identifier to locate the target VMCS and update its Virtual APIC. This enables posted interrupts—exitless interrupt delivery directly to the running VM without waking the hypervisor.
- Single-Root I/O Virtualization (SR-IOV): Divides a physical PCIe device into one Physical Function (PF, managed by the hypervisor) and multiple Virtual Functions (VFs, mapped directly into the GPM of guest VMs for direct access).
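The DMAR lookup described above can be sketched end to end: decode the 16-bit PCIe requester ID (8-bit bus / 5-bit device / 3-bit function), use it to find the device's context, and translate the DMA's guest-physical address through that context's tables. The table contents and single-level page map below are illustrative simplifications of the real multi-level structures.

```python
# Sketch of IOMMU DMA remapping: requester ID -> context -> GPM-to-HPM
# translation. All table contents here are invented.

PAGE = 4096

def decode_bdf(requester_id):
    bus = (requester_id >> 8) & 0xFF   # 8-bit bus
    dev = (requester_id >> 3) & 0x1F   # 5-bit device
    fn = requester_id & 0x7            # 3-bit function
    return bus, dev, fn

# Context table: requester ID -> that domain's GPM-page -> HPM-page map.
contexts = {0x0100: {0x0: 0x7000, 0x1: 0x3000}}  # bus 1, dev 0, fn 0

def dma_remap(requester_id, guest_phys):
    page_map = contexts[requester_id]           # context-table lookup
    host_page = page_map[guest_phys // PAGE]    # GPM -> HPM translation
    return host_page + guest_phys % PAGE        # preserve page offset

print(decode_bdf(0x0100))               # (1, 0, 0)
print(hex(dma_remap(0x0100, 0x1234)))   # 0x3234
```

In hardware the result of this lookup is cached in the IOMMU TLB, so only the first DMA to a page pays the walk cost.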
Hardware mechanisms for CPU, EPT memory, and IOMMU routing are orchestrated by specialized software layers known as hypervisors.
Hypervisor Implementations (The Xen Project)
- Classifications: Type-1 hypervisors (bare-metal) run directly on hardware in root-mode ring 0. Type-2 hypervisors (hosted) run as an application on top of a host OS.
- Xen Architecture:
- Dom0 (Privileged domain): Boots first, manages regular guest VMs, and runs backend hardware I/O drivers.
- Guest domains: Run frontend drivers that call into Dom0 to complete I/O requests.
- Driver domains: Unprivileged VMs responsible for specific I/O devices, utilized to isolate driver crashes and prevent Dom0 from becoming a system bottleneck.
- Evolution: Originally relied heavily on paravirtualization (modifying the Linux kernel and mapping Xen into the upper 64 MiB of VM address space to avoid TLB flushes). Xen later evolved to fully leverage VT-x and VT-d, allowing it to run unmodified operating systems efficiently.
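The frontend/backend split can be illustrated with a toy request ring, loosely modeled on Xen's shared-memory I/O rings: the guest's frontend queues requests, and the backend in Dom0 (or a driver domain) drains them against the real device. The ring, request format, and device model below are simplified inventions.

```python
# Toy split-driver I/O path: frontend in the guest, backend in Dom0.

from collections import deque

shared_ring = deque()  # stands in for a shared-memory ring between domains

def frontend_submit(sector):
    # Guest-side frontend driver: queue a request, no hardware access.
    shared_ring.append({"op": "read", "sector": sector})

def backend_service(device):
    # Dom0/driver-domain backend: drain the ring against the real device.
    responses = []
    while shared_ring:
        req = shared_ring.popleft()
        responses.append(device[req["sector"]])  # real backend does the I/O
    return responses

disk = {0: b"boot", 7: b"data"}
frontend_submit(7)
frontend_submit(0)
print(backend_service(disk))  # [b'data', b'boot']
```

Moving `backend_service` into a dedicated driver domain is exactly the isolation step the notes describe: a crash in the backend driver then takes down only that VM, not Dom0.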
Even with a robust hypervisor isolating workloads, executing sensitive workloads on shared public infrastructure requires ultimate data protection at the silicon level.
Confidential Computing and Secure Enclaves
- Purpose: Provides strong guarantees of data confidentiality and integrity against compromised hypervisors, malicious host operating systems, or physical hardware bus probes.
- Secure Enclaves (e.g., Intel SGX):
- Code and data remain encrypted in HPM using keys generated and held by the hardware, never exposed to system software.
- Decryption occurs exclusively inside the CPU die during active execution.
- Attestation: The hardware generates a cryptographic hash of the enclave’s initial state, allowing an external service to verify that the code and data have not been tampered with.
- Execution Constraints: Enclaves do not process interrupts, page faults, or `vmexit` events. If interrupted, the processor saves the enclave state to encrypted memory, overwrites registers with synthetic values, exits secure mode, and invokes standard system handlers.
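The attestation idea can be sketched in a few lines: hash the enclave's initial code and data into a measurement that a remote verifier compares against an expected value. Real SGX measurements (MRENCLAVE) also cover page layout and permissions; this sketch collapses all of that into a single SHA-256 over two byte strings.

```python
# Sketch of enclave attestation: measurement = hash of initial state.

import hashlib

def measure(code: bytes, data: bytes) -> str:
    # Separator byte keeps (code, data) boundaries unambiguous.
    return hashlib.sha256(code + b"\x00" + data).hexdigest()

# The verifier knows the expected measurement of the legitimate enclave.
expected = measure(b"enclave_code_v1", b"initial_data")

# An untampered enclave matches; any modification is detected.
print(measure(b"enclave_code_v1", b"initial_data") == expected)  # True
print(measure(b"enclave_code_v2", b"initial_data") == expected)  # False
```

In the real protocol the hardware additionally signs the measurement with a device-specific key, so the verifier can trust that the hash was computed by genuine silicon rather than by the (potentially compromised) host software.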
These comprehensive hardware protections ensure that individual virtual environments remain secure, performant, and verifiable components of the overarching distributed system.