Virtualization

Virtualization was first developed in the late 1960s for mainframe computing. It was largely ignored for single-user and server machines through the 1980s–90s, then revived in the 2000s — first to let developers run multiple OS variants on one machine for testing without buying extra hardware, then for enterprise server consolidation (higher utilization, running old software stacks on new hardware). These enterprise use cases directly inspired cloud computing at WSC scale, with virtualization as the core enabling technology.

Foundations of Virtualization

  • Definitions:
    • Virtual Machine (VM): An efficient, isolated duplicate of a physical machine. The software layer managing VMs is the hypervisor or Virtual Machine Monitor (VMM); it maps physical host resources to virtual guest resources.
    • Type-1 (bare-metal): Hypervisor boots directly on hardware — no host OS. Runs in root mode ring 0. WSCs exclusively use type-1.
    • Type-2 (hosted): A full host OS owns the machine in root mode ring 0. The hypervisor runs as a user-space application on that host OS — in root mode ring 3. Common on personal computers (e.g., VMware Workstation, VirtualBox).
    • System VMs (focus here): The guest ISA matches the underlying hardware ISA — the VM presents the illusion of an entire computer. Examples: IBM VM/370, VMware ESX Server, Xen.
    • Language VMs: Define “machine” as the user-level ABI rather than the ISA. Examples: Java VM, JavaScript engines. Not covered further here.
  • Resource Management Techniques:
    • Time sharing: CPU cores shared across VMs, like an OS time-sharing cores across processes.
    • Partitioning: Physical memory and CPU cores divided among VMs; each VM accesses a subset of pages.
    • Emulation: I/O calls intercepted and mediated by the hypervisor, which accesses devices and returns results to the VM.
  • Hypervisor Scope: The core isolation layer is ~10,000 lines of code — far smaller than a traditional OS because it omits upper-half functionality: no graphics, UI, high-level APIs, or libraries. All popular ISAs now include hardware support for hypervisors. Cloud providers are also increasingly offloading hypervisor functions to SmartNICs, freeing host CPU cycles for client VMs.
  • Utility Computing Enablers:
    • Aggregation: Multiplexing multiple distinct software stacks onto large physical servers to maximize utilization.
    • Fault tolerance and maintenance: A VM’s full state is a collection of bits that can be saved (snapshots), copied to another host (live migration for hitless maintenance), or replicated for active/passive fault tolerance.
    • Scheduling efficiency: VMs can be packed into fewer hosts to enable low-power idle states on freed machines, reserved for large exclusive workloads, or migrated to load-balance the WSC network and storage. VMs causing performance interference can be spaced apart.
    • Isolation: Crashes, bugs, and security incidents in one VM are contained within its boundary and cannot affect co-located VMs.

Architecture Virtualizability

  • Hypervisor Conditions: A VMM must guarantee three properties:
    • Safety: The hypervisor is isolated from all guests; guests are isolated from each other; guests cannot directly alter physical resource allocation.
    • Equivalency: Software in a guest VM behaves exactly as it would on native hardware, producing identical outputs and side effects, except for differences caused by performance or resource availability (e.g., the amount of physical memory).
    • Efficiency: Guest performance is as close to native as possible; overheads come from shared/partitioned resources and multiplexing.
  • Why the hypervisor needs higher privilege: The hypervisor must control access to privileged state, address translation, I/O, exceptions, and interrupts — even while a guest is actively using them. Example: on a timer interrupt the hypervisor suspends the running VM, saves its state, handles the interrupt, picks the next VM, and loads its state. The guest is provided a virtual timer and an emulated timer interrupt instead of the real one. System state — all configuration registers and memory structures managing processor state, interrupts, MMU, security, and resource allocation — must be exclusively under hypervisor control.
  • Popek and Goldberg Theorem (1974): An ISA is virtualizable if:
    1. It defines at least two modes: system mode and user mode.
    2. Sensitive instructions (those that observe or alter system state) are a subset of privileged instructions (executable only in system mode).
    • When these conditions hold, any sensitive instruction executed in user mode traps to system mode, giving the hypervisor — running alone in system mode — full control to intercept and virtualize as needed.
  • Legacy x86 Challenges: The original x86 ISA had 17 sensitive but non-privileged instructions. Example: push could read %cs from ring 1 or ring 3 (x86 user modes); %cs encodes the current privilege ring, so a de-privileged guest OS could observe that it was not in kernel mode (ring 0). A user-space sketch of this leak appears after this list.
  • Early Software Mitigations:
    • Dynamic Binary Translation (DBT): VMware Workstation scanned pages of guest code as they loaded, replacing problematic instructions with safe, emulated sequences.
    • Paravirtualization: The guest OS was modified to eliminate problematic instructions and call the hypervisor directly (Xen).
    • Both techniques remain useful for addressing performance shortcomings even in hardware-supported environments.
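
A minimal user-space sketch of the leak described above, assuming an x86-64 Linux host and GCC/Clang inline assembly; it reads %cs with mov rather than push, but demonstrates the same sensitive-but-unprivileged behavior:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint16_t cs;
        /* Reading %cs never traps on legacy x86, yet its low two bits are the
         * Current Privilege Level, so unprivileged code learns its ring. */
        __asm__ volatile("mov %%cs, %0" : "=r"(cs));
        printf("cs = 0x%04x, CPL = %u\n", cs, cs & 3);  /* prints CPL = 3 in user mode */
        return 0;
    }

A guest kernel de-privileged into ring 1 or ring 3 could run the same check and detect that it was not in ring 0, which is exactly what DBT and paravirtualization had to hide.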

Architectural Support for Virtualization

  • Motivations: Three factors drove hardware support:
    1. Ensure all sensitive instructions are privileged, eliminating the need for DBT and paravirtualization.
    2. Avoid mode/ring compression — without a dedicated hypervisor mode, the hypervisor occupies system mode, forcing the guest OS and applications to share user mode, reducing their isolation.
    3. Reduce excessive VM↔hypervisor transitions caused by unmodified OSes frequently executing sensitive instructions.
  • x86 VT-x Extensions:
    • Orthogonal Modes: Adds root mode (hypervisor) orthogonal to existing x86 rings. Guest VMs run in non-root mode with OS in ring 0 and apps in ring 3 — no changes to the guest’s hardware view, no mode compression.
    • VMCS (Virtual Machine Control Structure): Physical memory structure holding all configuration registers and state for one VM. The hypervisor uses it to save, restore, read, write, and schedule VMs — analogous to how an OS manages process state.
    • vmlaunch / vmresume: Atomically copy VMCS into processor registers and switch to non-root mode to run the VM.
    • vmexit events: Hardware traps to root mode when the VM (1) accesses root-mode sensitive state or uses a root-mode privileged instruction, (2) receives an external or unhandleable interrupt, or (3) calls the hypervisor via vmcall. vmcall is to the hypervisor what syscall is to the OS.
    • Most instructions don’t cause vmexit: Hardware provides each VM duplicate copies of sensitive state, so the guest OS can read and write its configuration registers without invoking the hypervisor. Only accesses that would affect root-mode state trap out. This is how VT-x satisfies Popek-Goldberg without making every sensitive instruction privileged in the traditional sense. A minimal run-loop sketch using the Linux KVM API follows the figure description below.
  • RISC-V and ARM Extensions:
    • RISC-V (H extension): H mode is orthogonal to S (system) and U (user) modes — the hypervisor runs in a fully independent privilege dimension.
    • ARM EL2: Strictly more privileged than EL0 (user) and EL1 (system) but not orthogonal — it sits in the same privilege hierarchy. EL3 (TrustZone secure monitor) is orthogonal to all other modes and splits resources between a trusted and non-trusted execution environment; the trusted side can implement a software TPM (Trusted Platform Module).

VT-x root and non-root modes: guest OS and apps run in non-root mode (rings 0 and 3). WSCs use type-1 hypervisors that boot directly on hardware in root mode ring 0. On personal computers a host OS occupies root mode ring 0 and the type-2 hypervisor runs as a host application in root mode ring 3.
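
The VMCS, vmlaunch/vmresume, and vmexit machinery is normally driven through an OS interface rather than raw VMX instructions. The sketch below is a hedged illustration using the Linux KVM API (/dev/kvm), which programs VT-x (or its AMD and ARM equivalents) underneath: each return from the KVM_RUN ioctl corresponds to a vmexit that the in-kernel hypervisor has forwarded to user space. Error handling and API-version checks are omitted, and the hand-assembled guest bytes plus the port 0x3f8 pseudo-serial device are illustrative choices, not anything the text above prescribes.

    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Hand-assembled 16-bit real-mode guest: write 'H' and '\n' to port 0x3f8, then hlt. */
        const uint8_t code[] = {
            0xba, 0xf8, 0x03,   /* mov dx, 0x3f8 */
            0xb0, 'H',          /* mov al, 'H'   */
            0xee,               /* out dx, al    */
            0xb0, '\n',         /* mov al, '\n'  */
            0xee,               /* out dx, al    */
            0xf4,               /* hlt           */
        };

        int kvm  = open("/dev/kvm", O_RDWR | O_CLOEXEC);
        int vmfd = ioctl(kvm, KVM_CREATE_VM, 0);

        /* One page of guest physical memory (GPM) at GPA 0x1000, backed by host memory. */
        uint8_t *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        memcpy(mem, code, sizeof(code));
        struct kvm_userspace_memory_region region = {
            .slot = 0, .guest_phys_addr = 0x1000,
            .memory_size = 0x1000, .userspace_addr = (uint64_t)mem,
        };
        ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);

        int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
        long mmap_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, NULL);
        struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpufd, 0);

        /* Start the vCPU in real mode at guest address 0x1000 (VMCS setup handled by KVM). */
        struct kvm_sregs sregs;
        ioctl(vcpufd, KVM_GET_SREGS, &sregs);
        sregs.cs.base = 0;
        sregs.cs.selector = 0;
        ioctl(vcpufd, KVM_SET_SREGS, &sregs);
        struct kvm_regs regs = { .rip = 0x1000, .rflags = 0x2 };
        ioctl(vcpufd, KVM_SET_REGS, &regs);

        /* The hypervisor run loop: enter non-root mode, then handle each vmexit. */
        for (;;) {
            ioctl(vcpufd, KVM_RUN, NULL);
            switch (run->exit_reason) {
            case KVM_EXIT_IO:        /* guest executed a PIO instruction: emulate the device */
                if (run->io.direction == KVM_EXIT_IO_OUT && run->io.port == 0x3f8)
                    putchar(*((char *)run + run->io.data_offset));
                break;
            case KVM_EXIT_HLT:       /* guest executed hlt: stop the VM */
                return 0;
            default:
                fprintf(stderr, "unhandled vmexit %u\n", run->exit_reason);
                return 1;
            }
        }
    }

The out instruction surfaces as KVM_EXIT_IO (the hypervisor emulates the device) and hlt surfaces as KVM_EXIT_HLT; everything else the guest does, including reads and writes of its own configuration registers, runs in non-root mode without an exit.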

Memory Virtualization

Memory virtualization is critical to both the safety and performance of hypervisors. The guest OS manages its own virtual memory on top of a guest physical address space, while the hypervisor maps that guest physical space onto real hardware memory — creating a two-level translation chain that must be handled efficiently without compromising isolation.

  • Address Space Hierarchy: Three distinct layers exist in a virtualized system:
    • HPM (Host Physical Memory): The actual DRAM. The hypervisor allocates portions to each VM and ensures VMs cannot read or write each other’s allocation.
    • GPM (Guest Physical Memory): A linear address space managed by the guest OS. The guest OS has no visibility into how the hypervisor maps GPM onto HPM. Infrequently used GPM pages may be paged to secondary storage by the hypervisor.
    • GVM (Guest Virtual Memory): Per-process address space managed by the guest OS on top of GPM.
    • One HPM shared across all VMs; one GPM per VM; one GVM per guest process.
  • Shadow Page Tables: The pre-EPT software approach. The hypervisor intercepts all OS memory management events (MMU and TLB management instructions) and constructs a direct GVM→HPM mapping used to refill the TLB. The hypervisor also decides whether each GPM page is backed by HPM or secondary storage. Shadow paging depends on DBT or paravirtualization to intercept every OS memory event and is considered the most complex component of a hypervisor.
  • Hardware Extended Page Tables (EPT): VT-x extends the MMU and TLB to natively process both the guest-managed GVM→GPM tables and the hypervisor-managed GPM→HPM tables. On a TLB miss the hardware page table walker traverses both structures automatically. The guest OS is invoked on a GVM/GPM page fault; the hypervisor is invoked on a GPM/HPM page fault. ARM and RISC-V define equivalent support.
  • Translation Overhead: On a TLB miss in a non-virtualized x86 system with 4-level page tables, the hardware walker accesses 4 memory locations (PML4 → PDPT → PD → PT). With EPT, each of those 4 guest table reads first requires a 4-level GPM→HPM walk to translate its GPM address (the base register CR3 is itself a GPM address), and the final guest physical address needs one more 4-level walk: 4 × (4 + 1) + 4 = 24 memory accesses per TLB miss. A sketch of this arithmetic follows this list.
  • Mitigations: Larger TLBs, 2 MiB pages (supported by Intel and AMD) to increase TLB coverage, and MMU caches that hold multilevel page table entries (returning several linked entries per access) to reduce walk cost.
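
A short sketch of the walk-cost arithmetic in the Translation Overhead bullet above; nested_walk_accesses is a hypothetical helper, and the closed-form count g × (e + 1) + e simply restates the 24-access breakdown rather than describing any particular MMU:

    #include <stdio.h>

    /* Memory accesses for one TLB miss under nested paging: each of the g guest
     * page-table levels needs an e-level GPM->HPM walk to translate the table's
     * GPM address plus one access to read the guest entry itself, and the final
     * guest physical address needs one more e-level walk. */
    static int nested_walk_accesses(int g, int e)
    {
        return g * (e + 1) + e;
    }

    int main(void)
    {
        printf("non-virtualized 4-level walk:   4 accesses\n");
        printf("4-level guest over 4-level EPT: %d accesses\n",
               nested_walk_accesses(4, 4));   /* prints 24 */
        return 0;
    }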

AMAT Example: Non-virtualized x86, baseline AMAT = 2.2 cycles (cache + L1 TLB misses), L2 TLB miss rate = 0.2%, each page table access = 50 cycles:

    AMAT = 2.2 + 0.002 × (4 × 50) = 2.2 + 0.4 = 2.6 cycles

With virtualization (24 accesses per walk, so 24 × 50 = 1200 cycles per walk), keeping AMAT = 2.6 cycles requires the L2 TLB miss rate to drop to 0.4 / 1200 ≈ 0.033%. If the miss rate can only reach 0.05%, the total walk latency W must satisfy 2.2 + 0.0005 × W ≤ 2.6, i.e., W ≤ 800 cycles, or roughly 33 cycles per page table access instead of 50.

An MMU cache holding multilevel entries is a practical way to hit this target.
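
The same numbers worked in a short program; the 0.05% achievable miss rate and the 50-cycle access time are the example's assumptions, and the variable names are invented for illustration:

    #include <stdio.h>

    int main(void)
    {
        double base      = 2.2;    /* cycles: caches + L1 TLB misses (given)    */
        double access    = 50.0;   /* cycles per page table access (given)      */
        double miss_rate = 0.002;  /* non-virtualized L2 TLB miss rate (given)   */

        /* Non-virtualized: a walk is 4 accesses. */
        double amat = base + miss_rate * 4 * access;
        printf("native AMAT            = %.2f cycles\n", amat);               /* 2.60 */

        /* Virtualized: a walk is 24 accesses at the same 50 cycles each. */
        double walk_virt   = 24 * access;                                     /* 1200 */
        double needed_miss = (amat - base) / walk_virt;
        printf("miss rate to hold AMAT = %.4f%%\n", needed_miss * 100);       /* ~0.0333 */

        /* If the miss rate bottoms out at 0.05%, the walk itself must get faster. */
        double reachable = 0.0005;
        double max_walk  = (amat - base) / reachable;
        printf("max walk latency       = %.0f cycles (%.1f per access)\n",
               max_walk, max_walk / 24);                                      /* 800, ~33.3 */
        return 0;
    }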

EPT address translation for x86: CR3 holds the base of the GVM→GPM page table (a GPM address); EPTP holds the base of the GPM→HPM page table (an HPM address). On a TLB miss, the hardware walker traverses both 4-level tables, requiring 24 main memory accesses in total.

I/O Virtualization

  • Baseline Interception: The hypervisor intercepts all VM interactions with I/O hardware via vmexit on PIO instructions, loads/stores to MMIO addresses, and I/O interrupts. DBT or paravirtualization can also introduce explicit vmcalls for I/O. Interposition has three benefits:
    • Emulation: The hypervisor translates the I/O interface assumed by the guest OS to whatever hardware is actually present, letting old VMs run against emulated devices and letting unmodified VMs benefit from new hardware features their older drivers cannot configure.
    • Live migration: Because the hypervisor observes all I/O state, it can quiesce I/O, pause the VM, copy all state (registers, memory, hypervisor-tracked state) to another machine, and resume — the foundation of live migration.
    • Policy: The hypervisor can enforce per-VM resource limits, e.g., cap the network bandwidth each VM can generate.
  • Performance Problem: Interposition overhead becomes prohibitive for high-speed devices. A 200 Gbps NIC can generate up to 400 million packets per second — intercepting every packet is too expensive. This motivates hardware support for direct VM access to I/O.
  • Direct Access Target: Ideally: (1) guest OS programs the device directly via MMIO; (2) the device DMA-copies data directly from VM memory; (3) the device interrupts the VM directly — all without hypervisor involvement. EPT already enables step 1 by mapping HPM addresses of I/O device registers into the GPM or GVM address space of the guest.
  • IOMMU / VT-d (steps 2 and 3): ARM and RISC-V define equivalent mechanisms.
    • DMA Remapping (DMAR): When the DMA engine issues a memory access, the IOMMU uses the PCIe device identity (8-bit bus, 5-bit device, 3-bit function) to look up a context identifier pointing to the correct page tables, then translates GPM→HPM. Translations are cached in the IOMMU TLB. Early IOMMUs lacked a hardware page table walker; on a TLB miss they invoked the hypervisor to perform the translation. The IOMMU TLB faces dual pressure: high capacity (low miss rate) and high throughput (400M DMA accesses/sec per direction for a 200 Gbps NIC). A sketch of the device-identity encoding appears after this list.
    • Interrupt Remapping (IR): When a device generates an interrupt it sends a message to the core’s APIC. Under VT-d, each VM has a virtual APIC memory structure in its VMCS. The IR uses the device identity to look up the target VMCS, rewrites the interrupt message, and updates the virtual APIC. The physical APIC then notifies the core running the VM. The physical APIC decides whether to raise the interrupt immediately or defer it — e.g., if the guest OS is already handling a higher-priority interrupt. When raised, the guest interrupt handler runs without any hypervisor intervention. This is posted interrupts. Posted interrupts also apply to inter-processor interrupts (IPIs) between two VMs or two processes in the same VM.
  • SR-IOV: Divides a physical PCIe device into one Physical Function (PF, managed by the hypervisor) and multiple Virtual Functions (VFs, mapped directly into guest VMs for direct access without sharing overhead).
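
A small sketch of the device identity the IOMMU keys on; the bdf helper and the example bus/device/function values are hypothetical, while the 8/5/3 bit layout matches the PCIe requester ID described in the DMAR bullet above:

    #include <stdint.h>
    #include <stdio.h>

    /* PCIe requester ID: 8-bit bus, 5-bit device, 3-bit function (16 bits total). */
    static uint16_t bdf(uint8_t bus, uint8_t dev, uint8_t fn)
    {
        return (uint16_t)(bus << 8) | (uint16_t)((dev & 0x1f) << 3) | (fn & 0x7);
    }

    int main(void)
    {
        uint16_t id = bdf(0x03, 0x00, 0x1);   /* e.g. a NIC virtual function at 03:00.1 */
        printf("requester ID = 0x%04x (bus %u, dev %u, fn %u)\n",
               id, id >> 8, (id >> 3) & 0x1f, id & 0x7);
        /* The IOMMU indexes its context tables with this ID to find the GPM->HPM
         * page tables of the VM that owns the device or virtual function. */
        return 0;
    }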

The Xen Project

Xen is open source, was introduced in 2003, and served as the initial hypervisor for AWS. It introduced paravirtualization to handle x86 safety and performance problems before VT-x/VT-d existed, modifying ~3,000 lines (~1% of the x86-specific code) of Linux. The ABI is unchanged, so Linux applications and workloads run unmodified. Linux has supported Xen paravirtualization natively since 2011.

  • Xen Privilege Layout (pre-VT-x): Used three of x86’s four protection rings: Xen at ring 0, the guest OS at ring 1, guest applications at ring 3. The guest OS could allocate pages freely; Xen only checked that protection restrictions were not violated. Xen mapped itself into the upper 64 MiB of each VM’s address space so that entering and leaving the hypervisor does not require a TLB flush.
  • Xen Architecture:
    • Dom0 (Privileged domain): Boots first, manages regular guest VMs, runs backend hardware I/O drivers.
    • Guest domains: Run frontend virtual device drivers that call into Dom0 to complete I/O requests.
    • Driver domains: Unprivileged VMs responsible for a specific I/O device and its backend driver — isolate driver crashes from Dom0 and eliminate Dom0 as a bottleneck. Driver domains use the IOMMU (VT-d) to protect other domains from backend driver bugs and exploits.
  • Evolution: From 2006, Xen uses VT-x to run unmodified guest OSes (e.g., Windows). VT-d added later for I/O performance and security. Xen now supports lightweight VMs that mix hardware-assisted virtualization with paravirtualization to reduce memory footprint — smaller VMs mean more VMs fit per server, which matters at WSC scale.