I/O Devices
High-performance networking and storage components, specifically network interfaces (NICs) and Flash storage devices, dictate the bandwidth and latency capabilities of Warehouse-Scale Computers (WSCs). Many of their hardware and software optimizations also apply to other I/O devices attached to CPUs or directly to domain-specific accelerators (DSAs).
Network Interfaces
- A basic NIC integrating two Ethernet ports (e.g., 10 or 100 Gbps each) utilizes an Ethernet medium access control (MAC) block paired with transmit and receive queues for each port.
- The MAC executes the physical and data link layers in hardware, while higher-level protocols (such as IP and TCP) are executed in host software.
- The hardware queues decouple the MAC layer from the software stack. Each queue must hold at least a few full Ethernet packets; jumbo packets can be as large as about 9000 bytes.
- Packet Transmission:
- User-level software prepares outgoing data in a memory buffer and initiates a system call.
- The OS copies the data, applies TCP/IP headers and trailers, and writes a packet descriptor (memory address and destination port) to a circular send queue (SQ) in the host’s memory.
- The OS advances the SQ tail pointer and triggers a “doorbell” by writing to a memory-mapped register on the NIC.
- The NIC increments its SQ head pointer, programs a direct memory access (DMA) engine to pull the packet from host memory, writes to the completion queue (CQ), and issues an interrupt to the host.
- Completion does not always mean the OS can immediately free the buffer; TCP may keep it until an acknowledgment arrives, and failed transmissions may require retransmission. (A toy sketch of the send queue and doorbell follows this list.)
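The send path can be made concrete with a toy descriptor ring in C. The structure layout and the names below (tx_desc, send_queue, sq_post, the doorbell-as-plain-variable) are illustrative assumptions, not any real NIC's interface; an actual driver uses the vendor's descriptor format and register map.

```c
#include <stdint.h>
#include <stdio.h>

#define SQ_ENTRIES 256          /* ring size */

/* Hypothetical packet descriptor: where the packet lives, where it goes. */
struct tx_desc {
    uint64_t buf_addr;          /* host physical address of the packet buffer */
    uint16_t length;            /* packet length in bytes */
    uint16_t dest_port;         /* egress Ethernet port on the NIC */
};

/* Circular send queue in host memory, plus a stand-in doorbell register. */
struct send_queue {
    struct tx_desc ring[SQ_ENTRIES];
    uint32_t head;              /* advanced by the NIC as it consumes work */
    uint32_t tail;              /* advanced by the OS as it posts work */
    volatile uint32_t *doorbell;/* memory-mapped NIC register */
};

/* OS side: post one packet descriptor and ring the doorbell. */
int sq_post(struct send_queue *sq, uint64_t buf, uint16_t len, uint16_t port)
{
    uint32_t next = (sq->tail + 1) % SQ_ENTRIES;
    if (next == sq->head)
        return -1;              /* ring full: NIC has not caught up yet */
    sq->ring[sq->tail] = (struct tx_desc){ buf, len, port };
    sq->tail = next;
    *sq->doorbell = sq->tail;   /* doorbell write: "new work is available" */
    return 0;
}

int main(void)
{
    static uint32_t fake_doorbell;  /* stands in for the NIC's register */
    struct send_queue sq = { .doorbell = &fake_doorbell };

    if (sq_post(&sq, 0x1000, 1500, 0) == 0)
        printf("posted: tail=%u doorbell=%u\n", sq.tail, fake_doorbell);
    return 0;
}
```

The full-ring check is why the OS can stall when the NIC falls behind: no new descriptor can be posted until the NIC advances its head pointer past already-transmitted packets.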
- Packet Reception:
- The OS provisions a circular receive queue (RQ) with the addresses of allocated memory buffers. RQ and CQ are circular host-memory buffers with head and tail pointers.
- When a packet arrives, the NIC retrieves an address from the RQ, leverages DMA to push the packet into the OS buffer, writes a descriptor to the CQ, and interrupts the host.
- The OS processes the protocol headers, checks error codes, extracts the payload to a user-level buffer, and deallocates or recycles its internal buffer space while potentially transmitting an acknowledgment. (The receive side is sketched below.)
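The receive path mirrors the send path in reverse: the OS preposts free buffers, and the NIC fills one and reports it through a completion entry. Again, the names (recv_queue, rx_completion, handle_completion) and layouts are hypothetical, a minimal sketch rather than a real driver.

```c
#include <stdint.h>
#include <stdio.h>

#define RQ_ENTRIES 256

/* Hypothetical completion entry written by the NIC after a DMA push. */
struct rx_completion {
    uint64_t buf_addr;          /* which preposted buffer the NIC filled */
    uint16_t length;            /* bytes DMAed into it */
    uint16_t status;            /* 0 = OK, nonzero = NIC-reported error */
};

/* Receive queue: addresses of free OS buffers the NIC may write into. */
struct recv_queue {
    uint64_t buf_ring[RQ_ENTRIES];
    uint32_t tail;              /* OS posts fresh buffers here */
};

void rq_post_buffer(struct recv_queue *rq, uint64_t buf_addr)
{
    rq->buf_ring[rq->tail % RQ_ENTRIES] = buf_addr;
    rq->tail++;
}

/* OS side: handle one completion (normally driven by an interrupt). */
void handle_completion(struct recv_queue *rq, const struct rx_completion *c)
{
    if (c->status != 0) {
        rq_post_buffer(rq, c->buf_addr);    /* bad frame: just recycle */
        return;
    }
    /* Real code would parse headers and copy the payload to user space;
     * here we only report, then recycle the buffer back to the NIC. */
    printf("received %u bytes at 0x%llx\n",
           (unsigned)c->length, (unsigned long long)c->buf_addr);
    rq_post_buffer(rq, c->buf_addr);
}

int main(void)
{
    struct recv_queue rq = {0};
    rq_post_buffer(&rq, 0x2000);            /* prepost one free buffer */

    /* Pretend the NIC DMAed a 64-byte frame and wrote a completion. */
    struct rx_completion c = { .buf_addr = 0x2000, .length = 64, .status = 0 };
    handle_completion(&rq, &c);
    return 0;
}
```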
NIC Optimizations
- CPU Overhead Mitigation:
- Stateless offloads accelerate networking by generating and verifying error detection codes (for Ethernet, IP, TCP, and UDP) directly in the NIC hardware.
- Large segment offloads divide large messages into network-sized packets at the sender and reassemble them at the receiver, drastically reducing the frequency of OS protocol execution (e.g., running the stack once instead of roughly 700 times for a 1 MB payload carried in ~1500-byte Ethernet packets).
- Large segment offload requires larger NIC buffers and must tolerate packets that arrive reordered or are occasionally dropped inside the WSC fabric.
- Interrupt Overhead Mitigation:
- A 100 Gbps NIC receiving minimum-sized packets can generate on the order of 100 million interrupts per second.
- Interrupt moderation throttles the interrupt rate by grouping multiple packet arrivals into a single interrupt signal.
- Receive-side scaling (interrupt steering) hashes each packet's connection fields (such as IP addresses and port numbers) to distribute interrupts across multiple CPU cores, avoiding single-core bottlenecks and minimizing synchronization delays.
- Targeted interrupts direct signals specifically to the core that recently executed send system calls for the active connection, preventing unnecessary context switches.
- Polling bypasses interrupts entirely by having the OS or application continuously check the NIC for incoming data.
- Some OSes use a hybrid strategy: an initial interrupt wakes the networking stack, which then disables further interrupts and polls briefly, amortizing the interrupt cost across many packets (sketched below).
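A minimal sketch of that hybrid strategy, loosely in the spirit of Linux's NAPI. The device hooks here (nic_has_packet, nic_disable_interrupts, and so on) are hypothetical stand-ins for real register accesses.

```c
#include <stdbool.h>
#include <stdio.h>

#define POLL_BUDGET 64   /* max packets drained per poll pass */

/* Hypothetical NIC hooks; a real driver would touch device registers. */
static int  pending = 200;                 /* packets waiting in the NIC */
static bool nic_has_packet(void)           { return pending > 0; }
static void nic_consume_packet(void)       { pending--; }
static void nic_disable_interrupts(void)   { puts("irq off"); }
static void nic_enable_interrupts(void)    { puts("irq on"); }

/* Interrupt handler: pay the interrupt cost once, then switch to polling. */
void nic_interrupt(void)
{
    nic_disable_interrupts();
    int drained;
    do {
        drained = 0;
        while (drained < POLL_BUDGET && nic_has_packet()) {
            nic_consume_packet();          /* protocol processing goes here */
            drained++;
        }
        /* Keep polling while traffic keeps the budget full; one interrupt
         * is amortized across every packet drained in these passes. */
    } while (drained == POLL_BUDGET);
    nic_enable_interrupts();               /* quiet again: back to interrupts */
}

int main(void)
{
    nic_interrupt();
    printf("packets left: %d\n", pending);
    return 0;
}
```

The budget bounds how long the handler monopolizes the core; once traffic stops filling the budget, the driver re-enables interrupts rather than burning cycles polling an idle device.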
- Memory and Caching Optimizations:
- Unoptimized network processing consumes memory bandwidth several times higher than the actual network link speed, because each packet is DMAed into an OS buffer, read for protocol processing, and copied again into application memory.
- Data Direct I/O (DDIO) injects incoming packets straight into the CPU’s last-level cache to reduce main memory traffic.
- Header splitting separates packet metadata from the payload, improving cache locality during OS header processing.
- Zero-copy networking uses preposted buffer addresses to DMA payloads directly from the NIC into application memory, eliminating OS-level copies.
- SmartNICs and RDMA:
- InfiniBand (IB) and Remote Direct Memory Access over Converged Ethernet (RoCE) offload the entire network protocol stack to the NIC, enabling direct hardware-to-hardware transfers.
- SmartNICs deploy programmable embedded cores to offload complex operations, such as remote storage protocols, hypervisor tasks, and virtual private network encapsulation and security.
- Public-cloud NICs often include hardware for network virtualization, adding and removing private-network headers while enforcing security rules in a shared WSC.
NVMe Flash Drives
- Nonvolatile Memory Express (NVMe) drives arrange vertically stacked Flash memory chips (often 8 or 16 chips per stack) into parallel channels to maximize throughput. Each channel handles one access at a time, but the controller overlaps accesses across channels.
- Media Constraints:
- Read operations retrieve pages (typically 4 to 16 KB, plus spare error-coding storage) in tens of microseconds.
- Write operations are significantly slower (hundreds of microseconds to a few milliseconds per page), and pages must be written sequentially within larger structural blocks of hundreds of pages.
- Blocks must be entirely erased (taking several milliseconds) before any enclosed page can be rewritten.
- Flash cells possess limited endurance and degrade to the point of failure after roughly 1,000 to 100,000 erase cycles, depending on the cell type.
- Writes are slow because Flash stores state as charge on a floating gate, commonly uses multilevel cells for density, and is organized as NAND arrays.
- Flash Translation Layer (FTL):
- Implemented as firmware on the NVMe device’s embedded cores, the FTL translates host logical block addresses (LBAs) into physical Flash coordinates (channel, package, block, and page).
- The LBA map is accessed on every read to find the physical page. On writes, the controller writes to the current write point, updates the map, and updates valid-page counts for the affected blocks.
- The FTL map and block info tables track erase status, erase count, block health, and valid-page counts. They are maintained in onboard DRAM and protected by a supercapacitor during power loss.
- The controller performs wear-leveling by directing new writes to Flash blocks that currently have low erase counts.
- The FTL also schedules pending accesses and may cache frequently accessed data in device DRAM. (A toy FTL mapping sketch follows this list.)
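A toy FTL illustrating the two invariants above: every read consults the map, and every write lands at the current write point and remaps the LBA. The flat array standing in for real (channel, package, block, page) coordinates, and the names ftl_read/ftl_write, are simplifying assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_LBAS    1024   /* toy drive: 1024 logical pages */
#define INVALID     0xFFFFFFFFu

/* Toy FTL state: one flat map instead of real (channel, package, block,
 * page) coordinates, and a single sequential write point. */
static uint32_t lba_map[NUM_LBAS];     /* LBA -> physical page */
static uint32_t write_point = 0;       /* next free physical page */

/* Read: every read consults the map to find the physical page. */
uint32_t ftl_read(uint32_t lba)
{
    return lba_map[lba];               /* INVALID means never written */
}

/* Write: Flash pages cannot be overwritten in place, so the FTL writes
 * the new data at the write point and remaps the LBA; the old physical
 * page becomes stale and awaits garbage collection. */
void ftl_write(uint32_t lba)
{
    uint32_t old = lba_map[lba];
    lba_map[lba] = write_point++;      /* program the next free page */
    if (old != INVALID)
        printf("page %u is now stale\n", old);
}

int main(void)
{
    for (int i = 0; i < NUM_LBAS; i++) lba_map[i] = INVALID;

    ftl_write(7);                      /* first write of LBA 7 */
    ftl_write(7);                      /* overwrite: remap; old page stale */
    printf("LBA 7 -> physical page %u\n", ftl_read(7));
    return 0;
}
```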
- Garbage Collection:
- The FTL continuously scans the block info table to reclaim Flash blocks containing few valid pages.
- Any valid pages are copied to the current write point, and the old block is queued for erasure.
- This compaction process induces write amplification, which accelerates media wear-out and can stall subsequent read operations (see the victim-selection sketch below).
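A sketch of victim selection and the write amplification it implies, under the simplifying assumption (hypothetical here) that the block-info table is just a valid-page counter per block.

```c
#include <stdio.h>

#define NUM_BLOCKS      8
#define PAGES_PER_BLOCK 128

/* Toy block-info table: how many still-valid pages each block holds. */
static int valid_pages[NUM_BLOCKS] = { 120, 14, 97, 3, 88, 40, 126, 66 };

/* Victim selection: reclaim the block with the fewest valid pages, since
 * those pages must be copied to the write point before the erase. */
int pick_victim(void)
{
    int victim = 0;
    for (int b = 1; b < NUM_BLOCKS; b++)
        if (valid_pages[b] < valid_pages[victim])
            victim = b;
    return victim;
}

int main(void)
{
    int v = pick_victim();
    int copied = valid_pages[v];
    int freed  = PAGES_PER_BLOCK - copied;

    /* Write amplification for this reclaim: to gain `freed` pages of new
     * write capacity, the device also rewrote `copied` valid pages. */
    double wa = (double)(freed + copied) / (double)freed;

    printf("victim block %d: copy %d pages, free %d pages, WA = %.2f\n",
           v, copied, freed, wa);
    return 0;
}
```

Picking the block with the fewest valid pages minimizes the copying tax; reclaiming a nearly full block would force the FTL to rewrite almost everything it frees.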
- Interface and Performance:
- NVMe drives communicate with the host CPU over PCIe using paired command and completion queues, mirroring the architecture of high-performance NICs.
- NVMe commands have variable latency and may complete out of order because reads can be delayed behind writes, garbage collection, or activity in the same package or channel (a toy queue-pair sketch follows this list).
- A typical NVMe device provides a few to tens of TB of capacity, achieves unloaded read latencies on the order of 100 microseconds, processes hundreds of thousands to a million IOPS, and consumes roughly 10 to 25 W.
- A typical disk provides roughly 2 to 20 TB, has unloaded read latencies of several milliseconds, performs on the order of 100 to 200 IOPS, and consumes roughly 5 to 10 W.
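A toy model of paired NVMe queues showing out-of-order completion matched by command ID. Real NVMe submission and completion entries carry far more (opcodes, PRP/SGL pointers, status fields, phase bits); the structures below are stripped to the minimum and are not the spec's layout.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define Q_DEPTH 16

/* Simplified command and completion entries. */
struct nvme_cmd { uint16_t cid; uint64_t lba; bool is_write; };
struct nvme_cpl { uint16_t cid; };        /* says which command finished */

static struct nvme_cmd sq[Q_DEPTH];       /* submission queue (host writes) */
static struct nvme_cpl cq[Q_DEPTH];       /* completion queue (device writes) */
static bool outstanding[Q_DEPTH];         /* host table of in-flight cids */

void submit(uint16_t slot, uint64_t lba, bool is_write)
{
    sq[slot] = (struct nvme_cmd){ .cid = slot, .lba = lba, .is_write = is_write };
    outstanding[slot] = true;             /* then ring the SQ doorbell (omitted) */
}

/* Completions can arrive in any order: a read stuck behind garbage
 * collection finishes after a later, luckier read. The cid tells the
 * host which command each completion belongs to. */
void complete(uint16_t cid)
{
    cq[cid] = (struct nvme_cpl){ .cid = cid };
    outstanding[cid] = false;
    printf("command %u done (lba %llu)\n",
           (unsigned)cid, (unsigned long long)sq[cid].lba);
}

int main(void)
{
    submit(0, 100, false);                /* read LBA 100 */
    submit(1, 200, false);                /* read LBA 200 */
    complete(1);                          /* out of order: cid 1 first */
    complete(0);
    return 0;
}
```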
I/O Device Virtualization
- Virtualizing I/O by forcing the hypervisor to intercept and emulate all guest VM requests creates severe latency and throughput bottlenecks.
- Modern high-performance I/O devices provide mechanisms allowing VMs to access them directly and safely, bypassing the hypervisor entirely.
- Single-Root I/O Virtualization (SR-IOV):
- Extends the PCIe standard, enabling a single I/O device to expose multiple independent interface functions.
- The Physical Function (PF) retains full device management and configuration capabilities and is controlled by the hypervisor.
- Multiple Virtual Functions (VFs) are mapped into the physical address spaces of guest VMs. VFs expose operational registers (send/receive, read/write) but block access to hardware management controls.
- IOMMU Integration:
- The IOMMU translates DMA addresses generated by VFs and remaps hardware interrupts directly into the corresponding guest VMs (a toy translation sketch follows this list).
- Queue Pairs for Isolation:
- I/O devices allocate isolated queue pairs to discrete applications running within a single guest VM.
- This queue isolation enables the OS to enforce bandwidth throttling per application and allows for the direct delivery of incoming payloads into an application’s virtual address space.
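A conceptual sketch of the IOMMU check that makes direct VF access safe: each VF's DMA goes through a per-VM translation table, and unmapped addresses fault instead of reaching host memory. The single-level table and the names here are illustrative only; real IOMMUs use multi-level page tables.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u
#define VF_PAGES  4            /* tiny per-VF mapping for illustration */

/* Toy per-VF IOMMU table: guest-physical page -> host-physical page.
 * Zero means "not mapped". */
static uint64_t vf_table[VF_PAGES] = { 0x80000, 0x81000, 0, 0x95000 };

/* Translate a VF's DMA address, or fail if the VM never mapped that page.
 * This check is what keeps one guest's DMA out of another guest's memory. */
int iommu_translate(uint64_t dma_addr, uint64_t *host_addr)
{
    uint64_t page = dma_addr / PAGE_SIZE;
    uint64_t off  = dma_addr % PAGE_SIZE;
    if (page >= VF_PAGES || vf_table[page] == 0)
        return -1;                        /* IOMMU fault: DMA blocked */
    *host_addr = vf_table[page] + off;
    return 0;
}

int main(void)
{
    uint64_t host;
    if (iommu_translate(0x1010, &host) == 0)     /* page 1, offset 0x10 */
        printf("DMA 0x1010 -> host 0x%llx\n", (unsigned long long)host);
    if (iommu_translate(0x2000, &host) != 0)     /* page 2 is unmapped */
        printf("DMA 0x2000 faulted (unmapped)\n");
    return 0;
}
```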