I/O Devices
High-performance networking and storage components, specifically network interfaces (NICs) and Flash storage devices, dictate the bandwidth and latency capabilities of Warehouse-Scale Computers (WSCs). Many of their hardware and software optimizations also apply to other I/O devices attached to CPUs or directly to domain-specific accelerators (DSAs).
Network Interfaces
- A basic NIC integrating two Ethernet ports (e.g., 10 or 100 Gbps each) utilizes an Ethernet medium access control (MAC) block paired with transmit and receive queues for each port.
- The MAC executes the physical and data link layers in hardware, while higher-level protocols (such as IP and TCP) are executed in host software.
- The hardware queues decouple the MAC layer from the software stack. Each queue must hold at least a few full Ethernet packets; jumbo packets can be as large as about 9000 bytes.
- Packet Transmission:
- User-level software prepares outgoing data in a memory buffer and initiates a system call.
- The OS copies the data, applies TCP/IP headers and trailers, and writes a packet descriptor (memory address and destination port) to a circular send queue (SQ) in the host’s memory.
- The OS advances the SQ tail pointer and triggers a “doorbell” by writing to a memory-mapped register on the NIC.
- The NIC increments its SQ head pointer, programs a direct memory access (DMA) engine to pull the packet from host memory, writes to the completion queue (CQ), and issues an interrupt to the host.
- Completion does not always mean the OS can immediately free the buffer; TCP may keep it until an acknowledgment arrives, and failed transmissions may require retransmission. (A toy sketch of the send queue and doorbell follows this list.)
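The send path can be made concrete with a toy descriptor ring in C. The structure layout and the names below (tx_desc, send_queue, sq_post, the doorbell-as-plain-variable) are illustrative assumptions, not any real NIC's interface; an actual driver uses the vendor's descriptor format and register map.

```c
#include <stdint.h>
#include <stdio.h>

#define SQ_ENTRIES 256          /* ring size */

/* Hypothetical packet descriptor: where the packet lives, where it goes. */
struct tx_desc {
    uint64_t buf_addr;          /* host physical address of the packet buffer */
    uint16_t length;            /* packet length in bytes */
    uint16_t dest_port;         /* egress Ethernet port on the NIC */
};

/* Circular send queue in host memory, plus a stand-in doorbell register. */
struct send_queue {
    struct tx_desc ring[SQ_ENTRIES];
    uint32_t head;              /* advanced by the NIC as it consumes work */
    uint32_t tail;              /* advanced by the OS as it posts work */
    volatile uint32_t *doorbell;/* memory-mapped NIC register */
};

/* OS side: post one packet descriptor and ring the doorbell. */
int sq_post(struct send_queue *sq, uint64_t buf, uint16_t len, uint16_t port)
{
    uint32_t next = (sq->tail + 1) % SQ_ENTRIES;
    if (next == sq->head)
        return -1;              /* ring full: NIC has not caught up yet */
    sq->ring[sq->tail] = (struct tx_desc){ buf, len, port };
    sq->tail = next;
    *sq->doorbell = sq->tail;   /* doorbell write: "new work is available" */
    return 0;
}

int main(void)
{
    static uint32_t fake_doorbell;  /* stands in for the NIC's register */
    struct send_queue sq = { .doorbell = &fake_doorbell };

    if (sq_post(&sq, 0x1000, 1500, 0) == 0)
        printf("posted: tail=%u doorbell=%u\n", sq.tail, fake_doorbell);
    return 0;
}
```

The full-ring check is why the OS can stall when the NIC falls behind: no new descriptor can be posted until the NIC advances its head pointer past already-transmitted packets.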
- Packet Reception:
- The OS provisions a circular receive queue (RQ) with the addresses of allocated memory buffers. RQ and CQ are circular host-memory buffers with head and tail pointers.
- When a packet arrives, the NIC retrieves an address from the RQ, leverages DMA to push the packet into the OS buffer, writes a descriptor to the CQ, and interrupts the host.
- The OS processes the protocol headers, checks error codes, extracts the payload to a user-level buffer, and deallocates or recycles its internal buffer space while potentially transmitting an acknowledgment. (The receive side is sketched below.)
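The receive path mirrors the send path in reverse: the OS preposts free buffers, and the NIC fills one and reports it through a completion entry. Again, the names (recv_queue, rx_completion, handle_completion) and layouts are hypothetical, a minimal sketch rather than a real driver.

```c
#include <stdint.h>
#include <stdio.h>

#define RQ_ENTRIES 256

/* Hypothetical completion entry written by the NIC after a DMA push. */
struct rx_completion {
    uint64_t buf_addr;          /* which preposted buffer the NIC filled */
    uint16_t length;            /* bytes DMAed into it */
    uint16_t status;            /* 0 = OK, nonzero = NIC-reported error */
};

/* Receive queue: addresses of free OS buffers the NIC may write into. */
struct recv_queue {
    uint64_t buf_ring[RQ_ENTRIES];
    uint32_t tail;              /* OS posts fresh buffers here */
};

void rq_post_buffer(struct recv_queue *rq, uint64_t buf_addr)
{
    rq->buf_ring[rq->tail % RQ_ENTRIES] = buf_addr;
    rq->tail++;
}

/* OS side: handle one completion (normally driven by an interrupt). */
void handle_completion(struct recv_queue *rq, const struct rx_completion *c)
{
    if (c->status != 0) {
        rq_post_buffer(rq, c->buf_addr);    /* bad frame: just recycle */
        return;
    }
    /* Real code would parse headers and copy the payload to user space;
     * here we only report, then recycle the buffer back to the NIC. */
    printf("received %u bytes at 0x%llx\n",
           (unsigned)c->length, (unsigned long long)c->buf_addr);
    rq_post_buffer(rq, c->buf_addr);
}

int main(void)
{
    struct recv_queue rq = {0};
    rq_post_buffer(&rq, 0x2000);            /* prepost one free buffer */

    /* Pretend the NIC DMAed a 64-byte frame and wrote a completion. */
    struct rx_completion c = { .buf_addr = 0x2000, .length = 64, .status = 0 };
    handle_completion(&rq, &c);
    return 0;
}
```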
NIC Optimizations
- CPU Overhead Mitigation:
- Stateless offloads accelerate networking by generating and verifying error detection codes (for Ethernet, IP, TCP, and UDP) directly in the NIC hardware.
- Large segment offloads divide large messages into network-sized packets at the sender and reassemble them at the receiver, drastically reducing the frequency of OS protocol execution (e.g., running the stack once instead of roughly 700 times for a 1 MB payload carried in ~1500-byte Ethernet packets).
- Large segment offload requires larger NIC buffers and must tolerate packets that arrive reordered or are occasionally dropped inside the WSC fabric.
- Interrupt Overhead Mitigation:
- A 100 Gbps NIC receiving minimum-sized packets can generate on the order of 100 million interrupts per second.
- Interrupt moderation throttles the interrupt rate by grouping multiple packet arrivals into a single interrupt signal.
- Receive-side scaling (interrupt steering) hashes each packet's connection fields (such as IP addresses and port numbers) to distribute interrupts across multiple CPU cores, avoiding single-core bottlenecks and minimizing synchronization delays.
- Targeted interrupts direct signals specifically to the core that recently executed send system calls for the active connection, preventing unnecessary context switches.
- Polling bypasses interrupts entirely by having the OS or application continuously check the NIC for incoming data.
- Some OSes use a hybrid strategy: an initial interrupt wakes the networking stack, which then disables further interrupts and polls briefly, amortizing the interrupt cost across many packets (sketched below).
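A minimal sketch of that hybrid strategy, loosely in the spirit of Linux's NAPI. The device hooks here (nic_has_packet, nic_disable_interrupts, and so on) are hypothetical stand-ins for real register accesses.

```c
#include <stdbool.h>
#include <stdio.h>

#define POLL_BUDGET 64   /* max packets drained per poll pass */

/* Hypothetical NIC hooks; a real driver would touch device registers. */
static int  pending = 200;                 /* packets waiting in the NIC */
static bool nic_has_packet(void)           { return pending > 0; }
static void nic_consume_packet(void)       { pending--; }
static void nic_disable_interrupts(void)   { puts("irq off"); }
static void nic_enable_interrupts(void)    { puts("irq on"); }

/* Interrupt handler: pay the interrupt cost once, then switch to polling. */
void nic_interrupt(void)
{
    nic_disable_interrupts();
    int drained;
    do {
        drained = 0;
        while (drained < POLL_BUDGET && nic_has_packet()) {
            nic_consume_packet();          /* protocol processing goes here */
            drained++;
        }
        /* Keep polling while traffic keeps the budget full; one interrupt
         * is amortized across every packet drained in these passes. */
    } while (drained == POLL_BUDGET);
    nic_enable_interrupts();               /* quiet again: back to interrupts */
}

int main(void)
{
    nic_interrupt();
    printf("packets left: %d\n", pending);
    return 0;
}
```

The budget bounds how long the handler monopolizes the core; once traffic stops filling the budget, the driver re-enables interrupts rather than burning cycles polling an idle device.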
- Memory and Caching Optimizations:
- Unoptimized network processing consumes memory bandwidth several times higher than the actual network link speed, because each packet is DMAed into an OS buffer, read for protocol processing, and copied again into application memory.
- Data Direct I/O (DDIO) injects incoming packets straight into the CPU’s last-level cache to reduce main memory traffic.
- Header splitting separates packet metadata from the payload, improving cache locality during OS header processing.
- Zero-copy networking uses preposted buffer addresses to DMA payloads directly from the NIC into application memory, eliminating OS-level copies.
- SmartNICs and RDMA:
- InfiniBand (IB) and Remote Direct Memory Access over Converged Ethernet (RoCE) offload the entire network protocol stack to the NIC, enabling direct hardware-to-hardware transfers.
- SmartNICs deploy programmable embedded cores to offload complex operations, such as remote storage protocols, hypervisor tasks, and virtual private network encapsulation and security.
- Public-cloud NICs often include hardware for network virtualization, adding and removing private-network headers while enforcing security rules in a shared WSC.
NVMe Flash Drives
- Nonvolatile Memory Express (NVMe) drives arrange vertically stacked Flash memory chips (often 8 or 16 chips per stack) into parallel channels to maximize throughput. Each channel handles one access at a time, but the controller overlaps accesses across channels.
- Media Constraints:
- Read operations retrieve pages (typically 4 to 16 KB, plus spare error-coding storage) in tens of microseconds.
- Write operations are significantly slower (hundreds of microseconds to a few milliseconds per page), and pages must be written sequentially within larger structural blocks of hundreds of pages.
- Blocks must be entirely erased (taking several milliseconds) before any enclosed page can be rewritten.
- Flash cells possess limited endurance and degrade to the point of failure after roughly 1,000 to 100,000 erase cycles, depending on the cell type.
- Writes are slow because Flash stores state as charge on a floating gate, commonly uses multilevel cells for density, and is organized as NAND arrays.
- Flash Translation Layer (FTL):
- Implemented as firmware on the NVMe device’s embedded cores, the FTL translates host logical block addresses (LBAs) into physical Flash coordinates (channel, package, block, and page).
- The LBA map is accessed on every read to find the physical page. On writes, the controller writes to the current write point, updates the map, and updates valid-page counts for the affected blocks.
- The FTL map and block info tables track erase status, erase count, block health, and valid-page counts. They are maintained in onboard DRAM and protected by a supercapacitor during power loss.
- The controller performs wear-leveling by directing new writes to Flash blocks that currently have low erase counts.
- The FTL also schedules pending accesses and may cache frequently accessed data in device DRAM. (A toy FTL mapping sketch follows this list.)
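A toy FTL illustrating the two invariants above: every read consults the map, and every write lands at the current write point and remaps the LBA. The flat array standing in for real (channel, package, block, page) coordinates, and the names ftl_read/ftl_write, are simplifying assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_LBAS    1024   /* toy drive: 1024 logical pages */
#define INVALID     0xFFFFFFFFu

/* Toy FTL state: one flat map instead of real (channel, package, block,
 * page) coordinates, and a single sequential write point. */
static uint32_t lba_map[NUM_LBAS];     /* LBA -> physical page */
static uint32_t write_point = 0;       /* next free physical page */

/* Read: every read consults the map to find the physical page. */
uint32_t ftl_read(uint32_t lba)
{
    return lba_map[lba];               /* INVALID means never written */
}

/* Write: Flash pages cannot be overwritten in place, so the FTL writes
 * the new data at the write point and remaps the LBA; the old physical
 * page becomes stale and awaits garbage collection. */
void ftl_write(uint32_t lba)
{
    uint32_t old = lba_map[lba];
    lba_map[lba] = write_point++;      /* program the next free page */
    if (old != INVALID)
        printf("page %u is now stale\n", old);
}

int main(void)
{
    for (int i = 0; i < NUM_LBAS; i++) lba_map[i] = INVALID;

    ftl_write(7);                      /* first write of LBA 7 */
    ftl_write(7);                      /* overwrite: remap; old page stale */
    printf("LBA 7 -> physical page %u\n", ftl_read(7));
    return 0;
}
```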
- Garbage Collection:
- The FTL continuously scans the block info table to reclaim Flash blocks containing few valid pages.
- Any valid pages are copied to the current write point, and the old block is queued for erasure.
- This compaction process induces write amplification, which accelerates media wear-out and can stall subsequent read operations (see the victim-selection sketch below).
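A sketch of victim selection and the write amplification it implies, under the simplifying assumption (hypothetical here) that the block-info table is just a valid-page counter per block.

```c
#include <stdio.h>

#define NUM_BLOCKS      8
#define PAGES_PER_BLOCK 128

/* Toy block-info table: how many still-valid pages each block holds. */
static int valid_pages[NUM_BLOCKS] = { 120, 14, 97, 3, 88, 40, 126, 66 };

/* Victim selection: reclaim the block with the fewest valid pages, since
 * those pages must be copied to the write point before the erase. */
int pick_victim(void)
{
    int victim = 0;
    for (int b = 1; b < NUM_BLOCKS; b++)
        if (valid_pages[b] < valid_pages[victim])
            victim = b;
    return victim;
}

int main(void)
{
    int v = pick_victim();
    int copied = valid_pages[v];
    int freed  = PAGES_PER_BLOCK - copied;

    /* Write amplification for this reclaim: to gain `freed` pages of new
     * write capacity, the device also rewrote `copied` valid pages. */
    double wa = (double)(freed + copied) / (double)freed;

    printf("victim block %d: copy %d pages, free %d pages, WA = %.2f\n",
           v, copied, freed, wa);
    return 0;
}
```

Picking the block with the fewest valid pages minimizes the copying tax; reclaiming a nearly full block would force the FTL to rewrite almost everything it frees.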
- Interface and Performance:
- NVMe drives communicate with the host CPU over PCIe using paired command and completion queues, mirroring the architecture of high-performance NICs.
- NVMe commands have variable latency and may complete out of order because reads can be delayed behind writes, garbage collection, or activity in the same package or channel (a toy queue-pair sketch follows this list).
- A typical NVMe device provides a few to tens of TB of capacity, achieves unloaded read latencies on the order of 100 microseconds, processes hundreds of thousands to a million IOPS, and consumes roughly 10 to 25 W.
- A typical disk provides roughly 2 to 20 TB, has unloaded read latencies of several milliseconds, performs on the order of 100 to 200 IOPS, and consumes roughly 5 to 10 W.
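A toy model of paired NVMe queues showing out-of-order completion matched by command ID. Real NVMe submission and completion entries carry far more (opcodes, PRP/SGL pointers, status fields, phase bits); the structures below are stripped to the minimum and are not the spec's layout.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define Q_DEPTH 16

/* Simplified command and completion entries. */
struct nvme_cmd { uint16_t cid; uint64_t lba; bool is_write; };
struct nvme_cpl { uint16_t cid; };        /* says which command finished */

static struct nvme_cmd sq[Q_DEPTH];       /* submission queue (host writes) */
static struct nvme_cpl cq[Q_DEPTH];       /* completion queue (device writes) */
static bool outstanding[Q_DEPTH];         /* host table of in-flight cids */

void submit(uint16_t slot, uint64_t lba, bool is_write)
{
    sq[slot] = (struct nvme_cmd){ .cid = slot, .lba = lba, .is_write = is_write };
    outstanding[slot] = true;             /* then ring the SQ doorbell (omitted) */
}

/* Completions can arrive in any order: a read stuck behind garbage
 * collection finishes after a later, luckier read. The cid tells the
 * host which command each completion belongs to. */
void complete(uint16_t cid)
{
    cq[cid] = (struct nvme_cpl){ .cid = cid };
    outstanding[cid] = false;
    printf("command %u done (lba %llu)\n",
           (unsigned)cid, (unsigned long long)sq[cid].lba);
}

int main(void)
{
    submit(0, 100, false);                /* read LBA 100 */
    submit(1, 200, false);                /* read LBA 200 */
    complete(1);                          /* out of order: cid 1 first */
    complete(0);
    return 0;
}
```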
I/O Device Virtualization
- Virtualizing I/O by forcing the hypervisor to intercept and emulate all guest VM requests creates severe latency and throughput bottlenecks.
- Modern high-performance I/O devices provide mechanisms allowing VMs to access them directly and safely, bypassing the hypervisor entirely.
- Single-Root I/O Virtualization (SR-IOV):
- Extends the PCIe standard, enabling a single I/O device to expose multiple independent interface functions.
- The Physical Function (PF) retains full device management and configuration capabilities and is controlled by the hypervisor.
- Multiple Virtual Functions (VFs) are mapped into the physical address spaces of guest VMs. VFs expose operational registers (send/receive, read/write) but block access to hardware management controls.
- IOMMU Integration:
- The IOMMU translates DMA addresses generated by VFs and remaps hardware interrupts directly into the corresponding guest VMs (a toy translation sketch follows this list).
- Queue Pairs for Isolation:
- I/O devices allocate isolated queue pairs to discrete applications running within a single guest VM.
- This queue isolation enables the OS to enforce bandwidth throttling per application and allows for the direct delivery of incoming payloads into an application’s virtual address space.
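A conceptual sketch of the IOMMU check that makes direct VF access safe: each VF's DMA goes through a per-VM translation table, and unmapped addresses fault instead of reaching host memory. The single-level table and the names here are illustrative only; real IOMMUs use multi-level page tables.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u
#define VF_PAGES  4            /* tiny per-VF mapping for illustration */

/* Toy per-VF IOMMU table: guest-physical page -> host-physical page.
 * Zero means "not mapped". */
static uint64_t vf_table[VF_PAGES] = { 0x80000, 0x81000, 0, 0x95000 };

/* Translate a VF's DMA address, or fail if the VM never mapped that page.
 * This check is what keeps one guest's DMA out of another guest's memory. */
int iommu_translate(uint64_t dma_addr, uint64_t *host_addr)
{
    uint64_t page = dma_addr / PAGE_SIZE;
    uint64_t off  = dma_addr % PAGE_SIZE;
    if (page >= VF_PAGES || vf_table[page] == 0)
        return -1;                        /* IOMMU fault: DMA blocked */
    *host_addr = vf_table[page] + off;
    return 0;
}

int main(void)
{
    uint64_t host;
    if (iommu_translate(0x1010, &host) == 0)     /* page 1, offset 0x10 */
        printf("DMA 0x1010 -> host 0x%llx\n", (unsigned long long)host);
    if (iommu_translate(0x2000, &host) != 0)     /* page 2 is unmapped */
        printf("DMA 0x2000 faulted (unmapped)\n");
    return 0;
}
```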