Computer Architecture of Warehouse-Scale Computers

  • Warehouse-scale computers (WSCs) integrate 50,000 to 100,000 compute servers and storage blocks interconnected via a hierarchical network.
  • Hardware is physically structured in racks containing IT equipment, AC to DC power converters, and backup batteries.
  • Racks measure 19 inches wide by 48 inches deep, with heights ranging from 70 to 84 inches; vertical space is measured in rack units (RU), where 1 RU = 1.75 inches (4.445 cm).
  • The quantity of servers per rack is strictly bounded by available power (typically 10 to 30 kW per rack), cooling limits, airflow constraints, and available network fabric bandwidth.
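
The power-bound sizing above can be checked with a back-of-the-envelope calculation. All per-server figures below are illustrative assumptions, not vendor specifications:

```python
# Sketch: estimating how many servers fit in a rack under a power cap.
# Per-server draw and rack budget are assumed values for illustration.

RACK_POWER_BUDGET_W = 20_000   # mid-range of the 10-30 kW envelope above
SERVER_POWER_W = 550           # assumed draw for a 1RU 2-socket server
RACK_HEIGHT_RU = 42            # a common rack height
SERVER_HEIGHT_RU = 1

by_power = RACK_POWER_BUDGET_W // SERVER_POWER_W
by_space = RACK_HEIGHT_RU // SERVER_HEIGHT_RU
servers_per_rack = min(by_power, by_space)
print(f"power-limited: {by_power}, space-limited: {by_space}, "
      f"deployable: {servers_per_rack}")
```

With these assumed numbers the rack is power-limited (36 servers) well before it runs out of vertical space, which is why power and cooling, not physical volume, bound rack density.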

The extreme density of these physical racks necessitates specialized and varied compute hardware to efficiently process a wide array of distributed workloads.

WSC Compute

  • WSCs deploy heterogeneous server configurations to optimize hardware capabilities and cost metrics for specific workload profiles.
  • Standard Server Configurations:
    • 1RU 2-socket servers provide balanced compute and memory capacity for general-purpose applications and virtual machine (VM) slicing.
    • Multi-node architectures (e.g., Meta Yosemite) pack multiple 1-socket server blades into a shared chassis to increase density.
    • Multi-node blades share a baseboard management controller (BMC) for secure boot, voltage regulation, and sensor management, as well as a network interface card (NIC).
    • A shared NIC utilizes a PCIe switch to expose physically independent interfaces to each server blade, providing cost-effective networking for scale-out microservices and webservers.
  • Domain-Specific Accelerators (DSAs):
    • Machine learning (ML) tasks utilize specialized servers hosting 4 to 16 high-end DSA chips, with each chip drawing several hundred watts (an NVIDIA H100 draws up to 700 W).
    • DSA nodes (e.g., NVIDIA DGX H100) interface with three distinct networks: the standard WSC network, a high-bandwidth internal interconnect linking local DSAs (e.g., NVLink at 900 GB/s per GPU on the H100), and an interserver interconnect for distributed training.
  • CPU and Memory Scaling Trends:
    • Major cloud providers increasingly deploy custom Arm-based CPUs featuring out-of-order execution, achieving competitive single-thread performance against x86 architectures.
    • Modern CPU sockets utilize copackaged chiplets to integrate 64 or more cores, driving socket power consumption to several hundred watts (400 W or more) and requiring massive memory bandwidth.
    • Compute Express Link (CXL) enables cache-coherent memory disaggregation over PCIe, allowing CPUs to access expanded memory pools or share memory across servers.
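
The heterogeneity described above amounts to a placement policy: match each workload profile to the server class built for it. A toy sketch of that mapping, with hypothetical class names and specs, might look like:

```python
# Sketch: matching workload profiles to the server classes described above.
# Class names and the specs in SERVER_CLASSES are hypothetical illustrations.

SERVER_CLASSES = {
    "1ru_2socket": {"cores": 128, "dram_gb": 1024, "accelerators": 0},
    "multi_node":  {"cores": 64,  "dram_gb": 256,  "accelerators": 0},
    "dsa_node":    {"cores": 112, "dram_gb": 2048, "accelerators": 8},
}

def pick_class(needs_accelerators: bool, vm_slicing: bool) -> str:
    """Toy placement policy mirroring the notes: DSA nodes for ML,
    2-socket servers for VM slicing, dense multi-node chassis for
    scale-out microservices and webservers."""
    if needs_accelerators:
        return "dsa_node"
    if vm_slicing:
        return "1ru_2socket"
    return "multi_node"

print(pick_class(needs_accelerators=True, vm_slicing=False))   # dsa_node
print(pick_class(needs_accelerators=False, vm_slicing=False))  # multi_node
```

Real schedulers weigh many more dimensions (memory bandwidth, NIC capacity, cost per query), but the principle is the same: no single SKU is optimal for every workload.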

As compute nodes scale in core density and specialization, the persistence and retrieval of their data require equally scalable and highly decoupled storage architectures.

WSC Storage

  • Local Storage:
    • Servers may include 1 to 8 direct-attached HDDs or NVMe Flash SSDs for workloads demanding high localized throughput (e.g., tens of GB/s in aggregate across 8 NVMe SSDs).
    • Data on local storage is encrypted and inherently ephemeral; persistence requires VM software to copy data across multiple instances.
  • Distributed Storage:
    • Decoupling compute hardware from persistent storage allows VMs to migrate or restart without data loss while transferring persistence, scaling, and recovery responsibilities to the cloud provider.
    • Hardware Arrays:
      • JBOD (Just a Bunch of Disks): Petascale HDD arrays serving high-capacity needs (e.g., Meta Grand Canyon fitting 72 HDDs in 4U).
      • JBOF (Just a Bunch of Flash): NVMe SSD arrays utilizing M.2 or U.2 form factors for high-throughput tiers (e.g., Meta Lightning).
    • Software-Defined Reliability:
      • Arrays connect via PCIe to head nodes; WSC software running on these nodes implements replication and erasure coding across the entire facility to survive disk, rack, or network failures.
      • Complete replication (e.g., 3x) tolerates the loss of two copies and offers instantaneous recovery, but requires a 200% storage capacity overhead.
      • Reed-Solomon (RS) erasure codes lower capacity overhead but introduce computational and network latency during recovery; RS(6,3) requires 50% overhead and reading 6 blocks to reconstruct 1, while RS(10,4) uses 40% overhead but requires reading 10 blocks.
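
The capacity/recovery trade-off in the bullets above reduces to simple arithmetic, sketched below (replication is modeled as one data block stored r times; RS(d,p) stores p parity blocks per d data blocks):

```python
# Sketch: comparing the redundancy schemes from the notes above.

def rs_overhead(d: int, p: int) -> float:
    """Extra capacity, as a fraction of user data, for RS(d, p):
    p parity blocks protecting d data blocks."""
    return p / d

def replication_overhead(r: int) -> float:
    """Extra capacity for r-way replication."""
    return r - 1

print(f"3x replication: {replication_overhead(3):.0%} overhead, "
      f"read 1 block to recover")
print(f"RS(6,3):  {rs_overhead(6, 3):.0%} overhead, "
      f"read 6 blocks to reconstruct 1")
print(f"RS(10,4): {rs_overhead(10, 4):.0%} overhead, "
      f"read 10 blocks to reconstruct 1")
```

The numbers make the trade explicit: erasure coding cuts capacity overhead from 200% to 40-50%, but every reconstruction becomes a multi-block network read instead of a single copy.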

Connecting these vast, decoupled pools of compute and distributed storage demands a fault-tolerant network fabric capable of extreme bisection bandwidth.

WSC Networking

  • Connecting 50,000+ nodes via a perfect crossbar for full bisection bandwidth is impossible, as costs scale quadratically (a full crossbar needs O(n²) cross-points for n ports).
  • Clos Topologies:
    • WSCs utilize multi-stage Clos networks to approximate crossbar connectivity dynamically and cost-effectively.
    • The network is structured in layers: Top-of-Rack (TOR or L1) switches, Leaf (L2) switches, and Spine (L3) switches.
    • Oversubscription intentionally limits bandwidth to reduce costs; a 3:1 oversubscription ratio at the TOR means downlink capacity to servers is three times the uplink capacity to Leaf switches.
    • To mitigate oversubscription congestion, WSC software colocates interdependent workloads, randomizes paths through the multi-path Clos network, and distributes processes to avoid single points of failure.
  • Optical Circuit Switches (OCS):
    • OCS units deploy microelectromechanical systems (MEMS) mirrors to dynamically alter fiber optic connections between switches in milliseconds.
    • Because OCS mirrors only reflect light and do not process packets, they are rate-agnostic and support transparent upgrades to higher link speeds.
    • OCS layers enable topology engineering, allowing operators to directly configure bandwidth between Leaf switches to match daily traffic patterns instead of relying entirely on Spine routing.
  • Control Plane: WSC fabrics utilize centralized Software Defined Networking (SDN) controllers for real-time routing, traffic engineering, and congestion control rather than decentralized Internet protocols like BGP.
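
Two quick calculations make the bullets above concrete: the quadratic cross-point count that rules out a flat crossbar, and the effective per-server bandwidth left after oversubscription. The node count and ratio are illustrative:

```python
# Sketch: crossbar infeasibility and the cost of oversubscription.
# 50,000 endpoints and the 3:1 ratio are the examples used in the notes.

def crossbar_crosspoints(n: int) -> int:
    # A full crossbar needs on the order of n^2 cross-points.
    return n * n

def per_server_uplink_gbps(nic_gbps: float, oversubscription: float) -> float:
    # Effective bandwidth leaving the rack per server under k:1 oversubscription,
    # when every server transmits out of the rack at once.
    return nic_gbps / oversubscription

print(f"cross-points for 50,000 endpoints: {crossbar_crosspoints(50_000):,}")
print(f"100 Gbps NIC at 3:1 oversubscription: "
      f"{per_server_uplink_gbps(100, 3):.1f} Gbps effective")
```

A 2.5-billion-cross-point switch is why WSCs build multi-stage Clos fabrics instead, and the 3:1 figure shows why software must colocate chatty workloads within a rack whenever possible.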

The physical boundaries and switching layers defined by this network topology dictate the strict performance parameters that software developers must navigate.

The Programmer’s View of a WSC

  • WSC memory and storage hierarchies scale across three physical boundaries: Local Server, Rack, and WSC Row.
  • Latency and Bandwidth Scaling:
    • Local Server: Accessing DRAM takes on the order of 100 ns at tens of GB/s. NVMe Flash takes roughly 100 μs at a few GB/s. Local disk takes around 10 ms at a few hundred MB/s.
    • Rack Level (e.g., 48 servers): Network transit adds tens to hundreds of microseconds to DRAM and Flash accesses, and a comparable amount to disk accesses. Bandwidth is bottlenecked by the server’s NIC (12.5 GB/s for a 100 Gbps NIC).
    • Row Level (e.g., 30 racks): Network switches and congestion add further hundreds of microseconds to DRAM and Flash accesses, and around a millisecond to disk. Per-server bandwidth drops further due to Clos network oversubscription (often to a few GB/s).
  • Architectural Implications:
    • Because inherently slow devices like Flash and magnetic disks operate in microseconds or milliseconds, the addition of network latency is negligible. This validates the decoupled distributed storage architecture used in WSCs.
    • Conversely, network transit degrades DRAM latency catastrophically (from roughly 100 ns locally to tens or hundreds of microseconds remotely).
  • Remote Direct Memory Access (RDMA): To mitigate network latency for remote memory, RDMA implements transport protocols (like InfiniBand or RoCE) directly in hardware on the NIC. This allows zero-copy networking, bypassing OS software overheads and reducing remote memory latency to a few microseconds.
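
The hierarchy above can be modeled as total access time = latency + size / bandwidth at each tier. The figures below are assumed orders of magnitude consistent with the notes, not measured values:

```python
# Sketch: why network latency is negligible for Flash/disk but fatal for DRAM.
# (latency_seconds, bandwidth_bytes_per_second) -- assumed rough magnitudes.

TIERS = {
    "local DRAM":       (100e-9, 20e9),
    "local Flash":      (100e-6, 4e9),
    "rack DRAM (RDMA)": (5e-6,   12.5e9),  # RDMA over a 100 Gbps NIC
    "rack Flash":       (200e-6, 12.5e9),
}

def access_time(tier: str, size_bytes: int) -> float:
    latency, bandwidth = TIERS[tier]
    return latency + size_bytes / bandwidth

for tier in TIERS:
    t = access_time(tier, 64 * 1024)       # a 64 KiB transfer
    print(f"{tier:18s} {t * 1e6:8.1f} us")
```

Run the model and the architectural argument falls out: a network hop roughly doubles Flash access time but multiplies DRAM access time by orders of magnitude, which is exactly why storage is decoupled while remote memory needs RDMA to stay usable.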
