Storage Systems
Disk Storage Technologies and Density
Magnetic disk capacity improvements are driven by areal density, defined as bits per square inch:
Areal density = Tracks per inch on a disk surface × Bits per inch on a track
Historically, areal density grew at rates of up to 100% per year, but growth has stabilized at around 30% per year. Despite the cost per gigabyte dropping rapidly, a massive performance gap persists between DRAM and magnetic disks: DRAM latency is roughly 100,000 times lower than disk latency, but DRAM costs significantly more per gigabyte.
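Under compound growth, the rates above translate directly into capacity-doubling times; a quick sketch (the growth rates are the figures quoted above, nothing else is assumed):

```python
import math

# Years for areal density to double at a compound annual growth rate.
def years_to_double(annual_growth: float) -> float:
    return math.log(2) / math.log(1 + annual_growth)

print(f"{years_to_double(1.00):.2f}")  # 100%/year era: doubling every year
print(f"{years_to_double(0.30):.2f}")  # 30%/year era: doubling every ~2.6 years
```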
Flash memory fills a portion of this gap as a non-volatile semiconductor memory. Flash delivers bandwidth comparable to disk but with latency 100 to 1,000 times lower. Flash storage is limited by wear-out mechanisms, typically restricting each cell to on the order of 1 million write cycles.
Disk physical designs have shifted away from the traditional sector-track-cylinder assumptions. Modern disks feature:
- Serpentine block ordering: Logical blocks are ordered sequentially across a single surface to optimize for zones of varying recording densities.
- Intelligent scheduling: On-disk buffers, caches, and command queues reorder requests to maximize throughput and minimize mechanical delays.
- Reduced platters: High-density disks use fewer platters, diminishing the traditional importance of cylinder-aligned accesses.
Disk motor power is the primary source of power consumption and scales steeply with platter diameter, rotational speed, and platter count:
Power ≈ Diameter^4.6 × RPM^2.8 × Number of platters
SATA drives optimize for capacity and cost (slower RPM, larger and more platters), whereas SAS drives optimize for performance (faster RPM, smaller-diameter platters) at the cost of higher power consumption.
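The proportional power model can be used to compare drive design points. A minimal sketch, assuming the Power ∝ Diameter^4.6 × RPM^2.8 × platters model; the drive parameters below are illustrative, not vendor specifications:

```python
# Relative motor power under the proportional model
# Power ∝ Diameter^4.6 × RPM^2.8 × Number of platters.
# (Units cancel in a ratio, so only relative values matter.)
def relative_power(diameter_in: float, rpm: float, platters: int) -> float:
    return (diameter_in ** 4.6) * ((rpm / 1000) ** 2.8) * platters

# Illustrative design points: a capacity-oriented SATA drive vs. a
# performance-oriented SAS drive with smaller platters but faster RPM.
sata = relative_power(3.7, 7200, 5)
sas = relative_power(2.6, 15000, 2)
print(f"SAS/SATA relative power: {sas / sata:.2f}")
```

The steep RPM exponent is why performance drives shrink their platters: a smaller diameter buys back the power cost of spinning faster.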
To safely aggregate these high-capacity mechanical devices without unacceptable increases in failure rates, storage arrays must utilize redundancy mechanisms.
Redundant Arrays of Inexpensive Disks (RAID)
Disk arrays increase throughput by striping data across multiple drives, allowing parallel operations. Because adding devices decreases overall reliability, RAID introduces redundant disks to reconstruct data following a failure.
- RAID 0: Striped data with no redundancy.
- RAID 1: Mirrored data. Highest cost but fast recovery.
- RAID 2: Memory-style error-correcting codes (obsolete).
- RAID 3: Bit-interleaved parity using a single check disk. Optimal for large sequential reads/writes.
- RAID 4: Block-interleaved parity. Allows independent small reads. Small writes are a bottleneck, requiring 4 disk accesses: read old data, read old parity, write new data, write new parity.
- RAID 5: Block-interleaved distributed parity. Distributes the parity blocks across all disks to eliminate the single check disk bottleneck of RAID 4.
- RAID 10 vs. RAID 01: RAID 10 stripes data across mirrored pairs (“striped mirrors”). RAID 01 creates two striped sets and mirrors the sets (“mirrored stripes”).
- RAID 6 (Row-Diagonal Parity / RAID-DP): Protects against two simultaneous disk failures by using two check disks per stripe: one for row parity and one for diagonal parity. During a double failure, recovery alternates between resolving the single missing block in a diagonal and the single missing block in a row.
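The small-write parity update that makes RAID 4/5 writes expensive can be sketched directly (block contents below are illustrative byte strings):

```python
# RAID 4/5 small-write parity update:
#   new_parity = old_parity XOR old_data XOR new_data
# This identity is why a small write costs 4 disk accesses:
# read old data, read old parity, write new data, write new parity.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def update_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)

# Sanity check: after updating one block, the incrementally maintained
# parity still equals the parity of the whole stripe.
d0, d1, d2 = b"\x0f" * 4, b"\xf0" * 4, b"\x3c" * 4
parity = xor_blocks(xor_blocks(d0, d1), d2)
new_d1 = b"\xaa" * 4
parity = update_parity(d1, new_d1, parity)
assert parity == xor_blocks(xor_blocks(d0, new_d1), d2)
```

Because the update needs only the old data and old parity, a small write never has to read the rest of the stripe, which is what makes independent small writes possible at all.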
While RAID configurations mitigate data loss from physical disk degradation, broader storage system design requires precise definitions of how and why system components fail.
Faults, Errors, and Failures
Storage systems are held to stringent dependability standards, defined as the quality of delivered service such that reliance can justifiably be placed upon it.
- Failure: Occurs when a module’s actual behavior deviates from its specified ideal behavior.
- Error: A defect within a module. An error remains latent until activated; it causes a failure only when it affects the delivered service.
- Fault: The root cause of an error. Faults are categorized by cause:
- Hardware faults: Physical device malfunctions.
- Design faults: Bugs in software or flaws in hardware logic.
- Operation faults: Mistakes by operations and maintenance personnel.
- Environmental faults: External events like power outages, fires, or earthquakes.
Large-scale system observations, such as the Tertiary Disk cluster, demonstrate that environmental controls dictate component survival. Data disks properly housed with vibration and cooling controls exhibit high reliability, whereas peripheral components (cables, backplanes, ATA boot disks) fail frequently. Furthermore, human operator faults during maintenance are a primary cause of system crashes, necessitating architectures that tolerate operational errors.
Understanding how a system degrades under these faults directly influences the metrics used to evaluate the storage subsystem’s performance and availability.
I/O Performance and Benchmarks
I/O performance evaluates throughput (data rate or operations per second) and response time (latency). The producer-server model dictates that response time encompasses both time waiting in a buffer and time spent being serviced.
Throughput and response time trade off under load. Pushing throughput past the “knee” of the performance curve causes response times to grow steeply, approaching infinity as utilization nears 100%. User productivity requires low response times: studies show that as system response time shrinks, human think time shrinks as well, so faster systems improve productivity by more than the raw response-time savings alone.
Industry standard benchmarks measure storage performance under strict response time limits:
- TPC-C (Online Transaction Processing - OLTP): Measures disk accesses per second. Requires 90% of transactions to complete in under 5 seconds.
- SPEC SFS (System File Server): Measures NFS operations per second using a synthetic mix of reads, writes, and metadata operations. Requires average response times to remain under 40 ms.
Availability and performability are measured by injecting faults into a running system and monitoring Quality of Service (QoS) degradation over time. System policies impact these metrics; for example, initiating rapid RAID reconstruction degrades running application performance but minimizes the window of vulnerability to a second fatal disk failure.
To accurately predict these trade-offs in response time and throughput under varying loads without relying purely on empirical fault testing, architects utilize mathematical modeling.
Queuing Theory for Storage Systems
Queuing theory provides analytical models for predicting I/O system behavior assuming the system is in steady state (equilibrium), where the input rate equals the output rate.
Little’s Law establishes the foundational relationship for equilibrium systems:
Mean number of tasks in system = Arrival rate × Mean response time
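Little's Law takes one line of code; the rates below are made-up example values:

```python
# Little's Law: tasks_in_system = arrival_rate × mean_response_time.
def tasks_in_system(arrival_rate_per_s: float, response_time_s: float) -> float:
    return arrival_rate_per_s * response_time_s

# A server receiving 40 I/Os per second with a 50 ms mean response time
# holds, on average, 40 × 0.050 = 2 tasks (waiting plus in service).
print(tasks_in_system(40.0, 0.050))
```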
Single-Server Model (M/M/1)
Characterized by exponentially distributed interarrival and service times feeding a single server.
- Server Utilization: Utilization = Arrival rate × Time_server. Utilization must remain between 0 and 1 to maintain equilibrium.
- Time in Queue: Time_queue = Time_server × Utilization / (1 − Utilization)
- Length of Queue: Length_queue = Utilization² / (1 − Utilization) (Little’s Law applied to the queue alone)
- System Response Time: Time_system = Time_server + Time_queue = Time_server / (1 − Utilization)
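The standard M/M/1 steady-state results can be packaged into a small calculator; the example arrival rate and service time are illustrative:

```python
# Standard M/M/1 results (steady state, exponential interarrival
# and service times, single server). Example rates are illustrative.
def mm1(arrival_rate: float, service_time: float) -> dict:
    util = arrival_rate * service_time
    assert 0 <= util < 1, "no steady state when utilization >= 1"
    time_queue = service_time * util / (1 - util)
    return {
        "utilization": util,
        "time_queue": time_queue,
        "length_queue": util * util / (1 - util),
        "time_system": service_time + time_queue,
    }

# A disk serving 40 I/Os per second with a 20 ms service time:
m = mm1(40.0, 0.020)
print(m["utilization"])  # ≈ 0.8
print(m["time_system"])  # ≈ 0.1 s: five times the raw service time
```

Note how the "knee" appears in the 1/(1 − Utilization) factor: at 80% utilization the response time is already 5× the service time, and it diverges as utilization approaches 1.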
Multiple-Server Model (M/M/m)
Characterized by m identical servers fed by a single shared queue.
- System Utilization: Utilization = (Arrival rate × Time_server) / m
- Probability of Tasks ≥ m (all servers busy, so an arriving task must queue): Prob_tasks≥m = ((m × Utilization)^m / (m! × (1 − Utilization))) × Prob_0tasks, where Prob_0tasks normalizes the state probabilities (the Erlang-C formula).
- Time in Queue: Time_queue = Time_server × Prob_tasks≥m / (m × (1 − Utilization))
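The standard M/M/m (Erlang-C) results can be computed directly; the example rates are illustrative:

```python
import math

# Standard M/M/m results: m identical servers, one shared queue,
# exponential interarrival and service times. Rates are illustrative.
def mmm(arrival_rate: float, service_time: float, m: int) -> dict:
    offered = arrival_rate * service_time  # offered load, in "servers"
    util = offered / m
    assert 0 <= util < 1, "no steady state when utilization >= 1"
    # Erlang-C: probability an arriving task finds all m servers busy.
    p0_inv = sum(offered ** k / math.factorial(k) for k in range(m))
    p0_inv += offered ** m / (math.factorial(m) * (1 - util))
    prob_queue = (offered ** m / (math.factorial(m) * (1 - util))) / p0_inv
    time_queue = service_time * prob_queue / (m * (1 - util))
    return {"utilization": util, "prob_queue": prob_queue,
            "time_queue": time_queue, "time_system": service_time + time_queue}

# Two disks sharing one queue, 40 I/Os per second, 20 ms service time:
m = mmm(40.0, 0.020, 2)
print(m["utilization"])  # ≈ 0.4
print(m["time_queue"])   # far shorter than one disk at 80% utilization
```

Doubling the servers more than halves the queuing delay, because each disk now runs at 40% rather than 80% utilization.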
These queuing models dictate the physical integration of I/O devices, determining how buses, controllers, and operating system abstractions are structured to handle projected loads.
I/O Interconnects and Software Abstractions
Physical I/O buses have largely migrated from wide, parallel architectures to high-speed, point-to-point serial links (e.g., PCI Express, Serial ATA). Serial links transmit in both directions concurrently (full duplex) and scale bandwidth by aggregating multiple lanes of wire pairs.
At the software layer, the operating system dictates how these hardware channels are utilized:
- Block Servers vs. Filers: Storage arrays traditionally present virtualized block volumes (Logical Units) to a host server, which manages the file system and metadata. Alternatively, Network Attached Storage (NAS) devices (Filers) natively manage the file system abstraction, allowing host servers to request data over protocols like NFS and CIFS.
- Asynchronous I/O: Disks suffer from high mechanical latency. Synchronous I/O blocks a process until data arrives. Asynchronous I/O permits a process to issue multiple overlapping requests, utilizing parallelism to hide mechanical latency, analogous to non-blocking caches in CPUs.
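The latency-hiding effect of overlapping requests can be demonstrated with a small simulation; the 10 ms sleep stands in for mechanical disk latency, and the block IDs are made up:

```python
import concurrent.futures
import time

# Sketch: overlapping I/O requests to hide latency. The sleep simulates
# a disk's seek + rotation delay; this is not a real disk driver.
def read_block(block_id: int) -> bytes:
    time.sleep(0.01)  # simulated 10 ms mechanical latency
    return b"data-%d" % block_id

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(read_block, range(8)))
elapsed = time.time() - start
# Eight overlapped 10 ms reads complete in roughly one latency,
# rather than the 80 ms a blocking, one-at-a-time loop would need.
print(f"{elapsed * 1000:.0f} ms for 8 reads")
```

This is the software analogue of a non-blocking cache: the requester keeps issuing work instead of stalling on each miss.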
The intersection of these hardware interconnects and software abstractions directly forms the blueprint for deploying massive, real-world storage clusters.
Designing and Evaluating an I/O System
I/O system design requires identifying bottlenecks across CPUs, memory buses, disk controllers, and physical disks. The performance of the system is strictly limited by the weakest link in this chain.
Design Steps:
- List supported I/O devices, buses, and networks.
- Identify physical, power, and connectivity limits.
- Determine component costs and reliability metrics.
- Record processor overhead (instructions per I/O, interrupt handling).
- Calculate memory and I/O bus bandwidth constraints.
- Evaluate aggregate performance, capacity, and availability topologies.
Internet Archive Cluster Example: A cluster utilizing 1U nodes with 4 PATA disks and a Gigabit Ethernet switch. Analysis of the data path reveals that while the CPU can handle 6,667 IOPS and the memory bus 133,000 IOPS, the physical disks bottleneck the system at ~73-77 IOPS per drive. Consequently, total node throughput is severely restricted by mechanical disk seek limits. Because individual nodes lack internal redundancy, system-level dependability is achieved by geographically replicating the entire dataset across multiple data centers.
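The bottleneck analysis above amounts to taking the minimum sustainable rate along the data path. A sketch using the IOPS figures quoted for the example node (the per-drive rate is the ~75 IOPS midpoint of the quoted range):

```python
# Weakest-link analysis for one Internet Archive node.
# Rates are the figures quoted in the example above.
rates = {
    "cpu": 6667,            # IOPS the CPU can sustain
    "memory_bus": 133_000,  # IOPS the memory system can sustain
    "disks": 75 * 4,        # 4 PATA drives at ~75 IOPS each
}
bottleneck = min(rates, key=rates.get)
print(bottleneck, rates[bottleneck])  # disks 300
```

The CPU and memory bus have over 20× headroom; only adding spindles (or replacing them with lower-latency devices) moves the node's throughput.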
NetApp FAS6000 Filer Example: An enterprise filer using AMD Opteron processors, large distributed DDR memory, and NVRAM for accelerated write logging. The system integrates massive I/O capacity (Fibre Channel, PCIe, Gigabit Ethernet) to support up to 1008 drives using RAID-DP across custom SCSI/SATA bridges.
Analyzing such complex deployments requires architecting against common industry misconceptions that frequently lead to suboptimal storage designs.
Fallacies and Pitfalls in Storage Design
- Fallacy: 99.999% availability is a standard achievement. “Five nines” equates to only 5 minutes of unavailability per year. Due to patching, configuration errors, and maintenance, well-managed servers practically achieve 99% to 99.9% availability.
- Pitfall: Moving RAID to software improves reliability. Software is difficult to isolate from OS environment variables and is highly susceptible to patch incompatibilities. Hardware RAID controllers offer strict correctness testing and isolate data protection from host OS crashes.
- Fallacy: Operating systems are the best place to schedule disk accesses. Operating systems operate on logical block addresses and do not understand the underlying disk geometry. Disk drives feature onboard controllers that execute Command Queuing to reorder requests based on precise track and sector positions, servicing accesses much faster than OS-ordered queues.
- Fallacy: Average seek time equals a seek across 1/3 of the cylinders. Seek time is non-linear; disk arms must accelerate, decelerate, and settle. Correct seek modeling requires square root functions for short distances. Furthermore, real workloads exhibit massive spatial locality, resulting in a high percentage of zero-distance seeks (accesses within the same cylinder), rendering naive distance formulas useless for performance prediction.
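The availability arithmetic behind the first fallacy is easy to check:

```python
# Minutes of downtime per year implied by a given number of "nines"
# of availability (ignoring leap years for simplicity).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(nines: int) -> float:
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability

print(f"{downtime_minutes(5):.1f}")  # five nines: ~5.3 minutes/year
print(f"{downtime_minutes(3):.0f}")  # three nines: ~526 minutes/year
```

Five nines leaves no room for even a single unplanned reboot plus patch cycle per year, which is why 99% to 99.9% is the realistic range for well-managed servers.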