Persistence, Storage, and Devices

The integration of input/output (I/O) devices into computer systems bridges the gap between processing architectures and persistent data storage. Efficient management of physical storage hardware relies on specialized access protocols, hierarchical buses, and sophisticated scheduling algorithms, which together form the foundation for multi-device structures like RAID.

System Architecture and I/O Devices

Modern systems utilize a hierarchical bus structure to balance cost and performance: physics dictates that the faster a bus is, the shorter it must be, so high-speed buses can connect only a few components close to the CPU.

  • Bus Hierarchy:
    • Memory Bus: Connects the CPU directly to high-speed RAM.
    • General I/O Bus (e.g., PCI): Connects higher-performance devices like graphics cards.
    • Peripheral I/O Bus (e.g., SCSI, SATA, USB): Hosts slower devices such as disk drives and mice, enabling massive scaling of connected devices.
  • Canonical Device Components:
    • Hardware Interface: Exposes status, command, and data registers to allow system software to control device operation.
    • Internal Structure: Contains a microcontroller, volatile/non-volatile memory, and specialized logic chips to implement the device abstraction.
  • Device Interaction Methods:
    • Explicit I/O Instructions: Privileged instructions (e.g., in and out on x86) specify a target device port and a register containing data to transfer.
    • Memory-Mapped I/O: Hardware maps device registers to specific physical memory addresses. The OS communicates by issuing standard load and store instructions to these addresses.

Communication Protocols and CPU Overhead

The base protocol for device interaction is Programmed I/O (PIO), which requires the CPU to manually move data and poll device status. Advanced mechanisms exist to mitigate PIO overheads.

  • Programmed I/O (PIO) with Polling: The OS repeatedly reads the device’s status register until the device is ready, writes data to the data register, writes a command to the command register, and polls again until completion. Polling completely monopolizes CPU cycles during wait times.
  • Interrupts: Instead of polling, the OS issues an I/O request, sleeps the calling process, and schedules another task. When the device finishes, it raises a hardware interrupt, triggering an Interrupt Service Routine (ISR) to handle completion.
    • Interrupts successfully overlap computation with I/O but incur context-switching overhead.
    • For exceptionally fast devices or heavy network loads, polling may outperform interrupts by avoiding context switches and livelock. Hybrid approaches poll briefly before falling back to interrupts.
    • Interrupt Coalescing: Delays interrupt delivery slightly to batch multiple completions into a single interrupt, trading minimal latency for reduced CPU overhead.
  • Direct Memory Access (DMA): A specialized hardware engine that orchestrates data transfers directly between memory and devices. The OS programs the DMA controller with the memory location, transfer length, and target device. The DMA engine executes the transfer independently and interrupts the CPU only upon completion, bypassing PIO data-movement bottlenecks.
  • Device Drivers: OS subsystems encapsulating hardware-specific device interactions. They present generic block-layer interfaces to the file system, abstracting away complex protocols (e.g., SCSI or SATA). Because every piece of hardware requires one, drivers constitute roughly 70% of Linux kernel source code and are a primary source of kernel bugs.

To understand how device drivers physically manipulate data, it is necessary to map the generic block-layer abstractions to the geometric and mechanical realities of the standard persistent medium: the hard disk drive.

Hard Disk Drives (HDDs)

A hard disk drive persistently stores data and presents a standardized interface consisting of a contiguous array of 512-byte sectors. The drive guarantees that 512-byte sector writes are atomic, meaning they will either complete entirely or not at all, preventing torn writes.

  • Physical Geometry:
    • Platter & Spindle: Circular hard surfaces coated in a magnetic layer, bound to a motor-driven spindle spinning at a constant Rotations Per Minute (RPM) (typically 7,200 to 15,000 RPM).
    • Track: A single concentric circle of sectors on a surface.
    • Disk Head & Arm: A sensor attached to a mechanical arm that sweeps across the platter to read/write specific tracks.
  • Advanced Geometries:
    • Track Skew: Sectors are staggered across adjacent tracks to account for the time required to switch tracks, ensuring sequential reads do not miss the next logical block and incur an entire rotational penalty.
    • Multi-zoned Drives: Outer tracks encompass greater physical circumference and contain more sectors than inner tracks.
    • Track Buffer / Cache: Built-in RAM (8-16 MB) caches read/written tracks. Drives may utilize write-back caching (acknowledging writes immediately upon buffering) for speed, or write-through caching for reliability against power loss.
  • I/O Time Performance Model:
    • The total time for an I/O request consists of three components: T_I/O = T_seek + T_rotation + T_transfer.
    • Seek Time (T_seek): The time to position the disk arm over the correct track, encompassing acceleration, coasting, deceleration, and settling phases. Averaged over all possible start and end positions, the mean seek distance is one-third of a full disk sweep.
    • Rotational Delay (T_rotation): The time waiting for the target sector to rotate under the head. Average rotational delay is T_full / 2, where T_full is the time of one full rotation.
    • Transfer Time (T_transfer): The time to read/write the data at the drive’s peak transfer rate.
  • Workload Performance: The rate of I/O is calculated as R_I/O = Size_Transfer / T_I/O. Because T_seek and T_rotation strictly dominate small random I/O requests, sequential workloads vastly outperform random workloads, often by two or more orders of magnitude (roughly 100+ MB/s sequential versus well under 1 MB/s random on a typical drive).

Because mechanical positioning drastically affects performance (T_seek and T_rotation), the operating system and disk controllers actively manipulate request execution orders to maximize R_I/O.

Disk Scheduling

Disk schedulers predict the physical service time of I/O operations and reorder queues to approximate Shortest Job First (SJF) execution.

  • SSTF (Shortest Seek Time First): Services requests closest to the current track. While improving seek efficiency, a steady stream of requests near the current track causes starvation for distant tracks. Because drive geometry is hidden from the OS, implementations approximate SSTF with Nearest-Block-First (NBF), scheduling by the nearest logical block address.
  • SCAN (Elevator) & C-SCAN: The disk arm sweeps end-to-end across the disk, servicing requests in path order. F-SCAN freezes the active queue during a sweep to delay incoming localized requests. C-SCAN sweeps in one direction only (outer-to-inner), providing fairer service times to edge tracks than bidirectional SCAN. SCAN prevents starvation but ignores rotational delay costs.
  • SPTF / SATF (Shortest Positioning Time First): Evaluates both seek and rotation times to minimize total physical delay. If seek costs are vastly lower than rotational costs, jumping to a distant track may be faster than waiting for rotation on a nearby track. SPTF requires precise head position tracking and is implemented internally by the drive controller rather than the OS.
  • I/O Merging & Anticipatory Scheduling: Schedulers merge adjacent sector requests into single multi-block requests. Anticipatory (non-work-conserving) schedulers temporarily pause execution after an I/O completes to wait for highly efficient, contiguous requests rather than immediately seeking distant random requests.

While sophisticated scheduling maximizes the throughput of an individual drive, single HDDs remain fundamentally bound by mechanical limits and represent a single point of failure. These limitations necessitate combining multiple disks into a logical array.

Redundant Arrays of Inexpensive Disks (RAID)

RAID groups multiple physical disks into a unified storage system to improve capacity, performance, and reliability. Transparent to the host OS, a hardware RAID presents itself as a single block device while internally utilizing microcontrollers, volatile/non-volatile memory, and custom parity logic to route physical I/O requests.

  • Fault Model (Fail-Stop): RAIDs typically assume a fail-stop model where a disk is either completely operational or entirely failed, and the array controller can instantaneously detect this failure state.
  • Evaluation Metrics: RAID designs are evaluated on useful Capacity, Reliability (number of tolerated disk failures), and Steady-state Throughput (Performance under Sequential and Random Read/Write workloads).

RAID Levels

By varying data striping and redundancy techniques, different RAID levels optimize specific subsets of the evaluation metrics. Performance assumes N disks, each holding B blocks, with a peak sequential bandwidth of S and a peak random bandwidth of R per disk.

  • RAID Level 0: Striping
    • Spreads contiguous blocks across disks without redundancy.
    • Chunk size trade-off: Small chunk sizes stripe individual files across many disks, maximizing intra-file parallelism but incurring worst-case seek times for every request. Large chunk sizes isolate files to single disks, reducing positioning time but relying entirely on concurrent inter-file I/O to achieve parallelism.
    • Capacity: N · B.
    • Reliability: 0 failures tolerated.
    • Performance: All disks are fully utilized. Sequential Read/Write: N · S. Random Read/Write: N · R.
  • RAID Level 1: Mirroring
    • Maintains duplicate copies of every block on distinct physical disks (e.g., RAID-10 stripes across mirrored pairs, RAID-01 mirrors striped arrays).
    • Consistent-update problem: If power is lost after writing only one mirror copy, the copies become inconsistent. Mitigated using battery-backed NVRAM to write-ahead log operations before issuing physical disk writes.
    • Capacity: (N · B) / 2.
    • Reliability: Guarantees tolerance of 1 disk failure (up to N/2 if each failure strikes a different mirror pair).
    • Performance:
      • Sequential Write: (N/2) · S (half bandwidth, as each logical write dictates two physical writes).
      • Sequential Read: (N/2) · S (peak throughput is halved because each disk skips every other block, inducing unutilized rotational delays).
      • Random Read: N · R (reads distribute fully across all disks).
      • Random Write: (N/2) · R (parallel physical writes execute, but each logical write costs two physical writes, effectively yielding half the array’s raw capability).
  • RAID Level 4: Parity
    • Reserves one dedicated disk to store parity data computed via bitwise XOR over the blocks of a stripe. The XOR invariant ensures an even number of 1 bits in any stripe row, enabling the reconstruction of any single missing block.
    • Capacity: (N − 1) · B.
    • Reliability: 1 disk failure tolerated.
    • Performance:
      • Sequential Read/Write: (N − 1) · S. Sequential writes execute as “full-stripe writes”, XORing the contiguous data in memory and writing all disks in parallel.
      • Random Read: (N − 1) · R.
      • Random Write: R/2. Driven by the small-write problem. Updating a single block uses the Subtractive Parity method: P_new = (B_old XOR B_new) XOR P_old. Each logical write requires reading the old data and old parity, computing the new parity, and writing the new data and new parity. Because all writes update the solitary parity disk, which must perform one read and one write per logical write, parallel random writes are entirely serialized at the parity disk, capping maximum system throughput to half of a single disk’s random bandwidth (R/2).
  • RAID Level 5: Rotating Parity
    • Identical to RAID-4, but rotates the designated parity block uniformly across all disks in the array. This eliminates the single-disk parity bottleneck that severely limits RAID-4.
    • Capacity & Reliability: Identical to RAID-4.
    • Performance:
      • Sequential Read/Write: (N − 1) · S.
      • Random Read: N · R (utilizes all disks, since parity blocks no longer reside on a single dedicated disk).
      • Random Write: (N/4) · R. Parallelism is restored, but each logical write still incurs the 4-I/O penalty (read old data and parity, write new data and parity) mandated by subtractive parity updates, dividing the array’s raw random bandwidth by four.