The Block I/O Layer
Device Classifications
- Block devices: Hardware components accessed randomly via fixed-size chunks of data called blocks.
- Examples include hard drives, Blu-ray readers, and flash memory.
- Filesystems serve as the standard interface for block devices.
- Character devices: Hardware components accessed sequentially as a continuous stream of bytes.
- Examples include serial ports and keyboards.
- Block devices require a dedicated management subsystem because random access necessitates complex navigation across the media, presenting opportunities for critical performance optimizations.
To interface with physical media, the block I/O layer must map hardware-level addresses to software-level abstractions.
Sectors and Blocks
- Sector: The smallest addressable unit of physical hardware.
- Sector sizes are powers of two, typically 512 bytes, though larger sizes such as 2 kilobytes exist for CD-ROMs.
- Hardware-level sectors are often referred to as device blocks or hard sectors.
- Block: The smallest addressable unit utilized by the filesystem and kernel.
- Block sizes must be a multiple of the sector size and a power of two.
- Block sizes cannot exceed the system page size.
- Common block sizes are 512 bytes, 1 kilobyte, and 4 kilobytes.
- Software-level blocks are often referred to as filesystem blocks or I/O blocks.
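The constraints above are simple to express in code. The sketch below (illustrative helper names, not kernel APIs) checks that a candidate block size is a power of two, a multiple of the sector size, and no larger than the page size:

```c
#include <stddef.h>

/* Returns nonzero if n is a power of two (and nonzero). */
static int is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Check the kernel's block-size constraints: power of two,
 * multiple of the sector size, and at most one page. */
static int block_size_valid(size_t block, size_t sector, size_t page)
{
    return is_power_of_two(block) &&
           block % sector == 0 &&
           block <= page;
}
```

For example, with 512-byte sectors and 4 KB pages, a 4 KB block is valid but an 8 KB block is not.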
Once logical blocks are retrieved from physical sectors, they require an in-memory descriptor to manage their state and location.
Buffers and Buffer Heads
- Buffer: The in-memory representation of a single disk block.
- A single memory page can hold one or more buffers, depending on the block size relative to the page size.
- Buffer Head (`struct buffer_head`): The descriptor structure that maps a specific in-memory buffer to its corresponding on-disk physical block.
- Defined in `<linux/buffer_head.h>`.
- `b_state`: Bit flags indicating buffer status, such as `BH_Uptodate` (contains valid data) or `BH_Dirty` (memory data is newer than disk data, requiring writeback).
- `b_count`: Usage reference counter, manipulated safely via `get_bh()` and `put_bh()`.
- `b_blocknr`: The logical block number on the physical device.
- `b_page`: Pointer to the physical memory page hosting the buffer.
- `b_data`: Pointer to the exact byte offset within `b_page` where the block data begins.
- In kernel versions prior to 2.6, buffer heads were the primary container for all block I/O operations.
- This legacy approach forced the kernel to break large I/O operations into fragmented, block-sized chunks, creating significant overhead.
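The page/offset relationship among `b_page`, `b_data`, and `b_size` can be modeled in a few lines. This is a userspace sketch with hypothetical `_sketch` types, not the real kernel structures: the descriptor simply points `b_data` at the block's byte offset within its host page.

```c
#include <stddef.h>

/* Stand-in for a 4 KB physical page. */
struct page_sketch { unsigned char data[4096]; };

/* Minimal model of a buffer head's location fields. */
struct buffer_head_sketch {
    struct page_sketch *b_page;    /* page hosting the buffer      */
    unsigned char      *b_data;    /* start of block within page   */
    size_t              b_size;    /* block size in bytes          */
    unsigned long long  b_blocknr; /* logical block on the device  */
};

/* Map the idx-th block-sized buffer within a page. */
static void map_buffer(struct buffer_head_sketch *bh,
                       struct page_sketch *page,
                       size_t block_size, size_t idx,
                       unsigned long long blocknr)
{
    bh->b_page    = page;
    bh->b_size    = block_size;
    bh->b_data    = page->data + idx * block_size;
    bh->b_blocknr = blocknr;
}
```

With 1 KB blocks, a 4 KB page hosts four buffers, and the buffer at index 2 begins 2048 bytes into the page.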
To overcome the inefficiency of forcing all I/O operations into single-block buffer head containers, the kernel introduced a lightweight structure for active operations.
The bio Structure
- `struct bio`: The primary container for in-flight (active) block I/O operations in the 2.6 kernel.
- Defined in `<linux/bio.h>`.
- Represents I/O operations as a list of segments, enabling scatter-gather I/O where data originates from multiple non-contiguous physical pages.
- Segments (`struct bio_vec`): Each segment represents a contiguous chunk of a buffer in memory.
- Acts as a vector mapping `<page, offset, len>`.
- `bv_page`: Pointer to the physical page.
- `bv_offset`: Byte offset within the page.
- `bv_len`: Length of the buffer in bytes.
- Key `bio` Data Fields:
- `bi_io_vec`: Array of `bio_vec` structures comprising the entire I/O operation.
- `bi_vcnt`: Total number of vectors in the `bio_vec` array.
- `bi_idx`: Current execution index in the `bio_vec` array.
- `bi_cnt`: Usage reference counter, safely incremented by `bio_get()` and decremented by `bio_put()`.
- The `bi_idx` field allows a single `bio` structure to be split mid-operation.
- RAID drivers can duplicate a `bio` structure and update `bi_idx` to distribute the operation across multiple physical disks seamlessly.
- The `bio` structure offers significant architectural advantages over the legacy buffer head approach.
- Seamlessly handles high memory by referencing physical pages instead of direct memory pointers.
- Supports both standard page cache I/O and direct I/O (bypassing the page cache).
- Contains only requisite I/O state information, minimizing structure footprint.
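The scatter-gather idea can be demonstrated with a simplified userspace model of a bio. The `_sketch` structs below mirror the `bi_io_vec`/`bi_vcnt`/`bi_idx` fields but are not the real kernel definitions; summing `bv_len` from `bi_idx` onward also shows how advancing `bi_idx` effectively splits an in-flight operation:

```c
#include <stddef.h>

/* Simplified model of one <page, offset, len> segment. */
struct bio_vec_sketch {
    void        *bv_page;   /* page holding this chunk        */
    unsigned int bv_offset; /* byte offset within the page    */
    unsigned int bv_len;    /* length of the chunk in bytes   */
};

/* Simplified model of an in-flight I/O operation. */
struct bio_sketch {
    struct bio_vec_sketch *bi_io_vec; /* segment array           */
    unsigned short         bi_vcnt;   /* segments in the array   */
    unsigned short         bi_idx;    /* next segment to process */
};

/* Total bytes remaining, counted from bi_idx onward; bumping
 * bi_idx is how a driver hands off the tail of an operation. */
static size_t bio_remaining_bytes(const struct bio_sketch *bio)
{
    size_t total = 0;
    for (unsigned short i = bio->bi_idx; i < bio->bi_vcnt; i++)
        total += bio->bi_io_vec[i].bv_len;
    return total;
}
```

Three segments of 512, 1024, and 512 bytes describe a single 2 KB operation even though the memory pages are non-contiguous; setting `bi_idx` to 1 leaves 1536 bytes outstanding.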
Buffer heads remain necessary to map individual disk blocks to memory pages, but active I/O tasks are now encapsulated within bio structures and queued for dispatch.
Request Queues
- `struct request_queue`: Maintains a doubly linked list of pending block I/O requests.
- Defined in `<linux/blkdev.h>`.
- Block device drivers continuously pull requests from the head of the queue for submission to the physical hardware.
- `struct request`: Represents an individual item on the request queue.
- A single request can contain multiple `bio` structures if they operate on consecutive sectors on the disk, even if the memory segments themselves are discontinuous.
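The "consecutive sectors" condition is just sector arithmetic. The sketch below (hypothetical helpers, not kernel code) shows the two classic checks: a new bio can be merged onto the back of a request when it starts exactly where the request ends, or onto the front when it ends exactly where the request begins:

```c
/* Illustrative stand-in for the kernel's sector type. */
typedef unsigned long long sector_t_sketch;

/* Minimal model of a queued request's on-disk extent. */
struct request_sketch {
    sector_t_sketch start;    /* first sector of the request */
    sector_t_sketch nsectors; /* length in sectors           */
};

/* New bio begins exactly where the request ends. */
static int can_back_merge(const struct request_sketch *rq,
                          sector_t_sketch bio_start)
{
    return rq->start + rq->nsectors == bio_start;
}

/* New bio ends exactly where the request begins. */
static int can_front_merge(const struct request_sketch *rq,
                           sector_t_sketch bio_start,
                           sector_t_sketch bio_nsectors)
{
    return bio_start + bio_nsectors == rq->start;
}
```

A request covering sectors 100–107 can absorb a bio starting at sector 108 (back merge) or one covering 92–99 (front merge), but nothing else.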
Because physical hardware seek operations are highly latent, requests in this queue cannot be dispatched indiscriminately to the disk.
I/O Schedulers
- The I/O scheduler manages the `request_queue` to minimize disk seeks and maximize global throughput.
- The scheduler virtualizes block device access among multiple outstanding requests.
- To optimize disk head movement, I/O schedulers execute two primary operations:
- Merging: Coalescing two or more adjacent on-disk requests into a single hardware command, reducing overhead and eliminating intermediate seeks.
- Sorting: Ordering the request queue by physical disk sector.
- This sector-based sorting algorithm mimics an elevator, gracefully moving the disk head sequentially in a single direction.
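Sector-sorted insertion is the heart of the elevator behavior. This toy version (an insertion sort over an array of start sectors, not the kernel's list-based implementation) keeps pending requests ordered so the head sweeps in one direction:

```c
#include <stddef.h>

/* Toy elevator: keep pending request start sectors in ascending
 * order so dispatch walks the disk sequentially. The real kernel
 * uses a sorted linked list of struct request, not an array. */
static void elevator_insert(unsigned long long *queue, size_t *count,
                            unsigned long long sector)
{
    size_t i = *count;
    /* Shift larger sectors right to make room, insertion-sort style. */
    while (i > 0 && queue[i - 1] > sector) {
        queue[i] = queue[i - 1];
        i--;
    }
    queue[i] = sector;
    (*count)++;
}
```

Submitting requests for sectors 300, 100, and 200 in that order still yields a dispatch order of 100, 200, 300.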
Different operational workloads require varying strategies to balance request latency against maximum global throughput, resulting in multiple scheduler implementations.
I/O Scheduler Algorithms
The Linus Elevator
- The default I/O scheduler in the 2.4 kernel.
- Evaluates new requests against all pending requests to perform front merging (inserting immediately before an existing request) or back merging (inserting immediately after).
- Maintains the queue in strict sector-sorted order to preserve sequential hardware traversal.
- Implements a basic starvation prevention mechanism by forcing requests older than a predefined threshold to the tail of the queue.
- This age check is inefficient and frequently results in read request starvation.
The Deadline I/O Scheduler
- Engineered to resolve the “writes starving reads” pathology inherent in the Linus Elevator.
- Read requests are synchronous, causing the application to block and compounding read latency across sequential reads.
- Write requests are asynchronous, making write latency less critical to application execution.
- Maintains three distinct request queues:
- Sorted Queue: Sorted by physical on-disk sector (conceptually similar to the Linus Elevator).
- Read FIFO Queue: Sorted by submission time, enforcing a default expiration limit of 500 milliseconds.
- Write FIFO Queue: Sorted by submission time, enforcing a default expiration limit of 5 seconds.
- Normal operation dispatches requests from the head of the Sorted Queue to minimize seeks.
- If a request at the head of either FIFO queue expires, the scheduler overrides the Sorted Queue and immediately services the expired FIFO queue.
- Prioritizes read latency guarantees at the necessary expense of some global throughput.
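The dispatch decision reduces to a simple priority check. This sketch (illustrative names and plain tick counts, not the kernel's implementation) services the sorted queue unless the request at the head of a FIFO queue has passed its deadline, with expired reads winning over expired writes:

```c
/* Which queue the deadline scheduler should service next. */
enum dl_source { DL_SORTED, DL_READ_FIFO, DL_WRITE_FIFO };

/* Pick a queue given the current time and the deadlines of the
 * requests at the head of each FIFO queue. Expired reads win
 * first because read latency is the scheduler's priority. */
static enum dl_source deadline_pick(long now,
                                    long read_head_deadline,
                                    long write_head_deadline)
{
    if (now >= read_head_deadline)
        return DL_READ_FIFO;
    if (now >= write_head_deadline)
        return DL_WRITE_FIFO;
    return DL_SORTED;
}
```

While no deadline has passed, the scheduler behaves like a plain elevator; the FIFO queues only intervene to cap worst-case latency.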
The Anticipatory I/O Scheduler
- Builds upon the structural foundation of the Deadline I/O scheduler, utilizing the identical three-queue architecture and expiration limits.
- Resolves the global throughput penalty caused by read-induced seek storms during heavy write activity.
- Implements an anticipation heuristic: after fulfilling a read request, the scheduler pauses for a brief window (default 6 milliseconds) rather than immediately seeking back to pending writes.
- If the application issues an adjacent read request during this timeout window, the scheduler services it immediately, saving a costly pair of disk seeks.
- Tracks per-process block I/O statistics to accurately anticipate filesystem and application behavior, significantly boosting global throughput while maintaining low read latency.
The Complete Fair Queuing (CFQ) I/O Scheduler
- The default I/O scheduler for block devices in modern Linux kernels.
- Assigns incoming I/O requests to specific, per-process queues.
- Performs coalescing and sector-sorting strictly within each individual process queue.
- Services queues in a round-robin fashion, dispatching a configurable number of requests (default 4) from each queue per cycle.
- Ensures fair distribution of disk bandwidth across all processes, providing ideal performance for multimedia workloads requiring consistent data streams.
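One round of CFQ-style dispatch can be modeled as a round-robin walk over per-process queues with a per-queue batch limit. This is a toy model (queues reduced to pending-request counts, illustrative names only), not the kernel's implementation:

```c
#include <stddef.h>

/* Toy CFQ round: visit per-process queues in order, taking up to
 * `batch` requests (4 by default) from each before moving on.
 * `pending` holds each queue's remaining request count and is
 * updated in place; `dispatched_per_queue` records what was taken.
 * Returns the total number of requests dispatched this round. */
static size_t cfq_round(size_t *pending, size_t nqueues, size_t batch,
                        size_t *dispatched_per_queue)
{
    size_t total = 0;
    for (size_t q = 0; q < nqueues; q++) {
        size_t take = pending[q] < batch ? pending[q] : batch;
        pending[q] -= take;
        dispatched_per_queue[q] = take;
        total += take;
    }
    return total;
}
```

A process with a deep backlog still gets only its batch per round, so a lightly loaded process is never starved of bandwidth.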
The Noop I/O Scheduler
- Performs request merging but completely disables sorting and seek-prevention logic.
- Maintains the request queue in strict FIFO order.
- Targeted exclusively for true random-access block devices (such as flash memory cards) where physical seek overhead is non-existent.
These distinct algorithms collectively translate logical software I/O requests into optimized physical hardware execution, with the active scheduler selectable at boot via the `elevator=` kernel parameter (for example, `elevator=cfq`).