The Block I/O Layer

Device Classifications

  • Block devices: Hardware components accessed randomly via fixed-size chunks of data called blocks.
    • Examples include hard drives, Blu-ray readers, and flash memory.
    • Filesystems serve as the standard interface for block devices.
  • Character devices: Hardware components accessed sequentially as a continuous stream of bytes.
    • Examples include serial ports and keyboards.
  • Block devices require a dedicated management subsystem because random access necessitates complex navigation across the media, presenting opportunities for critical performance optimizations.

To interface with physical media, the block I/O layer must map hardware-level addresses to software-level abstractions.

Sectors and Blocks

  • Sector: The smallest addressable unit of physical hardware.
    • Sector sizes are powers of two, typically 512 bytes, though larger sizes exist (CD-ROMs, for example, use 2-kilobyte sectors).
    • Hardware-level sectors are often referred to as device blocks or hard sectors.
  • Block: The smallest addressable unit utilized by the filesystem and kernel.
    • Block sizes must be a multiple of the sector size and a power of two.
    • Block sizes cannot exceed the system page size.
    • Common block sizes are 512 bytes, 1 kilobyte, and 4 kilobytes.
    • Software-level blocks are often referred to as filesystem blocks or I/O blocks.

Once logical blocks are retrieved from physical sectors, they require an in-memory descriptor to manage their state and location.

Buffers and Buffer Heads

  • Buffer: The in-memory representation of a single disk block.
    • A single memory page can hold one or more buffers, depending on the block size relative to the page size.
  • Buffer Head (struct buffer_head): The descriptor structure that maps a specific in-memory buffer to its corresponding on-disk physical block.
    • Defined in <linux/buffer_head.h>.
    • b_state: Bit flags indicating buffer status, such as BH_Uptodate (contains valid data) or BH_Dirty (memory data is newer than disk data, requiring writeback).
    • b_count: Usage reference counter, manipulated safely via get_bh() and put_bh().
    • b_blocknr: The logical block number on the physical device.
    • b_page: Pointer to the physical memory page hosting the buffer.
    • b_data: Pointer to the exact byte offset within b_page where the block data begins.
  • In kernel versions prior to 2.6, buffer heads were the primary container for all block I/O operations.
  • This legacy approach forced the kernel to break large I/O operations into fragmented, block-sized chunks, creating significant overhead.

To overcome the inefficiency of forcing all I/O operations into single-block buffer head containers, the kernel introduced a lightweight structure for active operations.

The bio Structure

  • struct bio: The primary container for in-flight (active) block I/O operations in the 2.6 kernel.
    • Defined in <linux/bio.h>.
    • Represents I/O operations as a list of segments, enabling scatter-gather I/O where data originates from multiple non-contiguous physical pages.
  • Segment (struct bio_vec): Represents a contiguous chunk of a buffer in memory.
    • Acts as a vector mapping <page, offset, len>.
    • bv_page: Pointer to the physical page.
    • bv_offset: Byte offset within the page.
    • bv_len: Length of the buffer in bytes.
  • Key bio Data Fields:
    • bi_io_vec: Array of bio_vec structures comprising the entire I/O operation.
    • bi_vcnt: Total number of vectors in the bio_vec array.
    • bi_idx: Current execution index in the bio_vec array.
    • bi_cnt: Usage reference counter, safely incremented by bio_get() and decremented by bio_put().
  • The bi_idx field allows a single bio structure to be split mid-operation.
    • RAID drivers can duplicate a bio structure and update bi_idx to distribute the operation across multiple physical disks seamlessly.
  • The bio structure offers significant architectural advantages over the legacy buffer head approach.
    • Seamlessly handles high memory using physical pages instead of direct memory pointers.
    • Supports both standard page cache I/O and direct I/O (bypassing the page cache).
    • Contains only requisite I/O state information, minimizing structure footprint.

Buffer heads remain necessary to map individual disk blocks to memory pages, but active I/O tasks are now encapsulated within bio structures and queued for dispatch.

Request Queues

  • struct request_queue: Maintains a doubly linked list of pending block I/O requests.
    • Defined in <linux/blkdev.h>.
    • Block device drivers continuously pull requests from the head of the queue for submission to the physical hardware.
  • struct request: Represents an individual item on the request queue.
    • A single request can contain multiple bio structures if they operate on consecutive sectors on the disk, even if the memory segments themselves are discontinuous.

Because physical disk seek operations carry high latency, requests in this queue cannot be dispatched to the disk in arbitrary order.

I/O Schedulers

  • The I/O scheduler manages the request_queue to minimize disk seeks and maximize global throughput.
  • The scheduler virtualizes block device access among multiple outstanding requests.
  • To optimize disk head movement, I/O schedulers execute two primary operations:
    • Merging: Coalescing two or more adjacent on-disk requests into a single hardware command, reducing overhead and eliminating intermediate seeks.
    • Sorting: Ordering the request queue by physical disk sector.
    • This sector-based sorting algorithm mimics an elevator, gracefully moving the disk head sequentially in a single direction.

Different operational workloads require varying strategies to balance request latency against maximum global throughput, resulting in multiple scheduler implementations.

I/O Scheduler Algorithms

The Linus Elevator

  • The default I/O scheduler in the 2.4 kernel.
  • Evaluates new requests against all pending requests to perform a front merge (the new request's sectors immediately precede an existing request's) or a back merge (they immediately follow it).
  • Maintains the queue in strict sector-sorted order to preserve sequential hardware traversal.
  • Implements a basic starvation prevention mechanism by forcing requests older than a predefined threshold to the tail of the queue.
  • This age check does not enforce a true deadline, however, so request starvation, particularly of read requests, remains common.

The Deadline I/O Scheduler

  • Engineered to resolve the “writes starving reads” pathology inherent in the Linus Elevator.
    • Read requests are synchronous, causing the application to block and compounding read latency across sequential reads.
    • Write requests are asynchronous, making write latency less critical to application execution.
  • Maintains three distinct request queues:
    • Sorted Queue: Sorted by physical on-disk sector (conceptually similar to the Linus Elevator).
    • Read FIFO Queue: Sorted by submission time, enforcing a default expiration limit of 500 milliseconds.
    • Write FIFO Queue: Sorted by submission time, enforcing a default expiration limit of 5 seconds.
  • Normal operation dispatches requests from the head of the Sorted Queue to minimize seeks.
  • If a request at the head of either FIFO queue expires, the scheduler overrides the Sorted Queue and immediately services the expired FIFO queue.
  • Prioritizes read latency guarantees at the necessary expense of some global throughput.

The Anticipatory I/O Scheduler

  • Builds upon the structural foundation of the Deadline I/O scheduler, using the same three-queue architecture and expiration limits.
  • Resolves the global throughput penalty caused by read-induced seek storms during heavy write activity.
  • Implements an anticipation heuristic: after fulfilling a read request, the scheduler pauses for a brief window (6 milliseconds by default) rather than immediately seeking back to pending writes.
  • If the application issues an adjacent read request during this timeout window, the scheduler services it immediately, saving a costly pair of disk seeks.
  • Tracks per-process block I/O statistics to accurately anticipate filesystem and application behavior, significantly boosting global throughput while maintaining low read latency.

The Complete Fair Queuing (CFQ) I/O Scheduler

  • The default I/O scheduler for block devices in modern Linux kernels.
  • Assigns incoming I/O requests to specific, per-process queues.
  • Performs coalescing and sector-sorting strictly within each individual process queue.
  • Services queues in a round-robin fashion, dispatching a configurable number of requests (four by default) from each queue per cycle.
  • Ensures fair distribution of disk bandwidth across all processes, providing ideal performance for multimedia workloads requiring consistent data streams.

The Noop I/O Scheduler

  • Performs request merging but completely disables sorting and seek-prevention logic.
  • Maintains the request queue in strict FIFO order.
  • Intended for truly random-access block devices (such as flash memory cards), where physical seek overhead is nonexistent.

These distinct algorithms translate logical software I/O requests into an efficient physical dispatch order; the active scheduler can be selected at boot via the kernel parameter elevator (for example, elevator=cfq).