File I/O

File Descriptors

File descriptors act as unique identifiers that map to open files and their associated metadata, such as file positions and access modes.

  • Representation: A file descriptor is an integer of the C int type.
  • Capacity: Descriptor values range from 0 up to one less than the per-process maximum (default 1,024, configurable up to 1,048,576).
  • Standard Descriptors: By convention, every process starts with three descriptors already open:
    • 0: Standard Input (STDIN_FILENO)
    • 1: Standard Output (STDOUT_FILENO)
    • 2: Standard Error (STDERR_FILENO)
  • Scope: File tables and descriptors are tracked on a per-process basis, though child processes receive a copy of their parent’s file table upon creation.
  • Universality: Descriptors provide access not just to regular files, but to device files, pipes, futexes, FIFOs, and sockets.

To utilize file descriptors for data access, a target file must first be opened or created to generate a valid descriptor mapping.

Opening Files

The open() and creat() system calls initialize access to a file and return a file descriptor.

  • Syntax: int open(const char *name, int flags); or, when a file may be created, int open(const char *name, int flags, mode_t mode);.
  • Access Modes: The flags argument mandates exactly one of O_RDONLY (read-only), O_WRONLY (write-only), or O_RDWR (read/write).
  • Behavioral Flags: Additional options bitwise-ORed into the flags argument alter file behavior:
    • O_APPEND: Forces the file position to the end-of-file before each write.
    • O_CREAT: Instructs the kernel to create the file if it does not exist.
    • O_TRUNC: Truncates an existing regular file to zero length, assuming write permissions.
    • O_SYNC: Forces synchronous I/O, delaying write completion until data hits the physical disk.
    • O_DIRECT: Bypasses the kernel page cache for direct hardware I/O.
  • File Creation & Permissions: If O_CREAT is passed, the mode argument dictates the permissions of the new file.
    • Permissions are bitwise-ANDed against the complement of the process’s umask.
    • Constants such as S_IRWXU (owner read/write/execute) define the mode portably.
  • creat() Alias: The function creat(name, mode) is a historic shortcut equivalent to calling open(name, O_WRONLY | O_CREAT | O_TRUNC, mode).
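
As a rough illustration of how the mode argument and the umask interact, here is a minimal sketch. The helper names create_for_writing() and permission_bits() are hypothetical and exist only for this example; the flag combination is the same one creat() uses.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or truncate) a file for writing and request
 * permissions rw-rw-rw- (0666). The kernel clears every bit that is set in
 * the process umask, so the file ends up with 0666 & ~umask.
 * Returns the new descriptor, or -1 with errno set by open(). */
int create_for_writing(const char *path)
{
    mode_t mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH;

    return open(path, O_WRONLY | O_CREAT | O_TRUNC, mode);
}

/* Hypothetical helper: report the permission bits of an open file,
 * or (mode_t)-1 if fstat() fails. */
mode_t permission_bits(int fd)
{
    struct stat st;

    if (fstat(fd, &st) == -1)
        return (mode_t)-1;
    return st.st_mode & 0777;
}
```

With a umask of 022, a file created this way would be expected to carry mode 0644, because the group and other write bits are masked off.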

Once a file descriptor is obtained via an open call, it serves as the handle for extracting data through read operations.

Reading Data

The read() system call extracts bytes from a file into a user-space buffer.

  • Syntax: ssize_t read(int fd, void *buf, size_t len);.
  • Execution: Reads up to len bytes into memory at buf from the current file position, subsequently advancing the file position by the number of bytes read.
  • Return States:
    • Success: Returns the number of bytes read. A partial read (a return value greater than zero but less than len) is legal and common.
    • EOF: Returns 0, indicating the file position has reached the end of the data.
    • Blocking: If no data is available (e.g., on a socket), the call sleeps until bytes arrive.
    • Interrupted: Returns -1 and sets errno to EINTR if a signal interrupts the call before any data is read. The call should be reissued.
    • Nonblocking: If the file was opened with O_NONBLOCK and no data is present, returns -1 with errno set to EAGAIN.
  • Loop Handling: Because partial reads are standard behavior, robust programs must wrap read() in a loop to ensure all requested bytes are retrieved.
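
The loop described above might look like the following sketch. The name read_full() is hypothetical, and the choice to hand EAGAIN back to the caller as an ordinary error is an assumption of this example rather than a rule.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until len bytes have arrived,
 * end-of-file is reached, or a real error occurs.
 * Returns the number of bytes read (less than len only at EOF),
 * or -1 with errno set. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t total = 0;

    while (total < len) {
        ssize_t n = read(fd, p + total, len - total);

        if (n == 0)
            break;                  /* EOF: nothing more to read */
        if (n == -1) {
            if (errno == EINTR)
                continue;           /* interrupted before any data: retry */
            return -1;              /* real error (EAGAIN included) */
        }
        total += (size_t)n;         /* partial read: advance and loop */
    }
    return (ssize_t)total;
}
```

A caller can tell EOF from success by comparing the return value with len: a shorter count means the data ran out first.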

Reading extracts data from the descriptor, while the inverse operation pushes data into the system’s buffers.

Writing Data

The write() system call transfers data from a user-space buffer into a file.

  • Syntax: ssize_t write(int fd, const void *buf, size_t count);.
  • Execution: Writes up to count bytes from buf to the current file position, updating the position accordingly.
  • Append Mode Safety: When multiple processes append to the same file (e.g., a shared log), opening it with O_APPEND makes the kernel move the file position to end-of-file atomically as part of each write, which keeps the processes from overwriting one another’s data.
  • Deferred Writes (Writeback):
    • write() typically returns immediately after copying data into a kernel buffer.
    • Buffers with newer data than the disk are marked “dirty”.
    • The kernel later asynchronously coalesces and flushes these dirty buffers to disk to optimize hardware performance.
    • Deferred writes risk data loss on power failure and obscure physical I/O errors.
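
To make the append-mode behavior concrete, here is a minimal sketch of a logging helper. The name append_line() and the 0644 permissions are assumptions made for the example, and treating a short write as a failure is a simplification.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: append one message to a shared log file.
 * Because the file is opened with O_APPEND, the kernel moves the file
 * position to end-of-file as part of each write(), so several appenders
 * do not overwrite each other.
 * Returns 0 on success, -1 on an error or a short write. */
int append_line(const char *path, const char *msg)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1)
        return -1;

    size_t len = strlen(msg);
    ssize_t n = write(fd, msg, len);
    int ok = (n == (ssize_t)len);   /* simplification: no retry on short write */

    if (close(fd) == -1)
        ok = 0;
    return ok ? 0 : -1;
}
```

Note that a successful return only means the data reached a kernel buffer; because of deferred writeback, it may not be on disk yet.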

Because write() defers physical disk updates for performance, strict data integrity guarantees require explicit synchronization.

Synchronized and Direct I/O

To circumvent or flush the deferred writeback mechanism, the system provides synchronization interfaces.

  • fsync(): Flushes all dirty data and metadata associated with a file descriptor to the physical disk. Blocks until the hardware confirms completion.
  • fdatasync(): Flushes the file’s data along with only the metadata required to read it back later (e.g., file size). Non-essential metadata such as modification timestamps is skipped, which avoids unnecessary disk seeks.
  • sync(): Commits all global system buffers to disk.
  • O_SYNC Flag: Specified during open(), it forces an implicit synchronization after every write operation, significantly degrading throughput by incurring all I/O wait times directly.
  • Direct I/O (O_DIRECT): Bypasses the kernel page cache entirely. I/O routes directly between user-space buffers and the device. Requires request lengths, file offsets, and buffer alignments to be strict integer multiples of the device sector size.
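
A minimal sketch of an explicitly synchronized write follows. The helper name save_durably() and the 0644 permissions are hypothetical; swapping fsync() for fdatasync() would be the lighter-weight variant described above.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write a buffer to path and do not return success
 * until the kernel reports the data has been flushed to storage.
 * Returns 0 on success, -1 on failure. */
int save_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    ssize_t n = write(fd, buf, len);    /* lands in a kernel buffer first */
    if (n == -1 || (size_t)n != len) {
        close(fd);
        return -1;
    }

    if (fsync(fd) == -1) {              /* blocks until the flush completes */
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Calling fsync() once after a batch of writes is usually cheaper than opening with O_SYNC, which pays the synchronization cost on every write.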

After I/O and synchronization operations conclude, the descriptor and its associated resources must be retired.

Closing Files

The close() system call terminates the mapping between a file descriptor and its underlying file.

  • Syntax: int close(int fd);.
  • Memory Management: When the last descriptor referring to a file is closed, the kernel frees its internal structure representing the open file, which in turn unpins the in-memory inode.
  • Unlinking: If a file was deleted (unlinked) while still open, physical deletion from the disk is deferred until this final close() executes.
  • Error Handling: Checking the return value is critical, as deferred I/O errors (such as EIO) may only manifest during the close operation.
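
The deferred-deletion and error-checking points can be sketched together. Both helper names, open_unlinked_scratch() and close_checked(), are hypothetical, and reporting through perror() is just one possible policy.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: create a scratch file and unlink it right away.
 * The name disappears from the directory, but the data stays usable
 * through the descriptor until the final close().
 * Returns the descriptor, or -1 on error. */
int open_unlinked_scratch(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;
    if (unlink(path) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: close a descriptor and report any error,
 * including a deferred I/O error such as EIO.
 * Returns 0 on success, -1 on failure. */
int close_checked(int fd)
{
    if (close(fd) == -1) {
        perror("close");
        return -1;
    }
    return 0;
}
```

After open_unlinked_scratch() returns, reads and writes on the descriptor still work even though the path no longer exists; the storage is released when close_checked() closes it.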

While active, file descriptors maintain a current position that dictates where linear reads and writes occur; however, non-linear access requires manual offset manipulation.

File Positioning and Seeking

Applications requiring random access manipulate the file position pointer without transferring data.

  • lseek(): Sets the file position.
    • Syntax: off_t lseek(int fd, off_t pos, int origin);.
    • SEEK_SET: Sets offset to pos.
    • SEEK_CUR: Sets offset to current position plus pos.
    • SEEK_END: Sets offset to file length plus pos.
  • Sparse Files: Seeking beyond the end-of-file and writing creates a “hole”. The intervening space reads as zeros but consumes no physical disk blocks.
  • Positional I/O (pread() / pwrite()):
    • Performs I/O at a caller-supplied offset without using or modifying the descriptor’s current file position.
    • Eliminates race conditions in multithreaded programs, where threads share a single file table and one thread’s lseek() could otherwise move the position between another thread’s lseek() and its subsequent read or write.
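
The following sketch combines a sparse-file hole with positional reads. The helper names create_sparse() and byte_at() are hypothetical; whether the hole really consumes no disk blocks depends on the filesystem, which this example does not check.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an empty file, seek gap bytes past its end,
 * and write a single byte 'X' there, leaving a hole of gap bytes.
 * Returns a read/write descriptor positioned after the 'X', or -1. */
int create_sparse(const char *path, off_t gap)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    /* The file is empty, so SEEK_END plus gap lands gap bytes past EOF. */
    if (lseek(fd, gap, SEEK_END) == (off_t)-1 || write(fd, "X", 1) != 1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: fetch one byte at an absolute offset with pread(),
 * leaving the descriptor's file position untouched.
 * Returns the byte value (0..255), or -1 at EOF or on error. */
int byte_at(int fd, off_t offset)
{
    unsigned char c;

    return pread(fd, &c, 1, offset) == 1 ? (int)c : -1;
}
```

For a gap of 4096, byte_at() would be expected to return 0 anywhere inside the hole and 'X' at offset 4096, while lseek(fd, 0, SEEK_CUR) keeps reporting 4097 throughout.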

Changing the file offset allows writing beyond the file’s end to expand it, but explicit functions exist to modify the overall file length directly.

Truncating Files

File truncation alters the file size to a specific byte length.

  • Syntax: int ftruncate(int fd, off_t len); and int truncate(const char *path, off_t len);.
  • Shrinking: Discards all data residing beyond len bytes.
  • Expanding: If len is larger than the file, the extended space is padded with zeros.
  • Position: Does not alter the current file position.
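
A small sketch of shrinking and then growing a file is shown below. The helpers open_ten_bytes() and resize() are hypothetical and exist only to make the behavior observable.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a scratch file holding the 10 bytes
 * "0123456789". Returns a read/write descriptor whose file position
 * is 10 (just past the data), or -1 on error. */
int open_ten_bytes(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;
    if (write(fd, "0123456789", 10) != 10) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: set the file length with ftruncate(), then report
 * the resulting size as seen by fstat(). Returns -1 on error. */
off_t resize(int fd, off_t len)
{
    struct stat st;

    if (ftruncate(fd, len) == -1 || fstat(fd, &st) == -1)
        return -1;
    return st.st_size;
}
```

Calling resize(fd, 4) and then resize(fd, 8) on such a file should leave the first four bytes intact, fill bytes 4 through 7 with zeros, and leave the file position at 10 the whole time.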

Managing single files is straightforward, but juggling I/O across numerous descriptors concurrently necessitates specialized event-driven structures.

Multiplexed I/O

Multiplexing solves the inefficiency of blocking on a single file descriptor when multiple descriptors are active. It allows a process to monitor many descriptors concurrently and awaken only when at least one of them is ready for I/O that will not block.

The select() System Call

  • Structure: Uses three fixed-size fd_set bitmasks to monitor read, write, and exception events.
  • Operation: Blocks until an event triggers or a timeval timeout (microsecond precision) expires.
  • Drawbacks:
    • Requires the caller to compute the highest-valued file descriptor in any set, add one, and pass the result as the first argument.
    • Destructively modifies the input fd_set masks, requiring reinitialization on every loop iteration.
    • Limited mechanically by FD_SETSIZE (typically 1,024).
  • pselect(): Modern variant using timespec (nanosecond precision) and accepting a sigmask to prevent race conditions with signal handlers.
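
A minimal select() sketch for a single descriptor follows. The helper name wait_readable_select() is hypothetical, and it assumes a non-negative timeout in milliseconds.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms (assumed >= 0) for fd to
 * become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error or unusable fd. */
int wait_readable_select(int fd, int timeout_ms)
{
    fd_set readfds;
    struct timeval tv;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;                  /* fd_set cannot represent this value */

    /* select() overwrites the sets it is given (and, on Linux, the
     * timeout), so both are rebuilt from scratch on every call. */
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* First argument: highest-valued descriptor in any set, plus one. */
    int ready = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (ready == -1)
        return -1;
    return (ready > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
}
```

The FD_SETSIZE check at the top reflects the mechanical limit noted above: a descriptor numbered beyond it cannot be placed in an fd_set at all.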

The poll() System Call

  • Structure: Uses a dynamic array of pollfd structures, specifying the target fd, requested events, and returned revents.
  • Operation: Blocks until an event triggers or a millisecond timeout expires.
  • Advantages over select():
    • Eliminates the need to track the highest-valued file descriptor.
    • Scales efficiently for large-valued descriptors by eschewing sparse bitmasks.
    • Separates input (events) from output (revents), meaning the structure array can be reused in loops without reinitialization.
  • ppoll(): Linux-specific variant incorporating a timespec timeout and signal mask similar to pselect().
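
For comparison, here is a rough poll() sketch that watches several descriptors at once. The helper name read_first_ready(), the MAX_WATCHED cap, and the return-code convention are all assumptions of this example.

```c
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_WATCHED 16  /* arbitrary cap chosen for this sketch */

/* Hypothetical helper: wait up to timeout_ms for any of count descriptors
 * to become readable, then read up to len bytes from the first one that is.
 * Returns the index of the descriptor that was read (byte count in *nread),
 * -1 if none became readable in time, or -2 on error. */
int read_first_ready(const int *fds, int count, int timeout_ms,
                     void *buf, size_t len, ssize_t *nread)
{
    struct pollfd pfds[MAX_WATCHED];

    if (count < 1 || count > MAX_WATCHED)
        return -2;

    /* No highest-descriptor bookkeeping: just list what to watch. */
    for (int i = 0; i < count; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;    /* input: what we are interested in */
        pfds[i].revents = 0;        /* output: filled in by the kernel */
    }

    int ready = poll(pfds, (nfds_t)count, timeout_ms);
    if (ready == -1)
        return -2;

    for (int i = 0; i < count && ready > 0; i++) {
        if (pfds[i].revents & POLLIN) {
            *nread = read(fds[i], buf, len);
            return (*nread == -1) ? -2 : i;
        }
    }
    return -1;  /* timeout, or only error/hangup conditions were reported */
}
```

In a long-running loop the pollfd array could be built once and reused across calls, since poll() writes only to revents.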

All these user-facing I/O interfaces are ultimately mediated by the kernel’s internal storage and caching mechanisms.

Kernel Internals

The Linux kernel implements file I/O utilizing three primary subsystems to ensure hardware independence and optimize performance.

  • Virtual Filesystem (VFS):
    • An abstraction layer defining a common file model (inodes, superblocks) that standardizes behavior across all mounted filesystems.
    • A generic read() maps through the VFS to invoke the specific function pointer associated with the backing filesystem.
  • Page Cache:
    • An in-memory store leveraging temporal locality to cache disk data.
    • Dynamically consumes free RAM to cache files, seamlessly shrinking when the system faces memory pressure.
  • Readahead:
    • Exploits sequential locality by speculatively fetching subsequent contiguous data blocks into the page cache ahead of explicit user requests.
    • The kernel dynamically expands or disables the readahead window based on the process’s access patterns.
  • Page Writeback:
    • Deferred writes live in memory as dirty buffer_head structures within the page cache.
    • Flusher threads awaken to commit data to disk when free memory dips below thresholds or buffers age past dirty_expire_centisecs.