File I/O
File Descriptors
File descriptors act as unique identifiers that map to open files and their associated metadata, such as file positions and access modes.
- Representation: A file descriptor is an integer of the C int type.
- Capacity: Descriptors range from 0 up to one less than the per-process maximum limit (default 1,024, configurable up to 1,048,576).
- Standard Descriptors: By convention, every process starts with three open descriptors:
  - 0: Standard Input (STDIN_FILENO)
  - 1: Standard Output (STDOUT_FILENO)
  - 2: Standard Error (STDERR_FILENO)
- Scope: File tables and descriptors are tracked on a per-process basis, though child processes receive a copy of their parent’s file table upon creation.
- Universality: Descriptors provide access not just to regular files, but to device files, pipes, futexes, FIFOs, and sockets.
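As a rough illustration of the points above, the following sketch queries the per-process descriptor limit and prints the three conventional descriptors. The helper names max_open_files() and print_fd_summary() are hypothetical, chosen only for this example; getrlimit() with RLIMIT_NOFILE is the standard POSIX way to read the limit.

```c
#include <limits.h>
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: returns the soft per-process limit on open
 * descriptors, or -1 on error. Valid descriptors then range from 0
 * to (limit - 1). */
long max_open_files(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) == -1)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return LONG_MAX;        /* no soft limit configured */
    return (long)rl.rlim_cur;
}

/* Hypothetical helper: prints the conventional descriptors and the limit. */
void print_fd_summary(void)
{
    printf("stdin=%d stdout=%d stderr=%d\n",
           STDIN_FILENO, STDOUT_FILENO, STDERR_FILENO);
    printf("soft limit on open files: %ld\n", max_open_files());
}
```

Calling print_fd_summary() from main() is enough to see the values on a given system; the reported limit depends on the environment.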
To utilize file descriptors for data access, a target file must first be opened or created to generate a valid descriptor mapping.
Opening Files
The open() and creat() system calls initialize access to a file and return a file descriptor.
- Syntax: int open(const char *name, int flags, mode_t mode);
- Access Modes: The flags argument requires exactly one of O_RDONLY (read-only), O_WRONLY (write-only), or O_RDWR (read/write).
- Behavioral Flags: Additional options bitwise-ORed into the flags argument alter file behavior:
  - O_APPEND: Moves the file position to the end-of-file before each write.
  - O_CREAT: Instructs the kernel to create the file if it does not exist.
  - O_TRUNC: Truncates an existing regular file to zero length, provided the flags allow writing.
  - O_SYNC: Forces synchronous I/O, delaying write completion until data hits the physical disk.
  - O_DIRECT: Bypasses the kernel page cache for direct hardware I/O.
- File Creation & Permissions: If O_CREAT is passed, the mode argument dictates the permissions of the new file.
  - Permissions are bitwise-ANDed against the complement of the process’s umask.
  - Constants such as S_IRWXU (owner read/write/execute) define the mode portably.
- creat() Alias: The function creat(name, mode) is a historic shortcut equivalent to calling open(name, O_WRONLY | O_CREAT | O_TRUNC, mode).
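The interaction between mode and umask can be sketched as follows. The helpers create_for_writing() and permission_bits() are hypothetical names for this example; the requested mode (rw-r--r--) is an arbitrary choice.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: creates (or truncates) `path` for writing and
 * requests rw-r--r-- (0644). The kernel applies the umask, so the
 * resulting mode of a new file is 0644 & ~umask.
 * Returns a descriptor, or -1 with errno set. */
int create_for_writing(const char *path)
{
    mode_t mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH;

    return open(path, O_WRONLY | O_CREAT | O_TRUNC, mode);
}

/* Hypothetical helper: returns the permission bits of `path`,
 * or (mode_t)-1 on error. */
mode_t permission_bits(const char *path)
{
    struct stat st;

    if (stat(path, &st) == -1)
        return (mode_t)-1;
    return st.st_mode & 0777;
}
```

With a umask of 027, for example, a file newly created this way ends up with mode 0640, because the group-write and all "other" bits are masked off.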
Once a file descriptor is obtained via an open call, it serves as the handle for extracting data through read operations.
Reading Data
The read() system call extracts bytes from a file into a user-space buffer.
- Syntax: ssize_t read(int fd, void *buf, size_t len);
- Execution: Reads up to len bytes into memory at buf from the current file position, then advances the file position by the number of bytes read.
- Return States:
  - Success: Returns the number of bytes read. A partial read (return value greater than 0 but less than len) is legal and common.
  - EOF: Returns 0, indicating that the file position is at the end of the file and no bytes remain to be read.
  - Blocking: If no data is available (e.g., on a socket), the call sleeps until bytes arrive.
  - Interrupted: Returns -1 and sets errno to EINTR if interrupted by a signal before any data is read. The call should be reissued.
  - Nonblocking: If the file was opened with O_NONBLOCK and no data is present, returns -1 with errno set to EAGAIN.
- Loop Handling: Because partial reads are standard behavior, robust programs must wrap read() in a loop to ensure all requested bytes are retrieved.
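One way to write such a loop is sketched below. The function name read_full() is hypothetical; the sketch assumes a blocking descriptor and leaves EAGAIN (nonblocking mode) to the caller as an ordinary error.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: reads up to `len` bytes from `fd` into `buf`,
 * continuing after partial reads and reissuing the call on EINTR.
 * Returns the number of bytes read (less than `len` only if EOF was
 * reached first), or -1 on any other error. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t total = 0;

    while (total < len) {
        ssize_t n = read(fd, p + total, len - total);

        if (n == 0)
            break;                  /* EOF: nothing more to read */
        if (n == -1) {
            if (errno == EINTR)
                continue;           /* interrupted before any data: retry */
            return -1;              /* real error (including EAGAIN) */
        }
        total += (size_t)n;         /* partial read: keep going */
    }
    return (ssize_t)total;
}
```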
Reading extracts data from the descriptor, while the inverse operation pushes data into the system’s buffers.
Writing Data
The write() system call transfers data from a user-space buffer into a file.
- Syntax: ssize_t write(int fd, const void *buf, size_t count);
- Execution: Writes up to count bytes from buf at the current file position, updating the position accordingly.
- Append Mode Safety: When multiple processes append to the same file (e.g., logging), opening it with O_APPEND makes the kernel move the position to the end-of-file atomically before each write, avoiding the race in which one process overwrites another's data.
- Deferred Writes (Writeback): write() typically returns immediately after copying data into a kernel buffer.
  - Buffers with newer data than the disk are marked “dirty”.
  - The kernel later asynchronously coalesces and flushes these dirty buffers to disk to optimize hardware performance.
  - Deferred writes risk data loss on power failure and make physical I/O errors hard to report back to the process that issued the write.
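The append-mode behavior can be illustrated with a small logging helper. append_line() is a hypothetical name, and for brevity the sketch treats a short write as a failure rather than retrying it.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: appends one line to the file at `path`.
 * Because the file is opened with O_APPEND, the kernel positions each
 * write at the current end-of-file, even though a freshly opened
 * descriptor starts at offset 0.
 * Returns 0 on success, -1 on error. */
int append_line(const char *path, const char *line)
{
    size_t len = strlen(line);
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);

    if (fd == -1)
        return -1;

    /* Simplification: a short write is reported as an error. */
    int ok = write(fd, line, len) == (ssize_t)len;

    if (close(fd) == -1)
        ok = 0;
    return ok ? 0 : -1;
}
```

Two successive calls such as append_line(path, "first\n") and append_line(path, "second\n") leave both lines in the file; without O_APPEND, the second call would write at offset 0 and overwrite the first.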
Because write() defers physical disk updates for performance, strict data integrity guarantees require explicit synchronization.
Synchronized and Direct I/O
To circumvent or flush the deferred writeback mechanism, the system provides synchronization interfaces.
- fsync(): Flushes all dirty data and metadata associated with a file descriptor to the physical disk. Blocks until the hardware confirms completion.
- fdatasync(): Flushes data but only essential metadata (e.g., file size) required for future access, omitting non-essential metadata like modification timestamps to avoid unnecessary disk seeks.
- sync(): Commits all global system buffers to disk.
- O_SYNC Flag: Specified during open(), it forces an implicit synchronization after every write operation, significantly degrading throughput by incurring all I/O wait times directly.
- Direct I/O (O_DIRECT): Bypasses the kernel page cache entirely. I/O routes directly between user-space buffers and the device. Requires request lengths, file offsets, and buffer alignments to be strict integer multiples of the device sector size.
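A minimal sketch of explicit synchronization, assuming the goal is simply "do not report success until the data has been flushed": write_durably() is a hypothetical helper that writes a buffer and then calls fsync() before closing. Substituting fdatasync() for fsync() would skip non-essential metadata.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: writes `len` bytes to `path` and flushes them
 * to disk before returning. Returns 0 on success, -1 on error. */
int write_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd == -1)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n == -1) {
            if (errno == EINTR)
                continue;           /* interrupted: reissue the write */
            goto fail;
        }
        p += n;                     /* partial write: advance and retry */
        len -= (size_t)n;
    }

    if (fsync(fd) == -1)            /* blocks until data and metadata are flushed */
        goto fail;
    return close(fd);

fail:
    close(fd);
    return -1;
}
```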
After I/O and synchronization operations conclude, the descriptor and its associated resources must be retired.
Closing Files
The close() system call terminates the mapping between a file descriptor and its underlying file.
- Syntax: int close(int fd);
- Memory Management: When the final open descriptor for a file is closed, the kernel frees the internal data structure representing the open file, which unpins the in-memory inode.
- Unlinking: If a file was deleted (unlinked) while still open, physical deletion from the disk is deferred until this final close() executes.
- Error Handling: Checking the return value is important, as deferred I/O errors (such as EIO) may only manifest during the close operation.
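The deferred-deletion behavior and the close() check can be sketched together. unlink_while_open() is a hypothetical helper written for this example.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: creates `path`, unlinks it while the descriptor
 * is still open, and confirms the data stays reachable through the
 * descriptor until close(). Returns 0 on success, -1 on error. */
int unlink_while_open(const char *path)
{
    char buf[4];
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd == -1)
        return -1;

    /* The name disappears now; the storage is released only at close(). */
    if (unlink(path) == -1) {
        close(fd);
        return -1;
    }

    if (write(fd, "data", 4) != 4 || pread(fd, buf, 4, 0) != 4) {
        close(fd);
        return -1;
    }

    /* Check close(): a deferred I/O error may surface only here. */
    if (close(fd) == -1) {
        perror("close");
        return -1;
    }
    return 0;
}
```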
While active, file descriptors maintain a current position that dictates where linear reads and writes occur; however, non-linear access requires manual offset manipulation.
File Positioning and Seeking
Applications requiring random access manipulate the file position pointer without transferring data.
- lseek(): Sets the file position.
  - Syntax: off_t lseek(int fd, off_t pos, int origin);
  - SEEK_SET: Sets offset to pos.
  - SEEK_CUR: Sets offset to current position plus pos.
  - SEEK_END: Sets offset to file length plus pos.
- Sparse Files: Seeking beyond the end-of-file and writing creates a “hole”. The intervening space reads as zeros but consumes no physical disk blocks.
- Positional I/O (pread()/pwrite()):
  - Performs I/O at a specific offset without using or modifying the descriptor's current file position.
  - Eliminates race conditions in multithreaded programs where threads share a single file table and could inadvertently interfere with each other's lseek() adjustments.
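A short sketch of both ideas, using the hypothetical helpers make_sparse() and byte_at(): the first seeks past the end of a new file and writes one byte, leaving a hole; the second reads a byte with pread() so the file position is left alone.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: creates `path`, seeks `hole` bytes past the
 * start, and writes a single 'X' there, leaving a hole before it.
 * Returns the open descriptor, or -1 on error. */
int make_sparse(const char *path, off_t hole)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd == -1)
        return -1;
    if (lseek(fd, hole, SEEK_SET) == (off_t)-1 || write(fd, "X", 1) != 1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: returns the byte at `offset` (0..255) without
 * moving the file position, or -1 on error or EOF. */
int byte_at(int fd, off_t offset)
{
    unsigned char c;

    return pread(fd, &c, 1, offset) == 1 ? c : -1;
}
```

After make_sparse(path, 4096), the file is 4,097 bytes long, byte_at(fd, 100) reads a zero from the hole, and the position reported by lseek(fd, 0, SEEK_CUR) is the same before and after the pread() calls.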
Changing the file offset allows writing beyond the file’s end to expand it, but explicit functions exist to modify the overall file length directly.
Truncating Files
File truncation alters the file size to a specific byte length.
- Syntax: int ftruncate(int fd, off_t len); and int truncate(const char *path, off_t len);
- Shrinking: Discards all data residing beyond len bytes.
- Expanding: If len is larger than the file, the extended space is padded with zeros.
- Position: Does not alter the current file position.
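The three properties listed above can be checked with a small self-verifying sketch. truncate_demo() and file_size() are hypothetical helpers; the sizes used (10, 4, and 8 bytes) are arbitrary.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns the size of the file behind `fd`,
 * or -1 on error. */
off_t file_size(int fd)
{
    struct stat st;

    return fstat(fd, &st) == -1 ? (off_t)-1 : st.st_size;
}

/* Hypothetical helper: writes 10 bytes, shrinks the file to 4, then
 * grows it to 8. Returns 0 if every step behaved as described. */
int truncate_demo(const char *path)
{
    char tail[4];
    int ok = 0;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd == -1)
        return -1;

    if (write(fd, "0123456789", 10) != 10)
        goto out;
    if (ftruncate(fd, 4) == -1 || file_size(fd) != 4)   /* shrink */
        goto out;
    if (lseek(fd, 0, SEEK_CUR) != 10)                   /* position unchanged */
        goto out;
    if (ftruncate(fd, 8) == -1 || file_size(fd) != 8)   /* expand */
        goto out;
    if (pread(fd, tail, 4, 4) != 4)                     /* read the new tail */
        goto out;
    ok = tail[0] == 0 && tail[1] == 0 && tail[2] == 0 && tail[3] == 0;

out:
    close(fd);
    return ok ? 0 : -1;
}
```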
Managing single files is straightforward, but juggling I/O across numerous descriptors concurrently necessitates specialized event-driven structures.
Multiplexed I/O
Multiplexing solves the inefficiency of blocking on a single file descriptor while other descriptors are ready. It allows a process to monitor many descriptors concurrently and wake only when at least one is ready for I/O without blocking.
The select() System Call
- Structure: Uses three fixed-size fd_set bitmasks to monitor read, write, and exception events.
- Operation: Blocks until an event triggers or a timeval timeout (microsecond precision) expires.
- Drawbacks:
  - Requires the caller to compute and pass the highest-valued file descriptor plus one (the first argument to select()).
  - Destructively modifies the input fd_set masks, requiring reinitialization on every loop iteration.
  - Limited to descriptors below FD_SETSIZE (typically 1,024).
- pselect(): Modern variant using timespec (nanosecond precision) and accepting a sigmask to prevent race conditions with signal handlers.
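A minimal sketch of a select() call on a single descriptor follows. wait_readable_select() is a hypothetical helper; it rebuilds the fd_set and timeout on every call, which sidesteps the destructive-modification issue noted above.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: waits until `fd` is readable or `timeout_ms`
 * milliseconds pass. Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable_select(int fd, int timeout_ms)
{
    fd_set readfds;
    struct timeval tv;
    int ready;

    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;                      /* fd_set cannot represent it */

    FD_ZERO(&readfds);                  /* rebuilt on every call */
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* First argument: highest-valued descriptor plus one. */
    ready = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (ready == -1)
        return -1;
    return (ready > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
}
```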
The poll() System Call
- Structure: Uses a dynamic array of pollfd structures, specifying the target fd, requested events, and returned revents.
- Operation: Blocks until an event triggers or a millisecond timeout expires.
- Advantages over select():
  - Eliminates the need to track the highest-valued file descriptor.
  - Scales efficiently for large-valued descriptors by eschewing sparse bitmasks.
  - Separates input (events) from output (revents), meaning the structure array can be reused in loops without reinitialization.
- ppoll(): Linux-specific variant incorporating a timespec timeout and signal mask similar to pselect().
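For comparison, the same kind of wait expressed with poll() on two descriptors might look like this. first_readable() is a hypothetical helper; it only looks for POLLIN and treats every other outcome as "nothing readable".

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: waits up to `timeout_ms` milliseconds for either
 * descriptor to become readable. Returns 0 if `fd_a` is readable, 1 if
 * only `fd_b` is, and -1 on timeout or error. */
int first_readable(int fd_a, int fd_b, int timeout_ms)
{
    struct pollfd fds[2] = {
        { .fd = fd_a, .events = POLLIN },
        { .fd = fd_b, .events = POLLIN },
    };
    int ready = poll(fds, 2, timeout_ms);

    if (ready <= 0)
        return -1;                  /* 0 = timeout, -1 = error */

    for (int i = 0; i < 2; i++) {
        if (fds[i].revents & POLLIN)
            return i;               /* results arrive in revents */
    }
    return -1;
}
```

No highest-descriptor bookkeeping is needed, and because the kernel writes results to revents only, the same array could be passed to poll() again unchanged.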
All these user-facing I/O interfaces are ultimately mediated by the kernel’s internal storage and caching mechanisms.
Kernel Internals
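The Linux kernel implements file I/O using three primary subsystems (the virtual filesystem, the page cache, and page writeback), with readahead acting as an optimization on top of the page cache, to ensure hardware independence and optimize performance.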
- Virtual Filesystem (VFS):
  - An abstraction layer defining a common file model (inodes, superblocks) that standardizes behavior across all mounted filesystems.
  - A generic read() maps through the VFS to invoke the specific function pointer associated with the backing filesystem.
- Page Cache:
  - An in-memory store leveraging temporal locality to cache disk data.
  - Dynamically consumes free RAM to cache files, seamlessly shrinking when the system faces memory pressure.
- Readahead:
  - Exploits sequential locality by speculatively fetching subsequent contiguous data blocks into the page cache ahead of explicit user requests.
  - The kernel dynamically expands or disables the readahead window based on the process’s access patterns.
- Page Writeback:
  - Deferred writes live in memory as dirty buffer_head structures within the page cache.
  - Flusher threads awaken to commit data to disk when free memory dips below thresholds or buffers age past dirty_expire_centisecs.
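The function-pointer dispatch described for the VFS can be imitated in user space with a toy model. Nothing below is kernel code: the toy_file and toy_file_ops types, the two made-up "filesystems" (memfs and zerofs), and toy_vfs_read() are all hypothetical and exist only to show how one generic entry point can reach different implementations.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

struct toy_file;

/* Per-filesystem operations table (toy stand-in for the real thing). */
struct toy_file_ops {
    ssize_t (*read)(struct toy_file *f, char *buf, size_t len);
};

/* An open file: its operations table plus backing state. */
struct toy_file {
    const struct toy_file_ops *ops;
    const char *data;
    size_t size;
    size_t pos;
};

/* "memfs": serves bytes from an in-memory string and advances pos. */
static ssize_t memfs_read(struct toy_file *f, char *buf, size_t len)
{
    size_t left = f->size - f->pos;

    if (len > left)
        len = left;
    memcpy(buf, f->data + f->pos, len);
    f->pos += len;
    return (ssize_t)len;
}

/* "zerofs": every read returns zero bytes' worth of zeros, never EOF. */
static ssize_t zerofs_read(struct toy_file *f, char *buf, size_t len)
{
    (void)f;
    memset(buf, 0, len);
    return (ssize_t)len;
}

static const struct toy_file_ops memfs_ops = { .read = memfs_read };
static const struct toy_file_ops zerofs_ops = { .read = zerofs_read };

/* Generic entry point: the caller never names a filesystem; the call
 * is routed through the function pointer attached to the file. */
ssize_t toy_vfs_read(struct toy_file *f, char *buf, size_t len)
{
    return f->ops->read(f, buf, len);
}
```

A caller would build a toy_file with either operations table and then use toy_vfs_read() for both, which is the essence of the common file model: identical calls, different backends.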