I. Files, Directories, and the File System API
- Files: A linear array of bytes persistently stored and uniquely identified by a low-level name, historically termed an inode number.
- Directories: A specific file type containing an array of mappings from human-readable string names to low-level inode numbers.
- Directories are arranged in a hierarchical tree originating at the root (/). Every directory inherently contains entries for itself (.) and its parent (..).
- Core Access Methods:
- open(): Initializes file access, taking flags such as O_CREAT (create if non-existent), O_WRONLY (write-only), and O_TRUNC (truncate to zero bytes). Returns a file descriptor serving as an opaque handle/capability for subsequent operations.
- read()/write(): Access file contents sequentially. The OS tracks the current offset per file descriptor, implicitly incrementing it by the bytes processed.
- lseek(): Modifies the tracked offset for random access, avoiding sequential processing. Does not initiate mechanical disk seeks itself.
- fsync(): Bypasses memory buffers to force unwritten (dirty) file data and metadata directly to persistent storage, guaranteeing durability for applications like databases.
- rename(): Atomically alters a file’s name. Exploited to safely update file contents by writing to a temporary file, fsync()ing it, and rename()ing it over the original.
- stat(): Retrieves file metadata (e.g., size, permissions, modification times, inode number, link count) populated from the underlying inode.
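The rename()-based safe-update pattern above can be sketched in Python; the helper name atomic_update and the use of tempfile are illustrative, not part of any standard API:

```python
import os
import tempfile

def atomic_update(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers never see a partial file."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)  # temp file on the same file system
    try:
        os.write(fd, data)
        os.fsync(fd)          # force dirty data to persistent storage first
    finally:
        os.close(fd)
    os.rename(tmp, path)      # atomic: old or new contents, never a mix
```

The temporary file must live in the same directory (hence the same file system) as the target, because rename() is only atomic within one file system.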
- File Deletion and Links:
- Hard Links (link()): Maps a new human-readable name to an existing inode number.
- Deletion (unlink()): Removes the name-to-inode mapping and decrements the inode’s reference (link) count. Storage blocks are only freed when the reference count reaches zero.
- Symbolic/Soft Links: An independent file containing the pathname of the target file. Removing the target leaves a dangling reference.
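Hard links and reference counting can be observed directly through os.link and the st_nlink field reported by stat; the file names below are arbitrary:

```python
import os
import tempfile

d = tempfile.mkdtemp()
orig = os.path.join(d, "file")
alias = os.path.join(d, "file2")

with open(orig, "w") as f:
    f.write("hello")
os.link(orig, alias)                 # hard link: a second name, same inode

st = os.stat(orig)
assert st.st_ino == os.stat(alias).st_ino  # both names map to one inode
assert st.st_nlink == 2                    # link count incremented

os.unlink(orig)                      # removes only the name-to-inode mapping
assert os.stat(alias).st_nlink == 1  # data blocks survive until count hits 0
```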
- Assembly:
mkfs establishes an empty file system structure on a partition. mount() unifies multiple isolated file systems into a single directory tree at a designated mount point.
The abstraction of files and directories necessitates specialized on-disk data structures to map user-level names to physical storage blocks.
II. Very Simple File System (VSFS) Implementation
- On-Disk Organization: The disk is divided into uniform fixed-size blocks (e.g., 4 KB).
- Data Region: The vast majority of blocks, allocated for user data.
- Inode Table: An array of index nodes (inodes) storing metadata and tracking data block locations. Given an inode number (inumber), its physical byte address is calculated as inodeStartAddr + (inumber × sizeof(inode)); dividing by the block or sector size yields the block or sector to read.
- Allocation Bitmaps: Separate data and inode bitmaps track free (0) and allocated (1) states.
- Superblock: Stores global file system metadata, including total inode/data block counts, structural offsets, and a magic number identifying the file system type.
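The inode-location arithmetic can be made concrete with a small sketch; the inode size, sector size, and table offset below are assumed example values, not mandated by VSFS:

```python
BLOCK_SIZE = 4096             # VSFS block size from the text
SECTOR_SIZE = 512
INODE_SIZE = 256              # assumed inode size in bytes
INODE_START = 3 * BLOCK_SIZE  # assumed: table follows superblock + two bitmaps

def inode_location(inumber: int):
    """Return (byte address, table block, disk sector) for an inode."""
    offset = inumber * INODE_SIZE   # byte offset within the inode table
    addr = INODE_START + offset     # absolute byte address on disk
    block = offset // BLOCK_SIZE    # which block of the table holds it
    sector = addr // SECTOR_SIZE    # sector to issue the read against
    return addr, block, sector

print(inode_location(32))   # (20480, 2, 40) with these example parameters
```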
- The Multi-Level Index: Inodes utilize an imbalanced tree structure optimized for the empirical reality that most files are small.
- Direct pointers map straight to data blocks (e.g., 12 pointers addressing 48 KB).
- Indirect, double-indirect, and triple-indirect pointers point to blocks containing arrays of further pointers, scaling to support massive file sizes.
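A back-of-the-envelope calculation of the maximum file size reachable by such a multi-level index, assuming 4 KB blocks and 4-byte pointers:

```python
BLOCK = 4096              # bytes per block
PTR = 4                   # assumed bytes per disk pointer
PER_BLOCK = BLOCK // PTR  # 1024 pointers fit in one indirect block

direct = 12               # direct pointers -> 12 * 4 KB = 48 KB
reachable_blocks = direct + PER_BLOCK + PER_BLOCK**2 + PER_BLOCK**3
max_bytes = reachable_blocks * BLOCK

print(direct * BLOCK // 1024)        # 48 (KB reachable via direct pointers)
print(round(max_bytes / 2**40, 2))   # 4.0 (~4 TB with triple-indirect level)
```

The tree is deliberately imbalanced: small files are served entirely by the cheap direct pointers, while the indirect levels are touched only as files grow.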
- Access Paths:
- Reading: Demands extensive path traversal. Opening /foo/bar requires fetching the root inode, root directory data, foo's inode, foo's directory data, and finally bar's inode.
- Writing/Creation: Highly I/O intensive. Allocating a new file requires reading/writing the inode bitmap, the new inode, directory data, the parent directory inode, and the data bitmap.
- Caching and Buffering:
- Reads are filtered through a dynamic page cache (unified with virtual memory) to bypass repetitive disk I/O.
- Writes are buffered in memory to enable batching, scheduling optimization, and absorption of short-lived file deletions.
Caching mitigates read overheads, but unoptimized block allocation leads to severe performance degradation due to physical disk mechanics.
III. Locality and the Fast File System (FFS)
- The Fragmentation Problem: Early file systems allocated blocks randomly across the disk, reducing logically sequential accesses to physically random, seek-heavy patterns; measured systems delivered only 2% of peak disk bandwidth.
- Cylinder/Block Groups: FFS subdivides the disk into groups of consecutive tracks or blocks. Each group replicates the superblock and contains localized bitmaps, inodes, and data blocks.
- Placement Heuristics: FFS leverages groups to maintain spatial locality.
- Directories are placed in groups with low directory counts and high free inode counts.
- File data blocks are allocated in the same group as the file’s inode.
- Files residing in the same directory are grouped together based on the observation of strong namespace locality.
- The Large-File Exception: A massive file could consume an entire group, breaking locality for subsequent files. FFS chunks large files across multiple groups (e.g., redirecting indirect blocks).
- Amortization: Chunk sizes must be large enough that data transfer time eclipses seek/positioning time. To achieve an effective bandwidth equal to a fraction F of the peak transfer rate R_peak, given positioning time T_position, the required transfer size is (F / (1 − F)) × R_peak × T_position.
- Sub-blocks: 4 KB blocks transfer efficiently but waste space (internal fragmentation) for tiny files. FFS dynamically allocates 512-byte sub-blocks for small files, assembling them into a 4 KB block once the file grows.
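The amortization formula can be checked numerically; the 40 MB/s peak rate and 10 ms positioning time below are illustrative figures, not measurements from any particular drive:

```python
MB = 1024 * 1024

def chunk_size(fraction: float, peak_rate: float, t_position: float) -> float:
    # Required transfer size: (F / (1 - F)) * R_peak * T_position
    return (fraction / (1.0 - fraction)) * peak_rate * t_position

# Illustrative figures: 40 MB/s peak transfer rate, 10 ms positioning time.
print(chunk_size(0.5, 40 * MB, 0.010) / MB)   # ~0.4 MB for half of peak
print(chunk_size(0.9, 40 * MB, 0.010) / MB)   # ~3.6 MB for 90% of peak
```

Note how sharply the cost rises with F: asking for 90% of peak requires a nine-fold larger chunk than asking for 50%, since the F/(1 − F) factor diverges as F approaches 1.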
While cylinder groups optimize spatial locality during standard operations, power loss during multi-block metadata updates threatens file system structural integrity.
IV. Crash Consistency: FSCK and Journaling
- The Crash Consistency Problem: Single logical operations (e.g., appending data) require discrete disk writes to the inode, bitmap, and data block. Crashes between writes yield inconsistencies like space leaks (bitmap allocated, no inode pointer) or pointing to garbage data (inode updated, data block unwritten).
- File System Checker (FSCK): A post-crash repair tool that scans the entire drive to rectify inconsistencies. It verifies superblocks, regenerates bitmaps from inodes, and corrects link counts. FSCK is prohibitively slow for modern disk capacities.
- Write-Ahead Logging (Journaling): A transaction is recorded sequentially to an on-disk log before updating the main file system structures in-place.
- Data Journaling: Logs the Transaction Begin (TxB), physical metadata, data, and Transaction End (TxE). TxE is written only after prior blocks complete to prevent replaying a partially written transaction over valid data. After TxE commits, updates are checkpointed to final locations.
- Metadata (Ordered) Journaling: To avoid halving write throughput by writing data twice, data blocks are written directly to their final location before logging the associated metadata. This prevents metadata from pointing to garbage data upon crash recovery.
- Block Reuse Constraints: If a directory is deleted, its block freed, and a new file allocated on that block, replaying the journal could overwrite the new file with old directory data. Revoke records are inserted into the journal upon deletion to instruct the recovery process to skip replaying those blocks.
- Circular Log: The journal is bounded; space is reclaimed periodically by updating a journal superblock to advance the log head past checkpointed transactions.
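The journal write/commit/checkpoint ordering can be sketched with an in-memory stand-in for a disk; FakeDisk and its flush_barrier method are purely illustrative (real systems enforce the ordering with write barriers or forced-unit-access requests):

```python
class FakeDisk:
    """Minimal in-memory stand-in for a disk (illustrative only)."""
    def __init__(self):
        self.blocks = {}   # address -> contents
        self.order = []    # observed write/barrier ordering
    def write(self, addr, items):
        for i, item in enumerate(items):
            self.blocks[addr + i] = item
            self.order.append(addr + i)
    def flush_barrier(self):
        self.order.append("BARRIER")   # no later write may pass this point

def journal_commit(disk, log_addr, tx_blocks, final_addrs):
    # 1. Journal write: TxB plus the logged blocks; wait for completion.
    disk.write(log_addr, ["TxB"] + tx_blocks)
    disk.flush_barrier()
    # 2. Journal commit: TxE is issued only after step 1 is durable, so
    #    recovery can never replay a partially written transaction.
    disk.write(log_addr + 1 + len(tx_blocks), ["TxE"])
    disk.flush_barrier()
    # 3. Checkpoint: copy the blocks to their final in-place locations.
    for blk, addr in zip(tx_blocks, final_addrs):
        disk.write(addr, [blk])

disk = FakeDisk()
journal_commit(disk, 100, ["inode", "bitmap", "data"], [5, 9, 42])
```

The barrier between the journal body and TxE is the crux: without it, the drive could reorder TxE ahead of the logged blocks, and recovery would replay garbage over valid structures.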
Journaling ensures structural consistency but still incurs high seek overheads for in-place overwrites, prompting designs that abandon in-place updates entirely.
V. Log-Structured File Systems (LFS)
- Motivation: Increasing system memory satisfies most read requests, leaving writes as the primary disk bottleneck. Sequential writes vastly outperform random writes.
- Segments and Sequential Writing: LFS transforms all updates into sequential operations by buffering data and metadata in memory, writing them as a single large sequential segment to an unused disk location. In-place overwrites are never performed.
- The Inode Map (Imap): Because inodes constantly shift locations when updated, a level of indirection is introduced. The imap dynamically translates fixed inode numbers to their latest physical disk addresses. Imap chunks are written alongside the data and inodes they describe.
- The Checkpoint Region (CR): A static disk location pointing to the most recent imap chunks. The CR is updated atomically (using dual alternating regions and timestamps) to tolerate crashes during CR updates.
- Recursive Update Avoidance: Because directories map names to static inode numbers rather than physical disk addresses, modifying a file does not trigger a cascading sequence of updates up the directory tree.
- Garbage Collection: Writing to new locations orphans old versions of blocks, leaving fragmented garbage. LFS employs a cleaner that reads old segments, isolates live blocks, compacts them into new segments, and frees the old segments for sequential writing.
- Liveness Tracking: Each segment includes a Segment Summary Block mapping each data block to its parent (inode number, offset). The cleaner queries the imap and inode to check whether the file’s current pointer matches the block’s physical address: if identical, the block is live; otherwise, it is garbage. Version numbers optimize this check.
- Crash Recovery: LFS reads the latest CR to locate the imap and rolls forward through subsequent log segments to recover transactions completed since the last checkpoint.
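The cleaner's liveness check reduces to a single comparison; the dictionary-based imap, inode table, and summary entry below are simplified in-memory stand-ins for the on-disk structures:

```python
# inode number -> inode's current disk address (the imap indirection)
imap = {7: 2000}
# inode address -> {file offset -> data block address}
inodes = {2000: {"ptrs": {0: 1500}}}

def block_is_live(block_addr, summary_entry):
    inum, offset = summary_entry      # (inode number, offset) from the summary
    inode = inodes[imap[inum]]        # imap lookup finds the latest inode copy
    return inode["ptrs"].get(offset) == block_addr

print(block_is_live(1500, (7, 0)))   # True: the file still points at this block
print(block_is_live(1400, (7, 0)))   # False: superseded copy -> garbage
```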
Sequential writing mechanisms maximize disk bandwidth, but physical storage hardware itself introduces silent data corruption that requires distinct protection mechanisms.
VI. Data Integrity and Protection
- Partial Failure Modes: Beyond absolute disk failure (fail-stop), modern disks suffer from Latent Sector Errors (LSEs, inaccessible blocks raising hard errors) and Block Corruption (silent faults returning bad data).
- Handling LSEs: Easily detectable by the disk drive itself via internal ECC. Recovered using standard RAID redundancy (mirroring or parity reconstruction).
- Detecting Corruption: Storage systems implement checksums to generate a compact functional summary of a block. Checksums (XOR, Fletcher, CRC) negotiate the trade-off between computational cost and collision resistance.
- Layout: Checksums are stored using custom formatting (e.g., 520-byte sectors) or packed into dedicated 512-byte sectors interspersed with data blocks.
- Verification: A block D is corrupted if the stored checksum C_s(D) does not match the newly computed checksum C_c(D).
- Misdirected Writes: Occur when firmware writes valid data to an incorrect disk sector. Addressed by injecting a physical identifier (disk ID and block offset) into the checksum block for verification upon read.
- Lost Writes: Occur when the disk acknowledges a write but fails to commit it, leaving older, validly-checksummed data in place. Mitigated via expensive read-after-write verification, or by storing data checksums redundantly within the file’s inode (e.g., ZFS).
- Scrubbing: Checksums are normally only evaluated upon access. Background scrubbing processes periodically scan all blocks to proactively detect and correct bit rot before redundant copies fail simultaneously.
- Overheads: Induces minimal space overhead (e.g., 0.19% for an 8-byte checksum per 4 KB block). Computational time overheads are mitigated by coalescing checksum calculation directly into mandatory memory-copying routines.
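A minimal sketch of checksum verification that also catches misdirected writes, folding the block's intended physical identity into the checksum input; the CRC32 choice and the 4-byte/8-byte identifier layout are assumptions for illustration, not any real file system's format:

```python
import zlib

def stamp(data: bytes, disk_id: int, block_no: int) -> int:
    # Checksum over the data plus its intended physical identity, so a
    # misdirected write (right data, wrong location) fails verification.
    ident = disk_id.to_bytes(4, "little") + block_no.to_bytes(8, "little")
    return zlib.crc32(data + ident)

stored = stamp(b"payload", disk_id=1, block_no=42)
assert stamp(b"payload", 1, 42) == stored   # intact, in the right place
assert stamp(b"payXoad", 1, 42) != stored   # bit corruption detected
assert stamp(b"payload", 1, 43) != stored   # misdirected write detected
```

A lost write, by contrast, would pass this check (the old data and old checksum are self-consistent), which is why it needs the separate inode-resident checksum or read-after-write defenses described above.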