File System Architecture and Implementation

The file system provides data organization, structural sharing among applications, and persistence across reboots. It is constructed as a seven-layer hierarchy, where each level abstracts the complexities of the layer below it, culminating in a uniform resource interface for user processes.

Disk Layout

Data resides on a block device structured into distinct physical sectors:

Block 0: Boot sector (unused by the file system)
Block 1: Superblock containing filesystem metadata (total blocks, data blocks, inode count, log size)
Log blocks: Designated area for transaction records
Inode blocks: Packed arrays of file metadata structures
Bitmap blocks: Bit array tracking free/allocated status of data blocks
Data blocks: File contents and directory entries

Managing these raw disk blocks efficiently requires an intermediate caching mechanism to reduce slow disk accesses and ensure that only one thread modifies a block at a time.

Buffer Cache Layer

The buffer cache synchronizes disk block access and caches frequently used blocks in memory.

Structure: Maintained as a fixed-size, doubly-linked list of struct buf elements.
Buffer State: Each buffer contains a valid flag (indicating data is loaded from disk) and a disk flag (indicating data has been handed to the disk).
Concurrency Control:
- bcache.lock: A global spinlock protecting the cache metadata (which blocks are cached).
- Buffer Sleep-Lock: A per-buffer lock ensuring exclusive access to the block’s content.
Operations:
- bread: Retrieves a locked buffer. If uncached, it allocates a buffer and reads via the disk driver.
- bwrite: Writes a modified buffer to the disk.
- brelse: Releases the buffer lock and moves the buffer to the front of the list, maintaining a Least Recently Used (LRU) order.
Eviction: bget scans the linked list in reverse to find the least recently used buffer with $re f c n t == 0$ to recycle for new blocks.

While the buffer cache ensures exclusive block access in memory, the file system must also guarantee that sequences of block modifications remain atomic on disk in the event of hardware crashes.

Logging Layer (Crash Recovery)

Crash recovery prevents structural inconsistencies (e.g., orphaned allocations) by wrapping multi-block updates into atomic transactions.

Log Structure: Resides in a fixed on-disk location. Consists of a header block (containing an array of sector numbers and a block count) followed by the logged data blocks.
Transaction Lifecycle:
- begin_op: Waits for unreserved log space and halts commits. Reserves space based on a maximum block limit per system call.
- log_write: Acts as a proxy for bwrite. Pins the buffer in the cache and logs the block. Performs absorption (merging multiple writes to the same block into a single log slot).
- end_op: Decrements the count of outstanding operations. Triggers a commit if the count reaches $0$ .
Commit Sequence:
1. write_log: Copies modified blocks from the buffer cache to the log on disk.
2. write_head: Writes the log header with $co u n t > 0$ . This is the definitive atomic commit point.
3. install_trans: Copies logged blocks from the log to their final physical disk locations.
4. Clears the log header to $co u n t = 0$ .
Recovery: During boot, recover_from_log checks the header. If $co u n t > 0$ , it replays install_trans. If $co u n t = 0$ , the log is ignored.
Group Commit: Multiple concurrent system calls can pool their writes into a single transaction, amortizing the fixed disk overhead of the commit process.

Logging safely wraps all structural modifications, the most fundamental of which is allocating and freeing the raw disk blocks that hold file contents.

Block Allocator

The block allocator manages physical data space via an on-disk bitmap where a $1$ bit indicates an allocated block and a $0$ bit indicates a free block.

Allocation (balloc): Iterates from block $0$ to the maximum filesystem size, reading bitmap blocks into the buffer cache. It searches for a $0$ bit, sets it to $1$ , and returns the block number.
Deallocation (bfree): Reads the corresponding bitmap block and clears the targeted bit to $0$ .
Synchronization: Relies entirely on the buffer cache’s exclusive buffer locks and must be invoked within an active logging transaction.

With raw block allocation secured by the bitmap and logging layer, these physical blocks are organized into logical files managed by inodes.

Inode Layer

Inodes represent distinct files, directories, and special devices independent of their filenames.

On-Disk Inode (struct dinode):
- Fixed-size structures located in designated inode blocks.
- Fields: type (file, dir, device; $0$ denotes unallocated), nlink (number of directory entries referencing it), size (bytes), and addrs (array of physical block numbers).
In-Memory Inode (struct inode):
- Maintained in the itable cache only if referenced by active C pointers.
- Fields: Contains a copy of the on-disk data, plus a ref count (active pointers) and a sleep-lock.
Lifecycle and Synchronization:
- itable.lock: Global lock ensuring an inode appears in memory exactly once.
- iget: Returns a non-exclusive pointer to an inode and increments ref. Does not lock the inode.
- ilock: Acquires the inode’s sleep-lock and loads metadata from the buffer cache if necessary.
- iunlock: Releases the sleep-lock.
- iput: Decrements the ref count. If $re f == 0$ and $n l ink == 0$ , the inode is deleted: it truncates the file blocks, sets type to $0$ , and updates the disk.

The inode metadata structurally defines a file, but the actual file content must be mapped to specific physical disk blocks.

Inode Content and Data Mapping

File contents are mapped to physical disk blocks using the addrs array within the inode.

Structure:
- Direct blocks: The first NDIRECT array entries map directly to disk blocks.
- Indirect block: The next array entry points to a physical block containing NINDIRECT block numbers.
- Maximum file capacity is NDIRECT + NINDIRECT blocks.
Translation (bmap): Translates logical block offsets to physical disk sector numbers. Dynamically allocates new data blocks (and the indirect block) via balloc if an accessed entry is $0$ .
I/O Operations:
- readi / writei: Utilize bmap to map offsets, iterating through the buffer cache to copy bytes between kernel buffers and user memory.
- writei expands the file’s size attribute if writes extend past the current EOF.
Truncation (itrunc): Iterates through direct blocks, the indirect block array, and the indirect block itself, freeing each via bfree.

Just as inodes abstract physical data blocks into continuous files, directories abstract isolated inodes into a named hierarchical structure.

Directory Layer

Directories are structurally identical to files but possess a type of T_DIR and adhere to a strict internal formatting.

Content: Contains an array of struct dirent records.
Directory Entry (dirent): Consists of an inum (inode number) and a name (null-terminated string up to DIRSIZ characters). An $in u m = 0$ designates a free entry.
Operations:
- dirlookup: Scans entries for a matching name. Returns an unlocked inode pointer (via iget) and the byte offset of the entry. Returning the inode unlocked avoids deadlocks when looking up ..
- dirlink: Scans for an unallocated entry and writes a new name and inum. Yields an error if the name already exists.

Directories provide single-level naming, but users interact with deep hierarchical paths that require iterative evaluation.

Pathname Layer

The pathname layer recursively resolves string paths (e.g., /usr/bin/cat) into target inodes.

Core Functions: namei returns the target inode. nameiparent returns the parent directory’s inode and the final string element.
Evaluation (namex):
- Starts at the root inode (if the path begins with /) or the current working directory inode.
- Iterates through each path element, executing dirlookup to fetch the next inode.
Concurrency and Deadlock Prevention:
- Locks directories strictly one at a time.
- Sequence: Lock current directory $\to$ lookup next component $\to$ acquire unlocked next inode via iget $\to$ unlock current directory $\to$ lock next inode.
- The continuous ref count held by iget prevents the next inode from being deleted by concurrent unlinks during the lock transition.

Pathnames successfully identify specific files and directories, which are then abstracted alongside other I/O streams into file descriptors for user processes.

File Descriptor Layer

File descriptors create a uniform interface for disparate resources including files, pipes, and devices.

File Object (struct file): Wraps an underlying inode or pipe. Contains an I/O offset, a ref count, and readable/writable flags.
Global File Table (ftable):
- Tracks all open files globally.
- filealloc: Locates an unreferenced file slot ( $re f == 0$ ).
- filedup / fileclose: Adjust the reference count. fileclose drops the underlying pipe or inode (via iput) when the count hits $0$ .
- fileread / filewrite: Validate permissions, route the call to the inode or pipe implementation, and advance the I/O offset. Concurrent offset updates are implicitly serialized by the underlying inode sleep-lock.
System Call Integration:
- Higher-level calls like sys_link and create (which handles open, mkdir, and mkdev) combine directory layer lookups, inode allocation, and link creation.
- By executing these complex, multi-inode logic sequences entirely within logging transactions (begin_op to end_op), the file descriptor layer ensures that multi-block updates (like creating a file and writing its directory entry) remain perfectly atomic without complex partial-failure handling.

My Knowledge Base

Explorer

08 File system