File System Architecture and Implementation
The file system provides data organization, structural sharing among applications, and persistence across reboots. It is constructed as a seven-layer hierarchy, where each level abstracts the complexities of the layer below it, culminating in a uniform resource interface for user processes.
Disk Layout
Data resides on a block device structured into distinct physical sectors:
- Block 0: Boot sector (unused by the file system)
- Block 1: Superblock containing filesystem metadata (total blocks, data blocks, inode count, log size)
- Log blocks: Designated area for transaction records
- Inode blocks: Packed arrays of file metadata structures
- Bitmap blocks: Bit array tracking free/allocated status of data blocks
- Data blocks: File contents and directory entries
Managing these raw disk blocks efficiently requires an intermediate caching mechanism to reduce slow disk accesses and ensure that only one thread modifies a block at a time.
Buffer Cache Layer
The buffer cache synchronizes disk block access and caches frequently used blocks in memory.
- Structure: Maintained as a fixed-size, doubly-linked list of
struct bufelements. - Buffer State: Each buffer contains a
validflag (indicating data is loaded from disk) and adiskflag (indicating data has been handed to the disk). - Concurrency Control:
bcache.lock: A global spinlock protecting the cache metadata (which blocks are cached).- Buffer Sleep-Lock: A per-buffer lock ensuring exclusive access to the block’s content.
- Operations:
bread: Retrieves a locked buffer. If uncached, it allocates a buffer and reads via the disk driver.bwrite: Writes a modified buffer to the disk.brelse: Releases the buffer lock and moves the buffer to the front of the list, maintaining a Least Recently Used (LRU) order.
- Eviction:
bgetscans the linked list in reverse to find the least recently used buffer with to recycle for new blocks.
While the buffer cache ensures exclusive block access in memory, the file system must also guarantee that sequences of block modifications remain atomic on disk in the event of hardware crashes.
Logging Layer (Crash Recovery)
Crash recovery prevents structural inconsistencies (e.g., orphaned allocations) by wrapping multi-block updates into atomic transactions.
- Log Structure: Resides in a fixed on-disk location. Consists of a header block (containing an array of sector numbers and a block count) followed by the logged data blocks.
- Transaction Lifecycle:
begin_op: Waits for unreserved log space and halts commits. Reserves space based on a maximum block limit per system call.log_write: Acts as a proxy forbwrite. Pins the buffer in the cache and logs the block. Performs absorption (merging multiple writes to the same block into a single log slot).end_op: Decrements the count of outstanding operations. Triggers a commit if the count reaches .
- Commit Sequence:
write_log: Copies modified blocks from the buffer cache to the log on disk.write_head: Writes the log header with . This is the definitive atomic commit point.install_trans: Copies logged blocks from the log to their final physical disk locations.- Clears the log header to .
- Recovery: During boot,
recover_from_logchecks the header. If , it replaysinstall_trans. If , the log is ignored. - Group Commit: Multiple concurrent system calls can pool their writes into a single transaction, amortizing the fixed disk overhead of the commit process.
Logging safely wraps all structural modifications, the most fundamental of which is allocating and freeing the raw disk blocks that hold file contents.
Block Allocator
The block allocator manages physical data space via an on-disk bitmap where a bit indicates an allocated block and a bit indicates a free block.
- Allocation (
balloc): Iterates from block to the maximum filesystem size, reading bitmap blocks into the buffer cache. It searches for a bit, sets it to , and returns the block number. - Deallocation (
bfree): Reads the corresponding bitmap block and clears the targeted bit to . - Synchronization: Relies entirely on the buffer cache’s exclusive buffer locks and must be invoked within an active logging transaction.
With raw block allocation secured by the bitmap and logging layer, these physical blocks are organized into logical files managed by inodes.
Inode Layer
Inodes represent distinct files, directories, and special devices independent of their filenames.
- On-Disk Inode (
struct dinode):- Fixed-size structures located in designated inode blocks.
- Fields:
type(file, dir, device; denotes unallocated),nlink(number of directory entries referencing it),size(bytes), andaddrs(array of physical block numbers).
- In-Memory Inode (
struct inode):- Maintained in the
itablecache only if referenced by active C pointers. - Fields: Contains a copy of the on-disk data, plus a
refcount (active pointers) and a sleep-lock.
- Maintained in the
- Lifecycle and Synchronization:
itable.lock: Global lock ensuring an inode appears in memory exactly once.iget: Returns a non-exclusive pointer to an inode and incrementsref. Does not lock the inode.ilock: Acquires the inode’s sleep-lock and loads metadata from the buffer cache if necessary.iunlock: Releases the sleep-lock.iput: Decrements therefcount. If and , the inode is deleted: it truncates the file blocks, setstypeto , and updates the disk.
The inode metadata structurally defines a file, but the actual file content must be mapped to specific physical disk blocks.
Inode Content and Data Mapping
File contents are mapped to physical disk blocks using the addrs array within the inode.
- Structure:
- Direct blocks: The first
NDIRECTarray entries map directly to disk blocks. - Indirect block: The next array entry points to a physical block containing
NINDIRECTblock numbers. - Maximum file capacity is
NDIRECT+NINDIRECTblocks.
- Direct blocks: The first
- Translation (
bmap): Translates logical block offsets to physical disk sector numbers. Dynamically allocates new data blocks (and the indirect block) viaballocif an accessed entry is . - I/O Operations:
readi/writei: Utilizebmapto map offsets, iterating through the buffer cache to copy bytes between kernel buffers and user memory.writeiexpands the file’ssizeattribute if writes extend past the current EOF.
- Truncation (
itrunc): Iterates through direct blocks, the indirect block array, and the indirect block itself, freeing each viabfree.
Just as inodes abstract physical data blocks into continuous files, directories abstract isolated inodes into a named hierarchical structure.
Directory Layer
Directories are structurally identical to files but possess a type of T_DIR and adhere to a strict internal formatting.
- Content: Contains an array of
struct direntrecords. - Directory Entry (
dirent): Consists of aninum(inode number) and aname(null-terminated string up toDIRSIZcharacters). An designates a free entry. - Operations:
dirlookup: Scans entries for a matching name. Returns an unlocked inode pointer (viaiget) and the byte offset of the entry. Returning the inode unlocked avoids deadlocks when looking up..dirlink: Scans for an unallocated entry and writes a newnameandinum. Yields an error if the name already exists.
Directories provide single-level naming, but users interact with deep hierarchical paths that require iterative evaluation.
Pathname Layer
The pathname layer recursively resolves string paths (e.g., /usr/bin/cat) into target inodes.
- Core Functions:
nameireturns the target inode.nameiparentreturns the parent directory’s inode and the final string element. - Evaluation (
namex):- Starts at the root inode (if the path begins with
/) or the current working directory inode. - Iterates through each path element, executing
dirlookupto fetch the next inode.
- Starts at the root inode (if the path begins with
- Concurrency and Deadlock Prevention:
- Locks directories strictly one at a time.
- Sequence: Lock current directory lookup next component acquire unlocked next inode via
igetunlock current directory lock next inode. - The continuous
refcount held byigetprevents the next inode from being deleted by concurrent unlinks during the lock transition.
Pathnames successfully identify specific files and directories, which are then abstracted alongside other I/O streams into file descriptors for user processes.
File Descriptor Layer
File descriptors create a uniform interface for disparate resources including files, pipes, and devices.
- File Object (
struct file): Wraps an underlying inode or pipe. Contains an I/O offset, arefcount, andreadable/writableflags. - Global File Table (
ftable):- Tracks all open files globally.
filealloc: Locates an unreferenced file slot ().filedup/fileclose: Adjust the reference count.fileclosedrops the underlying pipe or inode (viaiput) when the count hits .fileread/filewrite: Validate permissions, route the call to the inode or pipe implementation, and advance the I/O offset. Concurrent offset updates are implicitly serialized by the underlying inode sleep-lock.
- System Call Integration:
- Higher-level calls like
sys_linkandcreate(which handlesopen,mkdir, andmkdev) combine directory layer lookups, inode allocation, and link creation. - By executing these complex, multi-inode logic sequences entirely within logging transactions (
begin_optoend_op), the file descriptor layer ensures that multi-block updates (like creating a file and writing its directory entry) remain perfectly atomic without complex partial-failure handling.
- Higher-level calls like