ftable is the global file-table state.
| Field | Meaning |
|---|---|
lock | Protects file allocation and reference counts. |
file[NFILE] | Fixed array of 100 open file objects. |
Each struct file represents one open file description.
| Field | Meaning |
|---|---|
type | File kind: FD_NONE, FD_PIPE, FD_INODE, or FD_DEVICE. |
ref | Number of references to this open file object. |
readable | Says whether reads are allowed. |
writable | Says whether writes are allowed. |
pipe | Pipe pointer when type == FD_PIPE. |
ip | Inode pointer when type == FD_INODE or FD_DEVICE. |
off | Current file offset for inode files. |
major | Device major number for device files. |
The file table is used later by file descriptor operations. fileinit does not create any open files; it only prepares the lock. Actual struct file objects are claimed later from ftable.file[] by filealloc.
filealloc
- Finds a free
struct fileinftable.file[]by scanning forref == 0. - Sets
ref = 1while holdingftable.lock. - Returns the claimed file object, or
0if allNFILEslots are in use. - Does not set
type,readable,writable,ip,pipe,off, ormajor; - The caller fills those fields.
- Used by
sys_openafter path lookup or file creation has produced an inode. - Used by
pipealloctwice, once for the pipe read end and once for the pipe write end.
filedup
- Increments
f->refwhile holdingftable.lock. - Panics if the file is not currently open (
ref < 1). - Does not allocate a new
struct file; it creates another reference to the same open file object. - Used by
dupwhen a process creates another file descriptor for the same open file. - Used during
forkso the child and parent share the same open file objects.
fileclose
- Decrements
f->refwhile holdingftable.lock. - If other references remain, it returns without closing the underlying object.
- If the last reference is gone, it copies the file object to a local
ff, clears the table slot withref = 0andtype = FD_NONE, then releasesftable.lock. - For
FD_PIPE, callspipeclose(ff.pipe, ff.writable)so the pipe end is closed and sleepers can be woken. - For
FD_INODEandFD_DEVICE, wrapsiput(ff.ip)inbegin_op()andend_op()so the inode reference is released inside a filesystem transaction.
fileread
- Rejects the operation if
f->readable == 0. - For
FD_PIPE, callspiperead(f->pipe, addr, n). - For
FD_DEVICE, validatesf->majorand dispatches todevsw[f->major].read(1, addr, n). - For
FD_INODE, locksf->ip, callsreadi(f->ip, 1, addr, f->off, n), advancesf->offby the bytes read, and unlocks the inode. - Panics if the file type is not one of the valid readable file kinds.
filewrite
- Rejects the operation if
f->writable == 0. - For
FD_PIPE, callspipewrite(f->pipe, addr, n). - For
FD_DEVICE, validatesf->majorand dispatches todevsw[f->major].write(1, addr, n). - For
FD_INODE, writes in chunks small enough to fit the filesystem log transaction limit. - Each inode-file write chunk runs inside
begin_op()andend_op(), locksf->ip, callswritei(f->ip, 1, addr + i, f->off, n1), advancesf->off, and unlocks the inode.
File System Layers Diagram
--- config: layout: dagre --- flowchart RL subgraph MF["memory inode-table side"] direction TB LOCK_FUNCS["inode locking<br><br>ilock<br>iunlock"] ICACHE_FUNCS["inode cache / lifetime<br><br>iinit<br>iget<br>idup<br>iput<br>iunlockput<br>ireclaim"] end subgraph M["memory inode side"] direction LR ICACHE["icache / itable<br><br>active struct inode objects"] INODE["struct inode<br><br>dev<br>inum<br>ref<br>lock<br>valid<br>type<br>major / minor<br>nlink<br>size<br>addrs[]"] SB_MEM["struct superblock sb<br><br>magic<br>size<br>nblocks<br>ninodes<br>nlog<br>logstart<br>inodestart<br>bmapstart"] end subgraph BF["bridge: memory inode to disk fs"] direction TB PATH_FUNCS["pathname layer<br><br>skipelem<br>namex<br>namei<br>nameiparent"] DIR_FUNCS["directory layer<br><br>namecmp<br>dirlookup<br>dirlink"] DATA_FUNCS["inode data I/O<br><br>bmap<br>readi<br>writei"] META_FUNCS["inode metadata<br><br>ialloc<br>iupdate<br>itrunc<br>stati"] end subgraph LF["log / transaction layer"] direction TB LOG_FUNCS["transaction control<br><br>initlog<br>begin_op<br>end_op"] LOG_WRITE_FUNCS["logged write + recovery<br><br>log_write<br>commit<br>write_log<br>write_head<br>install_trans<br>recover_from_log<br>read_head"] end subgraph LM["in-memory log state"] direction LR LOG_STATE["struct log<br><br>lock<br>start<br>dev<br>outstanding<br>committing<br>lh"] LOG_HDR_MEM["log.lh<br><br>n<br>block[]"] end subgraph DF["disk filesystem side"] direction TB SUPER_FUNCS["superblock init<br><br>readsb<br>fsinit"] BLOCK_FUNCS["block allocation<br><br>bzero<br>balloc<br>bfree"] end subgraph BC["buffer cache layer"] direction TB BC_FUNCS["buffer cache API<br><br>binit<br>bread<br>bwrite<br>brelse<br>bpin<br>bunpin"] BCACHE["bcache<br><br>lock<br>buf[NBUF]<br>head"] BUF["struct buf<br><br>valid<br>disk<br>dev<br>blockno<br>refcnt<br>lock<br>prev / next<br>data[BSIZE]"] end subgraph D["disk side"] direction LR BOOT_BLOCK["boot block"] SUPER_DISK["superblock on disk<br><br>filesystem geometry"] BITMAP["bitmap blocks<br><br>free / used block map"] DINODE["struct dinode<br><br>type<br>major<br>minor<br>nlink<br>size<br>addrs[]"] DATA_BLOCKS["data blocks<br><br>file bytes<br>directory dirents<br>indirect blocks"] LOG_DISK["log region on disk<br><br>log header block<br>logged block copies"] DISK_DEVICE["block device<br><br>virtio disk"] end MF -- "manages in-memory inode objects" --> M BF -- uses struct inode and superblock --> M BF -- write-side operations use transactions / log_write --> LF BF -- reads and writes disk blocks through buffers --> BC DF -- block allocation updates are logged --> LF DF -- reads/writes bitmap and superblock blocks --> BC LF -- keeps transaction state in RAM --> LM LF -- copies committed blocks through buffer cache --> BC BC -- caches and transfers disk blocks --> D ICACHE -- contains active inodes --> INODE LOG_STATE -- contains current transaction header --> LOG_HDR_MEM BCACHE -- contains many cached buffers --> BUF BUF -- transferred by virtio_disk_rw --> DISK_DEVICE ICACHE_FUNCS:::memfunc LOCK_FUNCS:::memfunc SUPER_FUNCS:::diskfunc BLOCK_FUNCS:::diskfunc META_FUNCS:::mixedfunc DATA_FUNCS:::mixedfunc DIR_FUNCS:::mixedfunc PATH_FUNCS:::mixedfunc LOG_FUNCS:::logfunc LOG_WRITE_FUNCS:::logfunc BC_FUNCS:::buffunc BCACHE:::bufmem BUF:::bufmem SB_MEM:::mem ICACHE:::mem INODE:::mem LOG_STATE:::logmem LOG_HDR_MEM:::logmem BOOT_BLOCK:::disk SUPER_DISK:::disk BITMAP:::disk DINODE:::disk DATA_BLOCKS:::data LOG_DISK:::logdisk DISK_DEVICE:::device classDef memfunc fill:#FFFFFF,stroke:#111,stroke-width:3px,color:#111 classDef diskfunc fill:#E9F1FF,stroke:#111,stroke-width:2px,color:#111 classDef mixedfunc fill:#EDE7D4,stroke:#111,stroke-width:2px,color:#111 classDef logfunc fill:#F5E6FF,stroke:#111,stroke-width:2px,color:#111 classDef buffunc fill:#E6FFF2,stroke:#111,stroke-width:2px,color:#111 classDef mem fill:#FFFFFF,stroke:#111,stroke-width:3px,color:#111 classDef logmem fill:#F9F0FF,stroke:#111,stroke-width:2px,color:#111 classDef bufmem fill:#F0FFF7,stroke:#111,stroke-width:2px,color:#111 classDef disk fill:#E9F1FF,stroke:#111,stroke-width:2px,color:#111 classDef logdisk fill:#EFE3FF,stroke:#111,stroke-width:2px,color:#111 classDef data fill:#F8F8F8,stroke:#111,stroke-width:2px,color:#111 classDef device fill:#EEEEEE,stroke:#111,stroke-width:2px,color:#111
-2026-05-12-042224.png)
File System Architecture and Implementation
The file system provides data organization, structural sharing among applications, and persistence across reboots. Disk hardware presents storage as a numbered sequence of 512-byte sectors, while xv6’s file system works in fixed-size blocks built on top of those sectors. Xv6 caches disk blocks in memory as struct buf objects, whose contents may temporarily differ from disk while a read is in progress or after software modifies a block before writing it back.
Xv6 divides the disk into a fixed layout:
- Block 0 is the unused boot sector.
- Block 1 is the superblock.
- The following blocks hold the log, inode blocks, bitmap blocks, and data blocks.
The superblock and initial file-system image are created by mkfs before xv6 boots.
The file system itself is constructed as a seven-layer hierarchy, where each level abstracts the layer below it and eventually exposes a uniform resource interface to user processes.
Layers of the xv6 file system.
Data resides on a block device structured into distinct regions:
- Block 0: Boot sector (unused by the file system)
- Block 1: Superblock containing filesystem metadata (total blocks, data blocks, inode count, log size)
- Log blocks: Designated area for transaction records
- Inode blocks: Packed arrays of file metadata structures
- Bitmap blocks: Bit array tracking free/allocated status of data blocks
- Data blocks: File contents and directory entries
Buffer Cache Layer
The buffer cache, implemented in bio.c, has two jobs: it synchronizes access to disk blocks and caches popular blocks in memory. Xv6 keeps at most one cached struct buf per disk block, so reads see recent writes and buffer locks can safely act as per-block locks.
binitlinks the fixedNBUFbuffer array into a doubly-linked LRU list headed bybcache.head.bcache.lockprotects cache metadata: which device/block each buffer represents, reference counts, and LRU list structure.- Each buffer has a sleep-lock protecting its block contents.
breadreturns a locked buffer, andbrelsereleases it. validmeans the buffer contains current block data.diskmeans the buffer has been handed to the disk driver.
The main interface is small:
bread(dev, blockno)returns a locked buffer for a block, reading from disk throughvirtio_disk_rwifvalid == 0.bwrite(b)writes a modified locked buffer back to disk.brelse(b)releases the buffer, decrements its reference count, and moves it to the front of the LRU list.
bget enforces the one-buffer-per-block invariant. While holding bcache.lock, it first searches for an existing matching buffer. If none exists, it scans from the LRU end for a buffer with refcnt == 0, rewrites its metadata, clears valid, and marks it referenced. The later sleep-lock acquisition is safe outside bcache.lock because the nonzero reference count prevents reuse for another block. If every buffer is busy, xv6 panics rather than sleeping for one.
Logging Layer (Crash Recovery)
Many file-system operations update several disk blocks. A crash after only some writes can leave the disk inconsistent, such as an inode pointing at a block that the bitmap says is free. Xv6 avoids this by logging all writes for an operation before installing them in their real disk locations.
The log lives at a fixed location recorded in the superblock. It contains a header block and the logged block copies:
- The header records the home sector number for each logged block.
- A header count of
0means no committed transaction is present. - A nonzero count means the log contains a complete committed transaction.
The logging interface wraps each file-system system call:
begin_opwaits if a commit is running or if there is not enough reserved log space.log_writerecords a modified buffer in the in-memory log, pins it in the buffer cache, and absorbs repeated writes to the same block into one log slot.end_opdecrements the number of outstanding operations and commits when the count reaches zero.
Commit has one critical point: writing the log header with a nonzero count. Before that point, recovery ignores the log and the operation disappears. After that point, recovery replays the logged writes, so the operation appears completely.
write_logcopies modified buffers from the cache to the on-disk log.write_headwrites the nonzero header and commits the transaction.install_transcopies logged blocks to their home locations.- A final
write_headclears the count to0, erasing the transaction.
recover_from_log runs during boot before user processes start. If the header count is nonzero, it installs the transaction and clears the log; otherwise it ignores the log. This makes each committed transaction atomic with respect to crashes.
Xv6 can group the writes of multiple complete system calls into one transaction, but it commits only when no file-system calls are active. The fixed log size also limits how many distinct blocks a transaction can contain, so large write calls are split into smaller transactions that fit.
Logging safely wraps structural modifications to disk blocks, including updates to inodes, directories, and the bitmap used to allocate file content blocks.
Inode Layer
An inode identifies a file, directory, or device independently of its name. The term refers both to the on-disk struct dinode and to the in-memory struct inode that the kernel uses while the inode is active.
- On-disk inodes are packed into the inode-block region and addressed by inode number.
typedistinguishes files, directories, devices, and free inodes (type == 0).nlinkcounts directory entries pointing at the inode.sizerecords file length in bytes.addrsrecords the direct and indirect data block numbers.
The inode table, itable, holds only inodes that kernel code currently references through C pointers, such as open files, current working directories, and temporary lookup state. It mainly synchronizes active inodes; caching is secondary because the buffer cache already keeps hot inode blocks in memory.
Xv6 separates inode references from inode locking:
itable.lockensures each active inode appears in memory at most once and protectsrefcounts.igetreturns a non-exclusive inode pointer and incrementsref; the pointer remains valid untiliput.ilockacquires the inode sleep-lock and loads the on-disk inode if needed.iunlockreleases the inode sleep-lock.iupdatewrites modified inode metadata back to disk.iputdrops a reference and, if bothrefandnlinkreach zero, truncates the file, marks the inode free, and writes it to disk.
This split lets many code paths hold stable inode pointers while only one thread at a time edits inode metadata or file contents. It also avoids deadlocks during operations that need more than one inode, such as pathname lookup.
iput may write to disk, so even read-looking system calls must run inside a log transaction if they might drop the last inode reference. Xv6 does not fully recover unlinked-but-still-open files after a crash, so such inodes can remain allocated on disk without any directory entry.
File contents are mapped through the inode’s addrs array. The representation is compact on disk, while bmap hides the direct and indirect block cases from higher-level code.
- The first
NDIRECTentries point directly to data blocks. - The final
addrsentry points to one indirect block containingNINDIRECTmore block numbers. - A zero block number means the file has no block allocated at that position.
bmap(ip, bn)returns the disk block for logical blockbn, allocating direct, indirect, or data blocks on demand.itruncfrees all direct blocks, all blocks named by the indirect block, and then the indirect block itself.
Higher-level inode I/O uses this mapping:
readichecks bounds, maps each file block withbmap, and copies bytes from cached buffers to the destination.writeimaps or allocates blocks, copies bytes into cached buffers, and growssizewhen writing past the old end of file.- Device inodes (
T_DEV) bypass normal file data blocks and dispatch through the device table. staticopies inode metadata into the user-visiblestatstructure.
The raw block allocator uses the on-disk bitmap beneath this mapping logic:
ballocscans bitmap blocks for a free bit, marks it allocated, and returns the corresponding data block.bfreeclears the bitmap bit for a released block.- Bitmap changes rely on buffer cache locks and must occur inside an active log transaction.
Just as inodes abstract physical data blocks into continuous files, directories abstract isolated inodes into a named hierarchical structure.
Directory Layer
Directories are files whose inode type is T_DIR. Their data is a sequence of fixed-size struct dirent records.
- Each directory entry stores an inode number and a name.
- Names are at most
DIRSIZcharacters and are NUL-terminated if shorter. - An inode number of
0marks a free directory slot.
dirlookup(dp, name, poff) scans a locked directory inode for a matching name. If it finds one, it stores the entry’s byte offset in *poff and returns the target inode through iget, but leaves that inode unlocked. This avoids deadlock: a lookup for . would otherwise try to lock the same directory inode twice, and .. can create more complex multi-directory locking cycles.
dirlink(dp, name, inum) adds a new directory entry. It first rejects duplicate names, then searches for a free slot; if none exists, it appends at the end of the directory. The new entry is written with the provided name and inode number.
Directories provide single-level naming, but users interact with deep hierarchical paths that require iterative evaluation.
Pathname Layer
Pathname lookup resolves strings like /usr/bin/cat by applying dirlookup to one path component at a time.
namei(path)returns the inode named by the full path.nameiparent(path, name)stops before the final component, returns the parent directory inode, and copies the final name intoname.- Both functions use
namex, which implements the shared traversal logic.
namex first chooses the starting inode: absolute paths begin at the root, and relative paths begin at the current working directory. It then loops over path components with skipelem.
Each iteration follows the same pattern:
- Lock the current inode with
ilockso its type and contents are available. - Verify that the current inode is a directory.
- Stop early for
nameiparentwhen the next component is the final one. - Use
dirlookupto get an unlocked referenced inode for the next component. - Unlock and release the current inode before moving to the next inode.
This design allows lookups in different directories to proceed concurrently and avoids deadlock. dirlookup returns an inode reference through iget, so the next inode cannot be deleted while namex drops the current directory lock. namex then locks only the next inode, which avoids relocking the same inode for . and avoids multi-directory lock cycles involving ...
Pathnames successfully identify specific files and directories, which are then abstracted alongside other I/O streams into file descriptors for user processes.
File Descriptor Layer
The file descriptor layer gives Unix its uniform I/O interface: regular files, directories, devices, and pipes can all be accessed through small per-process integers. Each process has its own file descriptor table, but entries point to global struct file objects.
A struct file wraps either an inode or a pipe:
typeidentifies the underlying object.refcounts references from process descriptor tables and duplicated descriptors.readableandwritablerecord the open mode.offstores the current I/O offset for inode-backed files; pipes do not use offsets.
Separate open calls create separate struct file objects with separate offsets. dup and fork share the same struct file, so they also share its offset.
The global file table, ftable, manages open-file objects:
fileallocfinds an unused file slot.filedupincrements the reference count.fileclosedrops a reference and releases the underlying pipe or inode when the count reaches zero.filestatcallsstatifor inode-backed files.filereadandfilewritecheck permissions, dispatch to pipe or inode code, and advanceofffor inode files.
Inode locking also serializes offset updates in fileread and filewrite. Concurrent writes through the same open file will not overwrite the same offset accidentally, though their data may still be interleaved.
Most file-system system calls are thin wrappers around these layers, but calls that edit directories rely heavily on transactions:
sys_linkcreates a new directory entry for an existing inode and incrementsnlink; on error, it must undo the link count change.sys_unlinkremoves a directory entry and decrements the target inode’s link count.createhandlesopen(O_CREATE),mkdir, andmknodby finding the parent directory, allocating or validating the target inode, initializing directory entries if needed, and linking the new inode into the parent.sys_openeither callscreateornamei, checks open-mode rules, allocates astruct file, allocates a process file descriptor, and connects them.sys_pipeallocates a pipe and installs the read and write ends as two file descriptors.
Transactions make multi-block updates such as link creation, unlinking, and new-file creation atomic with respect to crashes, so the syscall code does not need to reason about every possible disk-write ordering.
Real World
Real file systems solve the same problems as xv6, but with more scalable data structures, better failure handling, and richer resource abstractions.
- Buffer caches still provide caching and synchronization, but modern systems use more efficient lookup and eviction structures than xv6’s simple LRU list. They are often integrated with the virtual memory system to support memory-mapped files.
- Production logging systems avoid many xv6 limitations: they allow more concurrency, avoid logging whole blocks for tiny changes when possible, and batch disk writes more efficiently.
- Logging is not the only crash-recovery strategy. Older systems used tools like
fsckto scan and repair the whole file system after reboot, but log replay is usually much faster and gives cleaner operation-level atomicity. - The inode and directory model has persisted from early UNIX into systems like UFS, ext2, and ext3, but large directories often use balanced trees or similar indexed structures instead of xv6’s linear scans.
- Real systems must handle disk failures gracefully. Xv6 panics on disk errors, while production systems often isolate damaged files, rely on redundancy, or report recoverable I/O errors.
- Modern storage often spans multiple disks and can grow or shrink dynamically through RAID, volume managers, or integrated designs like ZFS. Xv6 assumes one fixed-size disk and a fixed inode region.
- Modern Unix-like systems expose many non-disk resources through file operations, including named pipes, network sockets, network file systems,
/proc, and user-level file systems. Instead of hard-coded branches infilereadandfilewrite, they typically dispatch through operation tables of function pointers.