ftable is the global file-table state.

Field	Meaning
`lock`	Protects file allocation and reference counts.
`file[NFILE]`	Fixed array of 100 open file objects.

Each struct file represents one open file description.

Field	Meaning
`type`	File kind: `FD_NONE`, `FD_PIPE`, `FD_INODE`, or `FD_DEVICE`.
`ref`	Number of references to this open file object.
`readable`	Says whether reads are allowed.
`writable`	Says whether writes are allowed.
`pipe`	Pipe pointer when `type == FD_PIPE`.
`ip`	Inode pointer when `type == FD_INODE` or `FD_DEVICE`.
`off`	Current file offset for inode files.
`major`	Device major number for device files.

The file table is used later by file descriptor operations. fileinit does not create any open files; it only prepares the lock. Actual struct file objects are claimed later from ftable.file[] by filealloc.

filealloc

Finds a free struct file in ftable.file[] by scanning for ref == 0.
Sets ref = 1 while holding ftable.lock.
Returns the claimed file object, or 0 if all NFILE slots are in use.
Does not set type, readable, writable, ip, pipe, off, or major;
The caller fills those fields.
Used by sys_open after path lookup or file creation has produced an inode.
Used by pipealloc twice, once for the pipe read end and once for the pipe write end.

filedup

Increments f->ref while holding ftable.lock.
Panics if the file is not currently open (ref < 1).
Does not allocate a new struct file; it creates another reference to the same open file object.
Used by dup when a process creates another file descriptor for the same open file.
Used during fork so the child and parent share the same open file objects.

fileclose

Decrements f->ref while holding ftable.lock.
If other references remain, it returns without closing the underlying object.
If the last reference is gone, it copies the file object to a local ff, clears the table slot with ref = 0 and type = FD_NONE, then releases ftable.lock.
For FD_PIPE, calls pipeclose(ff.pipe, ff.writable) so the pipe end is closed and sleepers can be woken.
For FD_INODE and FD_DEVICE, wraps iput(ff.ip) in begin_op() and end_op() so the inode reference is released inside a filesystem transaction.

fileread

Rejects the operation if f->readable == 0.
For FD_PIPE, calls piperead(f->pipe, addr, n).
For FD_DEVICE, validates f->major and dispatches to devsw[f->major].read(1, addr, n).
For FD_INODE, locks f->ip, calls readi(f->ip, 1, addr, f->off, n), advances f->off by the bytes read, and unlocks the inode.
Panics if the file type is not one of the valid readable file kinds.

filewrite

Rejects the operation if f->writable == 0.
For FD_PIPE, calls pipewrite(f->pipe, addr, n).
For FD_DEVICE, validates f->major and dispatches to devsw[f->major].write(1, addr, n).
For FD_INODE, writes in chunks small enough to fit the filesystem log transaction limit.
Each inode-file write chunk runs inside begin_op() and end_op(), locks f->ip, calls writei(f->ip, 1, addr + i, f->off, n1), advances f->off, and unlocks the inode.

File System Layers Diagram

---
config:
  layout: dagre
---
flowchart RL
 subgraph MF["memory inode-table side"]
    direction TB
        LOCK_FUNCS["inode locking<br><br>ilock<br>iunlock"]
        ICACHE_FUNCS["inode cache / lifetime<br><br>iinit<br>iget<br>idup<br>iput<br>iunlockput<br>ireclaim"]
  end

 subgraph M["memory inode side"]
    direction LR
        ICACHE["icache / itable<br><br>active struct inode objects"]
        INODE["struct inode<br><br>dev<br>inum<br>ref<br>lock<br>valid<br>type<br>major / minor<br>nlink<br>size<br>addrs[]"]
        SB_MEM["struct superblock sb<br><br>magic<br>size<br>nblocks<br>ninodes<br>nlog<br>logstart<br>inodestart<br>bmapstart"]
  end

 subgraph BF["bridge: memory inode to disk fs"]
    direction TB
        PATH_FUNCS["pathname layer<br><br>skipelem<br>namex<br>namei<br>nameiparent"]
        DIR_FUNCS["directory layer<br><br>namecmp<br>dirlookup<br>dirlink"]
        DATA_FUNCS["inode data I/O<br><br>bmap<br>readi<br>writei"]
        META_FUNCS["inode metadata<br><br>ialloc<br>iupdate<br>itrunc<br>stati"]
  end

 subgraph LF["log / transaction layer"]
    direction TB
        LOG_FUNCS["transaction control<br><br>initlog<br>begin_op<br>end_op"]
        LOG_WRITE_FUNCS["logged write + recovery<br><br>log_write<br>commit<br>write_log<br>write_head<br>install_trans<br>recover_from_log<br>read_head"]
  end

 subgraph LM["in-memory log state"]
    direction LR
        LOG_STATE["struct log<br><br>lock<br>start<br>dev<br>outstanding<br>committing<br>lh"]
        LOG_HDR_MEM["log.lh<br><br>n<br>block[]"]
  end

 subgraph DF["disk filesystem side"]
    direction TB
        SUPER_FUNCS["superblock init<br><br>readsb<br>fsinit"]
        BLOCK_FUNCS["block allocation<br><br>bzero<br>balloc<br>bfree"]
  end

 subgraph BC["buffer cache layer"]
    direction TB
        BC_FUNCS["buffer cache API<br><br>binit<br>bread<br>bwrite<br>brelse<br>bpin<br>bunpin"]
        BCACHE["bcache<br><br>lock<br>buf[NBUF]<br>head"]
        BUF["struct buf<br><br>valid<br>disk<br>dev<br>blockno<br>refcnt<br>lock<br>prev / next<br>data[BSIZE]"]
  end

 subgraph D["disk side"]
    direction LR
        BOOT_BLOCK["boot block"]
        SUPER_DISK["superblock on disk<br><br>filesystem geometry"]
        BITMAP["bitmap blocks<br><br>free / used block map"]
        DINODE["struct dinode<br><br>type<br>major<br>minor<br>nlink<br>size<br>addrs[]"]
        DATA_BLOCKS["data blocks<br><br>file bytes<br>directory dirents<br>indirect blocks"]
        LOG_DISK["log region on disk<br><br>log header block<br>logged block copies"]
        DISK_DEVICE["block device<br><br>virtio disk"]
  end

    MF -- "manages in-memory inode objects" --> M
    BF -- uses struct inode and superblock --> M
    BF -- write-side operations use transactions / log_write --> LF
    BF -- reads and writes disk blocks through buffers --> BC
    DF -- block allocation updates are logged --> LF
    DF -- reads/writes bitmap and superblock blocks --> BC
    LF -- keeps transaction state in RAM --> LM
    LF -- copies committed blocks through buffer cache --> BC
    BC -- caches and transfers disk blocks --> D
    ICACHE -- contains active inodes --> INODE
    LOG_STATE -- contains current transaction header --> LOG_HDR_MEM
    BCACHE -- contains many cached buffers --> BUF
    BUF -- transferred by virtio_disk_rw --> DISK_DEVICE

     ICACHE_FUNCS:::memfunc
     LOCK_FUNCS:::memfunc
     SUPER_FUNCS:::diskfunc
     BLOCK_FUNCS:::diskfunc
     META_FUNCS:::mixedfunc
     DATA_FUNCS:::mixedfunc
     DIR_FUNCS:::mixedfunc
     PATH_FUNCS:::mixedfunc
     LOG_FUNCS:::logfunc
     LOG_WRITE_FUNCS:::logfunc
     BC_FUNCS:::buffunc
     BCACHE:::bufmem
     BUF:::bufmem
     SB_MEM:::mem
     ICACHE:::mem
     INODE:::mem
     LOG_STATE:::logmem
     LOG_HDR_MEM:::logmem
     BOOT_BLOCK:::disk
     SUPER_DISK:::disk
     BITMAP:::disk
     DINODE:::disk
     DATA_BLOCKS:::data
     LOG_DISK:::logdisk
     DISK_DEVICE:::device
    classDef memfunc fill:#FFFFFF,stroke:#111,stroke-width:3px,color:#111
    classDef diskfunc fill:#E9F1FF,stroke:#111,stroke-width:2px,color:#111
    classDef mixedfunc fill:#EDE7D4,stroke:#111,stroke-width:2px,color:#111
    classDef logfunc fill:#F5E6FF,stroke:#111,stroke-width:2px,color:#111
    classDef buffunc fill:#E6FFF2,stroke:#111,stroke-width:2px,color:#111
    classDef mem fill:#FFFFFF,stroke:#111,stroke-width:3px,color:#111
    classDef logmem fill:#F9F0FF,stroke:#111,stroke-width:2px,color:#111
    classDef bufmem fill:#F0FFF7,stroke:#111,stroke-width:2px,color:#111
    classDef disk fill:#E9F1FF,stroke:#111,stroke-width:2px,color:#111
    classDef logdisk fill:#EFE3FF,stroke:#111,stroke-width:2px,color:#111
    classDef data fill:#F8F8F8,stroke:#111,stroke-width:2px,color:#111
    classDef device fill:#EEEEEE,stroke:#111,stroke-width:2px,color:#111

File System Architecture and Implementation

The file system provides data organization, structural sharing among applications, and persistence across reboots. Disk hardware presents storage as a numbered sequence of 512-byte sectors, while xv6’s file system works in fixed-size blocks built on top of those sectors. Xv6 caches disk blocks in memory as struct buf objects, whose contents may temporarily differ from disk while a read is in progress or after software modifies a block before writing it back.

Xv6 divides the disk into a fixed layout:

Block 0 is the unused boot sector.
Block 1 is the superblock.
The following blocks hold the log, inode blocks, bitmap blocks, and data blocks.

The superblock and initial file-system image are created by mkfs before xv6 boots.

The file system itself is constructed as a seven-layer hierarchy, where each level abstracts the layer below it and eventually exposes a uniform resource interface to user processes.

Layers of the xv6 file system.

Data resides on a block device structured into distinct regions:

Block 0: Boot sector (unused by the file system)
Block 1: Superblock containing filesystem metadata (total blocks, data blocks, inode count, log size)
Log blocks: Designated area for transaction records
Inode blocks: Packed arrays of file metadata structures
Bitmap blocks: Bit array tracking free/allocated status of data blocks
Data blocks: File contents and directory entries

Buffer Cache Layer

The buffer cache, implemented in bio.c, has two jobs: it synchronizes access to disk blocks and caches popular blocks in memory. Xv6 keeps at most one cached struct buf per disk block, so reads see recent writes and buffer locks can safely act as per-block locks.

binit links the fixed NBUF buffer array into a doubly-linked LRU list headed by bcache.head.
bcache.lock protects cache metadata: which device/block each buffer represents, reference counts, and LRU list structure.
Each buffer has a sleep-lock protecting its block contents. bread returns a locked buffer, and brelse releases it.
valid means the buffer contains current block data. disk means the buffer has been handed to the disk driver.

The main interface is small:

bread(dev, blockno) returns a locked buffer for a block, reading from disk through virtio_disk_rw if valid == 0.
bwrite(b) writes a modified locked buffer back to disk.
brelse(b) releases the buffer, decrements its reference count, and moves it to the front of the LRU list.

bget enforces the one-buffer-per-block invariant. While holding bcache.lock, it first searches for an existing matching buffer. If none exists, it scans from the LRU end for a buffer with refcnt == 0, rewrites its metadata, clears valid, and marks it referenced. The later sleep-lock acquisition is safe outside bcache.lock because the nonzero reference count prevents reuse for another block. If every buffer is busy, xv6 panics rather than sleeping for one.

Logging Layer (Crash Recovery)

Many file-system operations update several disk blocks. A crash after only some writes can leave the disk inconsistent, such as an inode pointing at a block that the bitmap says is free. Xv6 avoids this by logging all writes for an operation before installing them in their real disk locations.

The log lives at a fixed location recorded in the superblock. It contains a header block and the logged block copies:

The header records the home sector number for each logged block.
A header count of 0 means no committed transaction is present.
A nonzero count means the log contains a complete committed transaction.

The logging interface wraps each file-system system call:

begin_op waits if a commit is running or if there is not enough reserved log space.
log_write records a modified buffer in the in-memory log, pins it in the buffer cache, and absorbs repeated writes to the same block into one log slot.
end_op decrements the number of outstanding operations and commits when the count reaches zero.

Commit has one critical point: writing the log header with a nonzero count. Before that point, recovery ignores the log and the operation disappears. After that point, recovery replays the logged writes, so the operation appears completely.

write_log copies modified buffers from the cache to the on-disk log.
write_head writes the nonzero header and commits the transaction.
install_trans copies logged blocks to their home locations.
A final write_head clears the count to 0, erasing the transaction.

recover_from_log runs during boot before user processes start. If the header count is nonzero, it installs the transaction and clears the log; otherwise it ignores the log. This makes each committed transaction atomic with respect to crashes.

Xv6 can group the writes of multiple complete system calls into one transaction, but it commits only when no file-system calls are active. The fixed log size also limits how many distinct blocks a transaction can contain, so large write calls are split into smaller transactions that fit.

Logging safely wraps structural modifications to disk blocks, including updates to inodes, directories, and the bitmap used to allocate file content blocks.

Inode Layer

An inode identifies a file, directory, or device independently of its name. The term refers both to the on-disk struct dinode and to the in-memory struct inode that the kernel uses while the inode is active.

On-disk inodes are packed into the inode-block region and addressed by inode number.
type distinguishes files, directories, devices, and free inodes (type == 0).
nlink counts directory entries pointing at the inode.
size records file length in bytes.
addrs records the direct and indirect data block numbers.

The inode table, itable, holds only inodes that kernel code currently references through C pointers, such as open files, current working directories, and temporary lookup state. It mainly synchronizes active inodes; caching is secondary because the buffer cache already keeps hot inode blocks in memory.

Xv6 separates inode references from inode locking:

itable.lock ensures each active inode appears in memory at most once and protects ref counts.
iget returns a non-exclusive inode pointer and increments ref; the pointer remains valid until iput.
ilock acquires the inode sleep-lock and loads the on-disk inode if needed.
iunlock releases the inode sleep-lock.
iupdate writes modified inode metadata back to disk.
iput drops a reference and, if both ref and nlink reach zero, truncates the file, marks the inode free, and writes it to disk.

This split lets many code paths hold stable inode pointers while only one thread at a time edits inode metadata or file contents. It also avoids deadlocks during operations that need more than one inode, such as pathname lookup.

iput may write to disk, so even read-looking system calls must run inside a log transaction if they might drop the last inode reference. Xv6 does not fully recover unlinked-but-still-open files after a crash, so such inodes can remain allocated on disk without any directory entry.

File contents are mapped through the inode’s addrs array. The representation is compact on disk, while bmap hides the direct and indirect block cases from higher-level code.

The first NDIRECT entries point directly to data blocks.
The final addrs entry points to one indirect block containing NINDIRECT more block numbers.
A zero block number means the file has no block allocated at that position.
bmap(ip, bn) returns the disk block for logical block bn, allocating direct, indirect, or data blocks on demand.
itrunc frees all direct blocks, all blocks named by the indirect block, and then the indirect block itself.

Higher-level inode I/O uses this mapping:

readi checks bounds, maps each file block with bmap, and copies bytes from cached buffers to the destination.
writei maps or allocates blocks, copies bytes into cached buffers, and grows size when writing past the old end of file.
Device inodes (T_DEV) bypass normal file data blocks and dispatch through the device table.
stati copies inode metadata into the user-visible stat structure.

The raw block allocator uses the on-disk bitmap beneath this mapping logic:

balloc scans bitmap blocks for a free bit, marks it allocated, and returns the corresponding data block.
bfree clears the bitmap bit for a released block.
Bitmap changes rely on buffer cache locks and must occur inside an active log transaction.

Just as inodes abstract physical data blocks into continuous files, directories abstract isolated inodes into a named hierarchical structure.

Directory Layer

Directories are files whose inode type is T_DIR. Their data is a sequence of fixed-size struct dirent records.

Each directory entry stores an inode number and a name.
Names are at most DIRSIZ characters and are NUL-terminated if shorter.
An inode number of 0 marks a free directory slot.

dirlookup(dp, name, poff) scans a locked directory inode for a matching name. If it finds one, it stores the entry’s byte offset in *poff and returns the target inode through iget, but leaves that inode unlocked. This avoids deadlock: a lookup for . would otherwise try to lock the same directory inode twice, and .. can create more complex multi-directory locking cycles.

dirlink(dp, name, inum) adds a new directory entry. It first rejects duplicate names, then searches for a free slot; if none exists, it appends at the end of the directory. The new entry is written with the provided name and inode number.

Directories provide single-level naming, but users interact with deep hierarchical paths that require iterative evaluation.

Pathname Layer

Pathname lookup resolves strings like /usr/bin/cat by applying dirlookup to one path component at a time.

namei(path) returns the inode named by the full path.
nameiparent(path, name) stops before the final component, returns the parent directory inode, and copies the final name into name.
Both functions use namex, which implements the shared traversal logic.

namex first chooses the starting inode: absolute paths begin at the root, and relative paths begin at the current working directory. It then loops over path components with skipelem.

Each iteration follows the same pattern:

Lock the current inode with ilock so its type and contents are available.
Verify that the current inode is a directory.
Stop early for nameiparent when the next component is the final one.
Use dirlookup to get an unlocked referenced inode for the next component.
Unlock and release the current inode before moving to the next inode.

This design allows lookups in different directories to proceed concurrently and avoids deadlock. dirlookup returns an inode reference through iget, so the next inode cannot be deleted while namex drops the current directory lock. namex then locks only the next inode, which avoids relocking the same inode for . and avoids multi-directory lock cycles involving ...

Pathnames successfully identify specific files and directories, which are then abstracted alongside other I/O streams into file descriptors for user processes.

File Descriptor Layer

The file descriptor layer gives Unix its uniform I/O interface: regular files, directories, devices, and pipes can all be accessed through small per-process integers. Each process has its own file descriptor table, but entries point to global struct file objects.

A struct file wraps either an inode or a pipe:

type identifies the underlying object.
ref counts references from process descriptor tables and duplicated descriptors.
readable and writable record the open mode.
off stores the current I/O offset for inode-backed files; pipes do not use offsets.

Separate open calls create separate struct file objects with separate offsets. dup and fork share the same struct file, so they also share its offset.

The global file table, ftable, manages open-file objects:

filealloc finds an unused file slot.
filedup increments the reference count.
fileclose drops a reference and releases the underlying pipe or inode when the count reaches zero.
filestat calls stati for inode-backed files.
fileread and filewrite check permissions, dispatch to pipe or inode code, and advance off for inode files.

Inode locking also serializes offset updates in fileread and filewrite. Concurrent writes through the same open file will not overwrite the same offset accidentally, though their data may still be interleaved.

Most file-system system calls are thin wrappers around these layers, but calls that edit directories rely heavily on transactions:

sys_link creates a new directory entry for an existing inode and increments nlink; on error, it must undo the link count change.
sys_unlink removes a directory entry and decrements the target inode’s link count.
create handles open(O_CREATE), mkdir, and mknod by finding the parent directory, allocating or validating the target inode, initializing directory entries if needed, and linking the new inode into the parent.
sys_open either calls create or namei, checks open-mode rules, allocates a struct file, allocates a process file descriptor, and connects them.
sys_pipe allocates a pipe and installs the read and write ends as two file descriptors.

Transactions make multi-block updates such as link creation, unlinking, and new-file creation atomic with respect to crashes, so the syscall code does not need to reason about every possible disk-write ordering.

Real World

Real file systems solve the same problems as xv6, but with more scalable data structures, better failure handling, and richer resource abstractions.

Buffer caches still provide caching and synchronization, but modern systems use more efficient lookup and eviction structures than xv6’s simple LRU list. They are often integrated with the virtual memory system to support memory-mapped files.
Production logging systems avoid many xv6 limitations: they allow more concurrency, avoid logging whole blocks for tiny changes when possible, and batch disk writes more efficiently.
Logging is not the only crash-recovery strategy. Older systems used tools like fsck to scan and repair the whole file system after reboot, but log replay is usually much faster and gives cleaner operation-level atomicity.
The inode and directory model has persisted from early UNIX into systems like UFS, ext2, and ext3, but large directories often use balanced trees or similar indexed structures instead of xv6’s linear scans.
Real systems must handle disk failures gracefully. Xv6 panics on disk errors, while production systems often isolate damaged files, rely on redundancy, or report recoverable I/O errors.
Modern storage often spans multiple disks and can grow or shrink dynamically through RAID, volume managers, or integrated designs like ZFS. Xv6 assumes one fixed-size disk and a fixed inode region.
Modern Unix-like systems expose many non-disk resources through file operations, including named pipes, network sockets, network file systems, /proc, and user-level file systems. Instead of hard-coded branches in fileread and filewrite, they typically dispatch through operation tables of function pointers.

My Knowledge Base

Explorer

9 VFS