An operating system manages and abstracts low-level hardware, shares physical resources among multiple programs, and provides controlled ways for programs to interact.

  • Kernel: A special privileged program that provides core services to running programs.
  • Process: A running program consisting of memory (instructions, data, and a stack) and private state managed by the kernel.
  • System Call: A defined entry point in the operating system’s interface that transitions execution from user space to kernel space to perform privileged operations.
  • Hardware Protection: The kernel utilizes CPU mechanisms to ensure processes access only their own memory and execute without hardware privileges.

Processes and memory

The operating system time-shares hardware by transparently switching available CPUs among waiting processes, saving and restoring CPU registers during transitions.

  • Process Identifier (PID): A unique integer the kernel associates with each process.
  • Process Creation:
    • fork() creates a new child process by exactly duplicating the parent’s memory contents.
    • fork() returns 0 in the child process and the child’s PID in the parent process.
    • The parent and child execute independently with different memory spaces and registers; changes in one do not affect the other.
  • Process Execution:
    • exec(file, argv) replaces the calling process’s memory with a new memory image loaded from a file (structured in the ELF format).
    • exec() takes an executable filename and an array of string arguments, starting execution at the binary’s declared entry point without returning to the calling program.
  • Process Termination and Synchronization:
    • exit(status) stops the calling process and releases resources like memory and open files. A status of 0 conventionally indicates success, while 1 indicates failure.
    • wait(*status) pauses the calling process until a child exits, returning the child’s PID and copying its exit status into the provided address.
  • Memory Management:
    • Most user-space memory is allocated implicitly during fork() and exec().
    • sbrk(n) grows a process’s data memory by n bytes dynamically at run-time and returns the location of the new memory.
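The fork/exit/wait pattern above can be sketched as follows. This is a minimal illustration written against POSIX rather than xv6 itself: the helper name run_child is hypothetical, waitpid stands in for wait, and POSIX packs the exit code into the status word (hence WEXITSTATUS), whereas xv6 stores the status directly.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits immediately with `code`; the parent waits for it.
 * Returns the child's exit status as seen by the parent, or -1 on error.
 * (run_child is a hypothetical helper name used only for this sketch.) */
int run_child(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;              /* fork failed */
    if (pid == 0)
        _exit(code);            /* child: fork() returned 0 */

    /* Parent: fork() returned the child's PID. */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    /* POSIX encodes the exit code inside status; extract it. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_child(7) should hand 7 back to the parent, showing that the only channel between the two independent memory images here is the exit status.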

I/O and file descriptors

A file descriptor is a small integer acting as an index into a per-process table, representing a kernel-managed object such as a file, directory, device, or pipe.

  • Standard Conventions: By default, processes read from file descriptor 0 (standard input), write to 1 (standard output), and write errors to 2 (standard error).
  • Core I/O Operations:
    • read(fd, buf, n) reads up to n bytes from fd into buf, advancing the file offset by the number of bytes read. It returns 0 to indicate the end of the file.
    • write(fd, buf, n) writes n bytes from buf to fd, advancing the file offset sequentially.
    • close(fd) releases a file descriptor for future reuse. Newly allocated file descriptors always use the lowest-numbered unused integer for the current process.
  • I/O Redirection:
    • fork() copies the parent’s file descriptor table to the child, granting the child the exact same open files.
    • exec() replaces the process memory but completely preserves the file table.
    • A shell redirects I/O by forking a child, closing standard file descriptors, opening specific files to claim those low-numbered descriptors, and then calling exec() to run the new program.
  • Offset Sharing:
    • Underlying file offsets are shared between file descriptors only if they were derived from the same original descriptor via fork() or dup().
    • dup(fd) duplicates an existing descriptor, returning a new one that refers to the same underlying I/O object and shares its offset.
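The redirection recipe above (fork, close a standard descriptor, open a file so it claims the freed slot) can be sketched like this. It is a POSIX illustration, not xv6 code: the function names and the /tmp path template are assumptions, and the child writes directly where a real shell would call exec().

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child side: make descriptor 1 refer to `path`, then write `msg` to it. */
static void child_with_stdout_redirected(const char *path, const char *msg)
{
    close(1);                                   /* free descriptor 1 */
    /* open() returns the lowest unused descriptor, which is now 1
     * (assuming descriptor 0 is open, as it normally is). */
    if (open(path, O_WRONLY | O_TRUNC) != 1)
        _exit(1);
    /* A shell would call exec() here; the new program would inherit
     * descriptor 1 unchanged. This sketch writes directly instead. */
    size_t len = strlen(msg);
    _exit(write(1, msg, len) == (ssize_t)len ? 0 : 1);
}

/* Runs the redirected child against a temporary file, then reads the file
 * back into `out`. Returns the number of bytes read, or -1 on error. */
int redirect_demo(const char *msg, char *out, size_t cap)
{
    char path[] = "/tmp/redirect_demo_XXXXXX";  /* hypothetical location */
    int fd = mkstemp(path);                     /* parent keeps fd to read back */
    if (fd < 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    if (pid == 0)
        child_with_stdout_redirected(path, msg);

    int status;
    ssize_t n = -1;
    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        n = read(fd, out, cap);
    close(fd);
    unlink(path);
    return (int)n;
}
```

The child never mentions the file again after open(): anything written to descriptor 1 lands in it, which is why the program run by exec() needs no knowledge of the redirection.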

Pipes

A pipe is a small kernel buffer exposed to processes as a pair of file descriptors: one for reading and one for writing.

  • Creation: pipe(p) creates the buffer and records the read descriptor in p[0] and the write descriptor in p[1].
  • Communication Flow:
    • Writing data to the write end makes it available for reading at the read end.
    • If no data is available, a read operation blocks until data is written or until all file descriptors referring to the write end are closed.
    • If all write ends are closed, read() returns 0, simulating an end-of-file. This requires processes to rigorously close unused write descriptors to prevent readers from waiting indefinitely.
  • Advantages Over Temporary Files:
    • Pipes automatically clean themselves up, whereas temporary files require explicit deletion.
    • Pipes can pass arbitrarily long streams of data without being constrained by disk space.
    • Pipes allow parallel execution of pipeline stages, unlike files which require the first program to finish before the second starts.
    • Blocking reads and writes in pipes are significantly more efficient than non-blocking file semantics for inter-process communication.
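The communication flow above can be sketched with a child that writes into a pipe and a parent that reads until end-of-file. This is a POSIX illustration; pipe_roundtrip is a hypothetical helper name.

```c
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child writes `msg` into a pipe; parent reads until end-of-file.
 * Returns the number of bytes received, or -1 on error. */
int pipe_roundtrip(const char *msg, char *out, size_t cap)
{
    int p[2];                       /* p[0]: read end, p[1]: write end */
    if (pipe(p) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {
        close(p[0]);                /* child does not read */
        size_t len = strlen(msg);
        ssize_t w = write(p[1], msg, len);
        close(p[1]);
        _exit(w == (ssize_t)len ? 0 : 1);
    }

    /* Parent must close its own write end, or read() would never see
     * end-of-file even after the child exits. */
    close(p[1]);
    size_t total = 0;
    ssize_t n;
    while (total < cap && (n = read(p[0], out + total, cap - total)) > 0)
        total += (size_t)n;
    close(p[0]);
    waitpid(pid, NULL, 0);
    return (int)total;
}
```

Removing the parent’s close(p[1]) is an easy way to observe the hang described above: the read loop blocks forever because one write end is still open.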

File system

The file system provides data files (uninterpreted byte arrays) and directories (named references to files and other directories), structured as a tree originating from a root directory.

  • Path Resolution:
    • Paths beginning with / are evaluated from the root directory.
    • Paths not beginning with / are evaluated relative to the calling process’s current directory, which can be modified using chdir(dir).
  • Inodes and Links:
    • Inode: The underlying physical file object that holds file metadata, including type (file, directory, or device), length, disk location, and the number of links.
    • Link: An entry in a directory containing a filename and a reference to an inode.
    • A single inode can have multiple links (names) pointing to it.
  • File System Operations:
    • mkdir(dir) creates a new directory.
    • open(file, O_CREATE) creates a new data file.
    • mknod(file, major, minor) creates a special device file that diverts I/O system calls directly to a kernel device implementation identified by major and minor numbers.
    • link(file1, file2) creates a new name (file2) referring to the exact same inode as an existing file (file1).
    • unlink(file) removes a name from the file system. The underlying inode and disk space are only freed when the file’s link count drops to 0 and no active file descriptors refer to it.
    • fstat(fd, *st) and stat(file, *st) retrieve inode information into a struct stat object.
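The relationship between names, inodes, and link counts can be observed with a short POSIX sketch. The helper names and the file-name pattern are assumptions made for illustration; the directory passed in must be writable and support hard links.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Link count of the inode behind an open descriptor, or -1 on error. */
int nlink_of(int fd)
{
    struct stat st;
    return fstat(fd, &st) < 0 ? -1 : (int)st.st_nlink;
}

/* Create a file in `dir`, give it a second name, then remove both names.
 * counts[] receives the link count after each step. Returns 0 on success. */
int link_demo(const char *dir, int counts[3])
{
    char a[256], b[256];            /* hypothetical file names */
    snprintf(a, sizeof a, "%s/link_demo.%d.a", dir, (int)getpid());
    snprintf(b, sizeof b, "%s/link_demo.%d.b", dir, (int)getpid());

    int fd = open(a, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return -1;
    counts[0] = nlink_of(fd);       /* one name */

    if (link(a, b) < 0) {
        close(fd);
        unlink(a);
        return -1;
    }
    counts[1] = nlink_of(fd);       /* two names, same inode */

    unlink(a);
    unlink(b);
    counts[2] = nlink_of(fd);       /* no names left; inode survives while fd is open */

    close(fd);                      /* last reference gone: inode can be freed */
    return 0;
}
```

The third count illustrates the rule above: with no names remaining, the inode still exists until the last descriptor is closed.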

Core requirements

  • Operating systems must satisfy three core requirements: multiplexing resources, isolating activities, and enabling controlled interaction between processes.
  • Cooperative time-sharing and direct hardware access by applications are insufficient for strong isolation, as they require applications to be bug-free and mutually trusting.
  • Hardware resources are abstracted into kernel-managed services to enforce safety and convenience.
  • Storage is abstracted into file systems, physical memory is abstracted into memory images via exec, and CPUs are abstracted by transparently switching context between processes.
  • File descriptors abstract diverse I/O details (e.g., pipes, files) and natively support interaction protocols, such as automatically generating end-of-file signals when a pipeline fails.

User mode, supervisor mode and system calls

  • CPUs provide hardware execution modes to establish a hard boundary between application code and the operating system.
  • Machine Mode: Starts upon CPU boot, executes with full hardware privilege, and is strictly used for low-level computer configuration.
  • Supervisor Mode: Allows execution of privileged instructions necessary for OS operations, such as enabling interrupts or writing to page table registers. Software running in this mode is the kernel, executing in kernel space.
  • User Mode: Restricts execution to unprivileged instructions. Applications execute in this mode within user space.
  • If a user-mode application attempts a privileged instruction, the CPU suppresses the instruction and forcefully switches to supervisor mode so the kernel can terminate the application.
  • Applications invoke kernel services via system calls using specialized instructions (e.g., the RISC-V ecall instruction).
  • System calls switch the CPU to supervisor mode at a strictly kernel-defined entry point, preventing malicious applications from bypassing argument validation or access control checks.

Kernel organization

  • Monolithic Kernel: The entirety of the operating system resides within the kernel and executes in supervisor mode.
    • Subsystems (e.g., file systems, virtual memory) are tightly integrated, allowing them to share data structures like buffer caches efficiently.
    • Internal interfaces are complex; a single programming error in supervisor mode typically causes a fatal failure of the entire system.
    • xv6 and Linux utilize a monolithic kernel structure.
  • Microkernel: The kernel is minimized to only low-level functions (e.g., hardware access, message passing), while the bulk of the OS runs as user-level processes called servers.
    • Applications request services (like file system operations) by passing messages to these servers via the kernel’s inter-process communication (IPC) mechanism.
    • This limits the amount of code executing with hardware privileges, reducing the risk of catastrophic system crashes.
    • Minix, L4, and QNX utilize a microkernel structure.

The Process Abstraction

  • A process is the fundamental unit of isolation, shielding an application’s memory, CPU state, and file descriptors from interference by other processes.
  • A process bundles two foundational architectural illusions:
    • Private Address Space: Simulates private physical memory using hardware page tables.
      • RISC-V page tables translate virtual addresses utilized by instructions into physical addresses on the RAM chip.
      • The layout begins at virtual address zero with instructions, global variables, the stack, and the heap.
      • The address space is bounded by hardware translation limits; xv6 uses 38 bits of addressable space, establishing a maximum virtual address of 2^38 - 1 = 0x3fffffffff (MAXVA).
      • The top pages of the address space are reserved for a trampoline page (managing user/kernel transitions) and a trapframe page (saving user state).
    • Private CPU (Thread): Simulates dedicated processor execution.
      • Each process contains a thread of execution that tracks local variables and return addresses on stacks.
      • A process actively alternates between two stacks: a user stack for user-space computation, and a kernel stack used exclusively during system calls and interrupts.
      • The kernel stack is protected from user-space access to ensure the kernel can execute safely even if the user stack is compromised.
  • Kernel state for each process is centralized in a proc structure, containing references to the process’s page table (p->pagetable), kernel stack (p->kstack), and run state (p->state).
  • During a system call, hardware elevates the privilege level, switches the program counter to the kernel entry point, executes on the kernel stack, and subsequently utilizes the sret instruction to lower privileges and resume the user thread.

Layout of a process’s virtual space:

System initialization

  • Boot Sequence:
    • A boot loader loads the kernel into physical RAM at 0x80000000, placing it above the address range (0x0 to 0x80000000) reserved for memory-mapped I/O devices.
    • The CPU begins in machine mode with virtual address paging disabled.
    • Assembly instructions at _entry allocate an initial stack (stack0) to support C code execution and call start.
    • The start function configures machine-mode settings, sets up timer interrupts, delegates exceptions to supervisor mode, and utilizes the mret instruction to cleanly force a transition into supervisor mode at the main function.
  • Process Creation:
    • main initializes OS devices and subsystems, then explicitly calls userinit to construct the very first process.
    • This initial process runs a minimal assembly program (initcode.S) to execute the exec system call.
    • The kernel handles the system call by replacing the process memory with the /init binary.
    • The /init process opens console file descriptors (0, 1, 2) and launches a shell, yielding a fully operational system.

Page Tables

Page tables are the most popular mechanism through which the operating system provides each process with its own private address space and memory.

Paging Hardware

  • RISC-V instructions manipulate virtual addresses, while the machine’s RAM uses physical addresses.
  • The Sv39 RISC-V architecture utilizes only the bottom 39 bits of a 64-bit virtual address, ignoring the top 25 bits.
  • The page table structure physically maps these addresses:
    • Logically acts as an array of Page Table Entries (PTEs).
    • Each PTE translates a virtual address to a physical address at the granularity of a 4096-byte (2^12 bytes) page.
    • A PTE contains a 44-bit Physical Page Number (PPN) and hardware control flags.
    • The CPU constructs a 56-bit physical address by combining the 44-bit PPN from the PTE with the bottom 12 bits of the original virtual address.
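The PPN-plus-offset construction above can be written out as bit arithmetic. This is a sketch based on the Sv39 PTE layout (10 flag bits below a 44-bit PPN); the function names are hypothetical.

```c
#include <stdint.h>

#define PGSHIFT   12                       /* 4096-byte pages */
#define PGOFFMASK ((1ULL << PGSHIFT) - 1)  /* low 12 bits: offset within page */
#define PPN_MASK  ((1ULL << 44) - 1)       /* 44-bit physical page number */
#define PTE_V     (1ULL << 0)              /* valid flag */

/* Build a PTE (without flags) that points at physical address `pa`. */
static inline uint64_t pa_to_pte(uint64_t pa)
{
    return (pa >> PGSHIFT) << 10;          /* PPN sits above the 10 flag bits */
}

/* Extract the page-aligned physical address stored in a PTE. */
static inline uint64_t pte_to_pa(uint64_t pte)
{
    return ((pte >> 10) & PPN_MASK) << PGSHIFT;
}

/* Combine the PPN from the PTE with the low 12 bits of the virtual address. */
uint64_t translate(uint64_t pte, uint64_t va)
{
    return pte_to_pa(pte) | (va & PGOFFMASK);
}
```

For example, a PTE pointing at physical page 0x80001000 translates virtual address 0x1234 to 0x80001234: the page number comes from the PTE and the offset 0x234 passes through unchanged.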

RISC-V virtual and physical addresses, with a simplified logical page table:

  • Three-level tree implementation:
    • A page table is stored in physical memory as a three-level tree of 4096-byte pages.
    • The root page contains 512 PTEs pointing to intermediate pages, which point to bottom-level pages containing the final physical mappings.
    • The 27-bit virtual page number is split into three 9-bit sections to index into each of the three levels.
    • This tree structure saves physical memory by omitting entirely unmapped intermediate and bottom-level page directories.
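Splitting the 27-bit virtual page number into three 9-bit indices is plain shifting and masking. The sketch below uses a hypothetical function name, vpn_index; level 2 is the root.

```c
#include <stdint.h>

/* Index into the page-table page at `level` (2 = root, 0 = leaf):
 * skip the 12 offset bits, then 9 bits per lower level, keep 9 bits. */
unsigned vpn_index(int level, uint64_t va)
{
    return (unsigned)((va >> (12 + 9 * level)) & 0x1FF);
}
```

A virtual address built as (3 << 30) | (5 << 21) | (7 << 12) selects entry 3 in the root page, entry 5 in the intermediate page, and entry 7 in the bottom-level page.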

RISC-V address translation details:

  • Hardware integration:
    • The Translation Look-aside Buffer (TLB) caches PTEs inside the CPU to eliminate the performance cost of loading PTEs from memory during every address translation.
    • The satp register holds the physical address of the root page-table page, telling the CPU which page table tree to use for the currently executing thread.
  • PTE flags control access permissions:
    • PTE_V: Indicates the PTE is present and valid.
    • PTE_R, PTE_W, PTE_X: Control read, write, and execute permissions, respectively.
    • PTE_U: Allows access by instructions executing in user mode.
    • Note: PTE_U is per-page, not per-page-table; xv6 maps pages like the trapframe in a user process’s page table so trap entry/return code can access them, but clears PTE_U so user-mode code cannot read or modify kernel-owned state.

Kernel Address Space

  • xv6 uses one page table per process for user space and one shared page table for the kernel.
  • The kernel page table gives predictable virtual addresses for RAM and memory-mapped devices.

On the left, xv6’s kernel address space. RWX refer to PTE read, write, and execute permissions. On the right, the RISC-V physical address space that xv6 expects to see:

  • Direct mapping architecture:
    • Most physical memory and device registers are mapped at virtual addresses exactly equal to their physical addresses.
    • The kernel binary is located at KERNBASE (0x80000000) in both virtual and physical memory spaces.
    • In QEMU, RAM starts at 0x80000000 and extends at least to 0x88000000 (PHYSTOP, defined in xv6 as KERNBASE plus 128 MB).
    • Memory-mapped device registers sit below 0x80000000 in physical address space.
    • Direct mapping lets the kernel use physical addresses directly, which simplifies operations such as copying pages during fork.
  • Exceptions to direct mapping:
    • Trampoline page: mapped twice, once via direct mapping and once at the top of the virtual address space.
    • Kernel stacks: each process has a private kernel stack mapped high in memory.
    • An unmapped guard page below each kernel stack catches overflow (PTE_V clear).
  • Kernel-space permissions:
    • The trampoline page and kernel text are mapped with PTE_R | PTE_X.
    • Other kernel memory is mapped with PTE_R | PTE_W.
    • Guard pages are invalid.

Page Table Management Code

  • The central data structure for software page table manipulation is pagetable_t, a C pointer to a RISC-V root page-table page.
  • Core virtual memory lookup functions:
    • walk: Mimics the hardware’s 3-level traversal, using 9 bits at a time to descend the tree and return the address of the lowest-level PTE. It can dynamically allocate intermediate pages if requested during the traversal.
    • mappages: Installs PTEs for a virtual-to-physical address range by calling walk for each page interval and configuring the PPN and permission flags.
  • Kernel initialization routines:
    • kvminit creates the kernel page table during early boot, mapping the kernel instructions, data, physical memory up to PHYSTOP, and device memory.
    • kvminithart writes the root page table physical address into the CPU’s satp register to enable hardware address translation.
    • The sfence.vma instruction is executed immediately after satp is modified to flush the CPU’s TLB, preventing stale cached mappings from causing invalid memory accesses.
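The walk and mappages logic can be approximated in user space. The sketch below is a simulation under stated assumptions, not xv6’s code: "physical memory" is a small static array addressed from a made-up base RAMBASE, page tables are identified by their simulated physical address rather than by a pagetable_t pointer, and map_page handles a single page where mappages handles a range.

```c
#include <stddef.h>
#include <stdint.h>

#define PGSIZE  4096ULL
#define NPAGES  16
#define RAMBASE 0x80000000ULL            /* simulated physical base (assumption) */

#define PTE_V (1ULL << 0)
#define PTE_R (1ULL << 1)
#define PTE_W (1ULL << 2)

#define PX(level, va) (((va) >> (12 + 9 * (level))) & 0x1FF)
#define PA2PTE(pa)    (((pa) >> 12) << 10)
#define PTE2PA(pte)   (((pte) >> 10) << 12)

typedef uint64_t pte_t;

static pte_t ram[NPAGES][512];           /* simulated RAM: 512 PTEs per page */
static int next_free;

/* Hand out the next zeroed page; returns its simulated physical address. */
static uint64_t alloc_page(void)
{
    return next_free < NPAGES ? RAMBASE + PGSIZE * (uint64_t)next_free++ : 0;
}

static pte_t *page_ptr(uint64_t pa)
{
    return ram[(pa - RAMBASE) / PGSIZE];
}

/* Descend two levels, 9 bits at a time; return the leaf PTE's address. */
pte_t *walk(uint64_t root_pa, uint64_t va, int alloc)
{
    pte_t *table = page_ptr(root_pa);
    for (int level = 2; level > 0; level--) {
        pte_t *pte = &table[PX(level, va)];
        if (!(*pte & PTE_V)) {
            uint64_t pa;
            if (!alloc || (pa = alloc_page()) == 0)
                return NULL;             /* missing level, not allowed to create it */
            *pte = PA2PTE(pa) | PTE_V;
        }
        table = page_ptr(PTE2PA(*pte));
    }
    return &table[PX(0, va)];
}

/* Install one mapping; fails if the page is already mapped. */
int map_page(uint64_t root_pa, uint64_t va, uint64_t pa, uint64_t perm)
{
    pte_t *pte = walk(root_pa, va, 1);
    if (pte == NULL || (*pte & PTE_V))
        return -1;
    *pte = PA2PTE(pa) | perm | PTE_V;
    return 0;
}

/* Translate va, or return 0 if it is unmapped. */
uint64_t lookup(uint64_t root_pa, uint64_t va)
{
    pte_t *pte = walk(root_pa, va, 0);
    if (pte == NULL || !(*pte & PTE_V))
        return 0;
    return PTE2PA(*pte) | (va & (PGSIZE - 1));
}
```

Mapping virtual page 0x1000 consumes two extra pages for the intermediate and bottom levels, while a lookup in a region with no root entry fails at the first level without allocating anything, which is the memory saving the tree structure provides.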

Physical Memory Allocation

  • The kernel manages physical memory between the end of the kernel binary and PHYSTOP as a global pool for run-time allocation.
  • Memory is allocated and freed strictly in 4096-byte page increments.
  • Free pages are tracked using a linked list threaded directly through the available memory pages themselves.
  • Allocator implementation:
    • Each free page stores a struct run structure containing a pointer to the next free page.
    • The kfree function fills freed memory with the garbage value 1 to expose dangling references quickly, then prepends the page to the free list.
    • The kalloc function removes and returns the first element from the free list when memory is requested.
    • The free list structure is protected by a spin lock to handle concurrent allocation requests across multiple CPUs.
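The free-list design above fits in a few lines. The following is a single-threaded user-space sketch: the arena size and the kinit name for the setup routine are assumptions for illustration, and the spin lock that protects the real list is omitted.

```c
#include <stddef.h>
#include <string.h>

#define PGSIZE 4096
#define NPAGES 4                          /* tiny arena for illustration */

struct run {
    struct run *next;                     /* stored inside the free page itself */
};

/* The union keeps each page suitably aligned for struct run. */
static union { struct run r; unsigned char bytes[PGSIZE]; } arena[NPAGES];
static struct run *freelist;

/* Return a page to the pool. */
void kfree(void *pa)
{
    memset(pa, 1, PGSIZE);                /* junk fill exposes dangling references */
    struct run *r = pa;
    r->next = freelist;                   /* prepend to the free list */
    freelist = r;
}

/* Take the first page off the free list, or NULL if none remain. */
void *kalloc(void)
{
    struct run *r = freelist;
    if (r != NULL)
        freelist = r->next;
    return r;
}

/* Build the initial free list by freeing every page in the arena. */
void kinit(void)
{
    freelist = NULL;
    for (int i = 0; i < NPAGES; i++)
        kfree(arena[i].bytes);
}
```

Because freed pages are pushed on the front, the most recently freed page is the next one allocated, and the allocator needs no memory beyond the pages it manages.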

Process Address Space

  • Each process possesses an independent page table, dictating a private address space that maps contiguous virtual addresses starting at zero to potentially non-contiguous physical pages.
  • Address space layout:
    • Grows upwards to MAXVA, addressing up to 256 Gigabytes of virtual memory.
    • Ordered sequentially from zero: user instructions, global variables, user stack, and an expandable heap.
    • The trampoline page is mapped at the top of the user address space to facilitate kernel transitions.
    • An inaccessible guard page (PTE_U flag cleared) sits directly below the user stack to catch stack overflows via hardware page-fault exceptions.
  • Dynamic memory allocation (sbrk):
    • The sbrk system call shrinks or grows a process’s memory.
    • growproc invokes uvmalloc to acquire new physical pages via kalloc and maps them using mappages.
    • uvmdealloc removes memory by calling uvmunmap, which utilizes walk to locate PTEs and passes the associated physical addresses back to kfree.
    • The user page table serves as the definitive kernel record of which physical pages are allocated to a process.

A process’s user address space, with its initial stack:

User virtual memory looks contiguous, but physical pages may be scattered anywhere in RAM.

Virtual region    Physical backing
text              Physical RAM pages allocated from the free-memory pool, then filled from the ELF file.
data              Physical RAM pages allocated from the free-memory pool.
heap              Physical RAM pages allocated from the free-memory pool as the heap grows.
stack             Physical RAM pages allocated from the free-memory pool.
guard page        No physical page; the mapping is left invalid.
trapframe         One per-process physical page allocated from the free-memory pool.
trampoline        One shared kernel code page, mapped into every process.

ELF Binary Loading

  • The exec system call replaces an address space’s existing memory image with a new executable stored in the Executable and Linkable Format (ELF).
  • Initialization and parsing steps:
    • Validates the file via a 4-byte magic number (0x7F 'E' 'L' 'F').
    • Allocates a blank page table via proc_pagetable.
    • Parses ELF program section headers (struct proghdr) to determine memory sizing and block alignments.
    • Allocates contiguous virtual memory per segment with uvmalloc and populates the pages directly from the file via loadseg.
  • Stack setup:
    • Allocates a single stack page and a protective inaccessible guard page.
    • Copies command-line argument strings and pointers to the top of the newly allocated stack, preparing argc and argv for the program’s main function.
  • Security and commitment:
    • Verifies that segment virtual addresses and sizes do not mathematically overflow a 64-bit integer, preventing malicious binaries from tricking the kernel into mapping data over kernel space.
    • Retains the old address space until the entire new image is successfully built. If an error occurs during parsing or allocation, the partial new image is freed and exec returns an error, safely preserving the original process state.
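The validation steps above can be sketched as two small checks. The function names are hypothetical; the memsz/filesz comparison is an additional sanity check of the kind xv6’s exec performs and is included here as an assumption about what a careful loader would reject.

```c
#include <stdint.h>

/* True if the first four bytes of the file are 0x7F 'E' 'L' 'F'. */
int elf_magic_ok(const unsigned char hdr[4])
{
    return hdr[0] == 0x7F && hdr[1] == 'E' && hdr[2] == 'L' && hdr[3] == 'F';
}

/* True if a program segment's size fields are plausible. */
int segment_ok(uint64_t vaddr, uint64_t memsz, uint64_t filesz)
{
    if (memsz < filesz)
        return 0;               /* file data would not fit in the segment */
    if (vaddr + memsz < vaddr)
        return 0;               /* unsigned sum wrapped past 2^64 */
    return 1;
}
```

The wraparound test relies on unsigned arithmetic being modular in C: if the sum is smaller than one of its operands, the true end address did not fit in 64 bits.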

Traps

Three distinct events force a CPU to suspend ordinary instruction execution and transfer control to specialized handler code:

  • system calls initiated by the ecall instruction,
  • exceptions triggered by illegal operations (such as division by zero or invalid virtual addresses), and
  • device interrupts signaling hardware needs.

These events, collectively referred to as traps, must be handled transparently so the interrupted code can resume without disruption. Complete isolation is maintained by handling all traps exclusively in kernel space. The trap handling lifecycle consists of four stages:

  1. hardware actions by the RISC-V CPU,
  2. assembly instructions to save state,
  3. a C function to determine the trap’s cause, and
  4. the specific service routine.

RISC-V Trap Machinery

The RISC-V hardware dictates trap behavior through supervisor-mode control registers, which are inaccessible to user mode:

  • stvec: Stores the memory address of the kernel’s trap handler (virtual address).
  • sepc: Captures the program counter at the exact moment the trap occurs. The sret instruction later copies this value back to the program counter to resume execution.
  • scause: Stores a numeric code indicating the reason for the trap.
  • sscratch: Provides temporary storage crucial for the very first instructions of the trap handler.
  • sstatus: Contains the SIE bit, which controls whether device interrupts are enabled (while it is clear, they are deferred until the kernel sets it again), and the SPP bit, which records whether the trap originated in user or supervisor mode.

There is a similar set of control registers for traps handled in machine mode; xv6 uses them only for the special case of timer interrupts.

Each CPU on a multi-core chip has its own set of these registers, and more than one CPU may be handling a trap at any given time.

Note: In xv6, timer interrupts first enter machine mode and are then forwarded to supervisor mode as a software interrupt.

When forcing any other trap, the hardware executes a strict sequence of operations:

  1. Skips all of the following steps if the trap is a device interrupt and the SIE bit in sstatus is clear.
  2. Disables further interrupts by clearing SIE.
  3. Copies the current program counter to sepc.
  4. Saves the current execution mode into the SPP bit.
  5. Writes the trap cause into scause.
  6. Elevates the execution mode to supervisor mode.
  7. Copies the handler address from stvec to the program counter.
  8. Resumes execution at the new instruction address.

The CPU intentionally minimizes its hardware operations; it does not switch page tables, switch to a kernel stack, or save any register other than the program counter. Leaving that work to kernel software preserves flexibility. The steps the hardware does perform are essential for isolation: because execution always continues at the kernel-specified address in stvec, a user program cannot choose where supervisor-mode execution begins.
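The hardware sequence above can be modeled as a small state machine. This is a simplified sketch, not a faithful RISC-V model: the struct and function names are hypothetical, the cause is passed as a raw number, and the SPIE bit (which real hardware uses to remember the old SIE value) is left out.

```c
#include <stdint.h>

enum mode { MODE_U = 0, MODE_S = 1 };

#define SSTATUS_SIE (1ULL << 1)   /* supervisor interrupt enable */
#define SSTATUS_SPP (1ULL << 8)   /* previous mode: 0 = user, 1 = supervisor */

struct hart {
    uint64_t pc, sepc, scause, stvec, sstatus;
    enum mode mode;
};

/* Model of trap entry. Returns 1 if the trap was taken, 0 if a device
 * interrupt was deferred because SIE is clear. */
int trap_enter(struct hart *h, uint64_t cause, int is_device_intr)
{
    if (is_device_intr && !(h->sstatus & SSTATUS_SIE))
        return 0;                          /* step 1: do nothing further */
    h->sstatus &= ~SSTATUS_SIE;            /* step 2: disable interrupts */
    h->sepc = h->pc;                       /* step 3: save pc */
    if (h->mode == MODE_S)                 /* step 4: save current mode */
        h->sstatus |= SSTATUS_SPP;
    else
        h->sstatus &= ~SSTATUS_SPP;
    h->scause = cause;                     /* step 5: record cause */
    h->mode = MODE_S;                      /* step 6: enter supervisor mode */
    h->pc = h->stvec;                      /* steps 7-8: continue at handler */
    return 1;
}

/* Model of sret: return to the saved mode and program counter. */
void trap_return(struct hart *h)
{
    h->mode = (h->sstatus & SSTATUS_SPP) ? MODE_S : MODE_U;
    h->pc = h->sepc;
}
```

Note what the model never touches: there is no page-table register, stack pointer, or general-purpose register in struct hart, which mirrors how little the hardware does on its own.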

Traps from User Space

Xv6 handles traps differently depending on whether they come from user space or kernel space.

From user space, a trap may be caused by:

  • ecall,
  • an exception,
  • or a device interrupt.

The path is uservec -> usertrap -> usertrapret -> userret.

When a trap occurs in user space, the active page table is still the user page table, since RISC-V does not switch page tables on trap entry. Thus:

  • the trap handler address in stvec must have a valid mapping in the user page table.
  • xv6’s trap handling code needs to switch to the kernel page table.
  • in order to be able to continue executing after that switch, the kernel page table must also have a mapping for the handler pointed to by stvec.
  • xv6 satisfies these requirements using a trampoline page.
  • xv6 sets stvec to uservec on the trampoline page, mapped at TRAMPOLINE in both the user and kernel page tables.
  • The trampoline mapping is identical in both page tables, so trap handling can continue after switching satp.

The user space trap sequence flows through four primary stages:

  • Assembly Entry (uservec):
    • Before returning to user space, the kernel stores the process’s TRAPFRAME address in sscratch.
    • Because all 32 general-purpose registers belong to the interrupted user code, uservec starts by executing csrrw to swap a0 with sscratch.
    • a0 now holds a pointer to the process’s trapframe, mapped at TRAPFRAME just below TRAMPOLINE.
    • uservec saves all 32 user registers into the trapframe, which has space reserved for them.
    • The kernel also keeps a physical pointer to the same page in p->trapframe.
    • It extracts the kernel stack pointer, hartid, usertrap function address, and kernel page table address from the trapframe.
    • It updates satp to the kernel page table and jumps to the usertrap C function.
  • C Handler (usertrap):
    • Updates stvec to point to kernelvec, ensuring that any traps occurring during kernel execution are routed correctly.
    • Trap entry clears SIE, but xv6 later re-enables interrupts in selected kernel paths, especially before running syscall code, so device and timer interrupts can still be handled while the kernel is executing.
    • Saves sepc into the trapframe, since usertrap may yield and another process may run before this one resumes.
    • Identifies the trap cause and routes it:
      • invokes syscall for system calls,
      • devintr for device interrupts, or
      • kills the process for illegal exceptions (in basic xv6, all user page faults are treated as illegal exceptions).
    • If handling a system call, it increments the saved sepc by 4, ensuring the process resumes at the instruction immediately following the ecall.
    • On the way out, usertrap checks whether the process was killed and yields on a timer interrupt.
  • C Return Preparation (usertrapret):
    • Prepares the control registers for a future user trap by pointing stvec back to uservec.
    • Populates the trapframe fields required by uservec and sets sepc to the saved user program counter.
    • Calls userret on the trampoline page, passing the TRAPFRAME address and the user page table pointer.
  • Assembly Exit (userret):
    • Switches satp back to the user page table.
    • After that switch, it can rely only on registers and the trapframe, since ordinary kernel mappings are gone.
    • Restores the 32 user registers from the trapframe, performs a final swap of a0 and sscratch to restore the user’s a0, and executes sret to re-enter user mode.

The most common deliberate trap from user space is a system call, which relies on the trapframe to carry the system call number and arguments into the kernel and the return value back out.

System Call Mechanisms

User programs initiate system calls by:

  • placing arguments into specific registers (e.g., a0, a1),
  • placing the system call number into a7, and
  • executing ecall.

Argument passing follows the RISC-V C calling convention, so system call arguments start out in registers.

Once the trap mechanism hands control to the syscall function, the kernel uses the saved a7 value to index into the syscalls array, which acts as a dispatch table mapping numbers to implementation functions.

Upon completion, the system call’s return value is written to p->trapframe->a0, overwriting the first argument so the user code receives the result. By convention, negative numbers indicate errors, while zero or positive numbers indicate success.
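The dispatch step can be sketched with a cut-down trapframe and a function-pointer table. Everything named here is hypothetical: the two system calls, their numbers, and the name dispatch_syscall (chosen to avoid clashing with the C library’s own syscall function) exist only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Only the saved registers this sketch needs. */
struct trapframe {
    uint64_t a0, a1, a7;
};

/* Hypothetical system call numbers and handlers. */
enum { SYS_add = 1, SYS_answer = 2 };

static uint64_t sys_add(struct trapframe *tf)
{
    return tf->a0 + tf->a1;               /* arguments arrive in a0 and a1 */
}

static uint64_t sys_answer(struct trapframe *tf)
{
    (void)tf;
    return 42;
}

/* Dispatch table mapping system call numbers to implementations. */
static uint64_t (*syscalls[])(struct trapframe *) = {
    [SYS_add]    = sys_add,
    [SYS_answer] = sys_answer,
};

/* Look up the handler named by a7 and store its result in a0. */
void dispatch_syscall(struct trapframe *tf)
{
    uint64_t num = tf->a7;
    size_t n = sizeof syscalls / sizeof syscalls[0];
    if (num > 0 && num < n && syscalls[num] != NULL)
        tf->a0 = syscalls[num](tf);       /* return value overwrites first argument */
    else
        tf->a0 = (uint64_t)-1;            /* unknown number: report an error */
}
```

Bounds-checking the number before indexing matters because a7 is fully controlled by user code.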

System calls must frequently access arguments and memory provided by the user process:

  • The functions argint, argaddr, and argfd extract integers, pointers, and file descriptors from the saved registers in the trapframe; they use argraw to read the raw saved register.
  • Pointer arguments create two problems: they may be invalid or malicious, and they refer to user virtual addresses, not kernel mappings.
  • The kernel uses functions like fetchstr and copyinstr to safely read string data from user space.
  • copyinstr walks the target process’s page table, which is not the current page table.
  • walkaddr translates the user virtual address to a physical address and checks that it belongs to user memory.
  • After translation, direct mapping lets the kernel copy bytes using the corresponding kernel virtual address.
  • copyout performs the reverse direction, copying data from kernel space to a user address.

While the complex trampoline mechanism safely handles transitions from user space, traps that occur while already executing inside the kernel require a much simpler control flow.

Traps from Kernel Space

When the CPU is executing kernel code, stvec points directly to the kernelvec assembly code. Because the trap originates in supervisor mode, the satp register is already pointing to the kernel page table, and the stack pointer is already set to a valid kernel stack.

  • kernelvec pushes all 32 registers directly onto the current kernel stack, safely preserving the state of the interrupted kernel thread.
  • Execution jumps to the kerneltrap C function.
  • kerneltrap handles two kinds of trap:
    • device interrupts, which it passes to devintr, and
    • exceptions, which trigger a kernel panic because a kernel exception is always a fatal error.
  • If the trap is a timer interrupt and a process thread is active, kerneltrap invokes yield to allow other threads CPU time.
  • Because yield may switch threads and overwrite sepc and sstatus, kerneltrap saves both registers when it starts and restores them before returning.
  • Control returns to kernelvec, which pops the registers off the stack and executes sret to resume the interrupted kernel code.

Page-Fault Exceptions

xv6’s default response is simple: a user-space exception kills the process, while a kernel-space exception panics the kernel.

Page faults occur when:

  • an instruction uses a virtual address that has no mapping in the page table,
  • PTE_V is clear, or
  • the access violates permissions such as PTE_R, PTE_W, PTE_X, or PTE_U.

RISC-V distinguishes instruction, load, and store page faults. scause records the type, and stval records the faulting virtual address.

Real kernels use page faults more aggressively.

Copy-on-write (COW) fork: parent and child initially share physical pages as read-only. A write causes a store page fault; the kernel allocates a new page, copies the old contents, updates the faulting PTE to a private writable page, and resumes. Reference counting decides when shared pages can be freed and avoids copying when a page is no longer shared. This makes fork much cheaper, especially for fork followed by exec.
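The copy-on-write bookkeeping can be illustrated with a toy model. This is a sketch under heavy simplification, not kernel code: pages are 64 bytes, a "PTE" is a plain struct, faults are detected by an explicit check in store, and all names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define PGSIZE 64                         /* tiny page for illustration */

struct page {
    int refcnt;                           /* how many PTEs point here */
    unsigned char data[PGSIZE];
};

struct pte {
    struct page *pg;
    int writable;
};

/* fork: the child shares the parent's page and both lose write access. */
void cow_fork(struct pte *parent, struct pte *child)
{
    parent->writable = 0;
    *child = *parent;
    parent->pg->refcnt++;
}

/* Store-fault handler: copy only if the page is still shared.
 * Returns 0 on success, -1 if no memory is available. */
int cow_fault(struct pte *pte)
{
    struct page *old = pte->pg;
    if (old->refcnt > 1) {
        struct page *copy = malloc(sizeof *copy);
        if (copy == NULL)
            return -1;
        memcpy(copy->data, old->data, PGSIZE);
        copy->refcnt = 1;
        old->refcnt--;
        pte->pg = copy;
    }
    pte->writable = 1;                    /* page is now private */
    return 0;
}

/* Store one byte, taking a simulated fault if the PTE is read-only. */
int store(struct pte *pte, int off, unsigned char v)
{
    if (!pte->writable && cow_fault(pte) < 0)
        return -1;
    pte->pg->data[off] = v;
    return 0;
}
```

The first writer after a fork pays for one page copy; the remaining sharer then finds a reference count of 1 and simply regains write access without copying.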

Lazy allocation: sbrk grows the process size without immediately allocating pages or PTEs. The first access faults, and the kernel allocates and maps the page then. This avoids work for unused pages and spreads allocation cost over time.

Demand paging: exec can install invalid PTEs first and load code/data from disk only on fault. This reduces startup latency for large programs.

Paging to disk: when RAM is scarce, the kernel can evict pages to disk, mark their PTEs invalid, and page them back in on fault. If RAM is full, paging in one page may require evicting another. Paging works best when programs have good locality of reference.

Other page-fault uses include automatic stack growth and memory-mapped files.

Real-World Context

  • The trampoline and trapframe exist because RISC-V does very little on trap entry: it does not switch page tables, save general registers, or identify the current process for the kernel.
  • Thus the first trap-entry instructions must run in supervisor mode but still under the user page table, with user register contents still live.
  • xv6 relies on two protected handoff mechanisms:
    • sscratch to stash the trapframe pointer
    • user-page-table mappings to kernel-owned memory without PTE_U, so user code cannot access them
  • A faster alternative is to map kernel memory into every user page table. That removes the trampoline requirement, avoids switching page tables on user traps, and lets kernel code directly dereference user pointers.
  • Many real systems use that style for efficiency, but xv6 avoids it to reduce security risk from accidental user-pointer use and to avoid extra address-space-overlap complexity.
  • Real kernels also implement COW fork, lazy allocation, demand paging, paging to disk, and memory-mapped files, and they try to keep nearly all physical memory in use for applications or caches.
  • xv6 is intentionally simpler: if memory runs out, it usually returns an error or kills a process instead of reclaiming memory by evicting another page.

Trap Catalogue

From the ISA perspective:

User syscall

Number: Exception 8
Cause: Environment call from U-mode.

A user program executes ecall, RISC-V records scause = 8, and xv6’s usertrap() recognizes it as a syscall. xv6 then calls syscall() and advances the saved sepc by 4 so the process resumes after the ecall instruction.
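
The syscall case can be sketched as follows. This is a simplified model, not xv6's usertrap(): the structures, the sim_ names, and the dummy syscall body are hypothetical, and interrupts are left out entirely. It relies on the RISC-V convention that xv6 follows, with the syscall number in a7 and the result in a0.

```c
#include <stdint.h>

#define SCAUSE_ECALL_U 8   /* environment call from U-mode */

/* Hypothetical, trimmed-down trapframe and process state. */
struct sim_trapframe { uint64_t epc, a0, a7; };
struct sim_proc { struct sim_trapframe tf; int killed; };

/* Dummy syscall implementation so the example is self-contained. */
static uint64_t sim_syscall(uint64_t num) { return num + 100; }

static void sim_usertrap(struct sim_proc *p, uint64_t scause, uint64_t sepc) {
    p->tf.epc = sepc;                       /* save the user program counter */

    if (scause == SCAUSE_ECALL_U) {
        /* sepc points at the ecall itself; ecall is a 4-byte instruction,
           so resume at the instruction after it. */
        p->tf.epc += 4;
        p->tf.a0 = sim_syscall(p->tf.a7);   /* number in a7, result in a0 */
    } else {
        /* Any other exception from user mode: mark the process killed. */
        p->killed = 1;
    }
}
```

Without the increment, returning to user space would re-execute the same ecall indefinitely.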

User-mode mistakes

Numbers: Exceptions 0–7, 12, 13, 15
Causes:

  • misaligned instruction or data address,
  • access fault,
  • illegal instruction,
  • breakpoint,
  • instruction page fault,
  • load page fault,
  • store page fault.

These are traps caused by bad or unsupported user behavior. xv6 does not try to recover from these in the basic kernel. usertrap() prints diagnostic information and marks the process as killed.

Kernel-mode mistakes

Numbers: Exceptions 0–7, 12, 13, 15 (and 9)
Causes:

  • same kinds of exceptions while xv6 is already running in supervisor mode.
  • exception 9 is ecall from S-mode.

These are treated much more seriously. xv6 considers any of them a kernel bug, so kerneltrap() panics instead of killing only one process.

Supervisor external interrupts

Number: Interrupt 9
Cause: Supervisor external interrupt.

These are device interrupts, from devices such as the UART and virtio disk, delivered to the S-mode kernel through the PLIC. usertrap() or kerneltrap() calls devintr(), which identifies and handles the device.

Machine timer interrupt (and Supervisor software interrupt)

Numbers: Interrupt 7, then interrupt 1
Causes:

  • machine timer interrupt, then
  • supervisor software interrupt.

This is xv6’s special timer path. The physical timer first causes a machine timer interrupt. xv6’s machine-mode timer code programs the next timer event and forwards the event into supervisor mode using a supervisor software interrupt. Then devintr() handles it as a clock interrupt.
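
A rough sketch of how supervisor-mode code can tell these interrupts apart from the scause value is shown below. The function and constant names are hypothetical; the 0/1/2 return convention is modeled on the one devintr() uses, and the real function also claims and completes the interrupt at the PLIC, which is omitted here.

```c
#include <stdint.h>

/* On RV64 the top bit of scause is set for interrupts, clear for exceptions. */
#define SCAUSE_INTERRUPT (UINT64_C(1) << 63)

enum {
    TRAP_UNRECOGNIZED = 0,  /* not an interrupt this code handles */
    TRAP_DEVICE       = 1,  /* supervisor external interrupt (via the PLIC) */
    TRAP_TIMER        = 2   /* supervisor software interrupt forwarded by M-mode */
};

static int classify_interrupt(uint64_t scause) {
    if (!(scause & SCAUSE_INTERRUPT))
        return TRAP_UNRECOGNIZED;           /* an exception, not an interrupt */

    uint64_t code = scause & ~SCAUSE_INTERRUPT;
    if (code == 9)
        return TRAP_DEVICE;                 /* interrupt 9: supervisor external */
    if (code == 1)
        return TRAP_TIMER;                  /* interrupt 1: supervisor software */
    return TRAP_UNRECOGNIZED;
}
```

Note that exception 9 (ecall from S-mode) and interrupt 9 (supervisor external) share a code and differ only in the top bit, so the interrupt bit must be checked first.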

Supervisor timer interrupt

Number: Interrupt 5
Cause: Supervisor timer interrupt.

This cause exists architecturally in RISC-V as a timer interrupt intended directly for S-mode. However, stock xv6’s timer path does not rely on it as the primary event; xv6 takes the machine timer interrupt and forwards it to S-mode.

Machine-level interrupts

Numbers: Interrupts 3, 11
Causes:

  • machine software interrupt
  • machine external interrupt.

These are real RISC-V interrupt causes, but they target machine mode. Normal xv6 kernel trap code runs in supervisor mode, so usertrap(), kerneltrap(), and devintr() do not normally receive these as ordinary xv6 traps. They belong to M-mode firmware or machine-mode runtime code.

Custom interrupts

Numbers: Interrupt 13, 16+
Causes:

  • Counter-overflow interrupt,
  • platform or custom interrupts.

These are for performance counters or platform-specific interrupt sources. Stock xv6 does not use them. A more advanced OS could use counter overflow for profiling, but xv6 keeps interrupt handling minimal.

Custom exceptions

Numbers: Exceptions 10, 14, 16–19, 20+
Causes:

  • reserved,
  • custom,
  • double trap,
  • software check,
  • hardware error.

These are not part of the normal xv6 teaching path. If one somehow occurs from user mode, xv6 would generally treat it as an unexpected user exception and kill the process. If it occurs in kernel mode, xv6 would panic.

Interrupts and Device Drivers

Driver Architecture

Drivers manage specific hardware devices by configuring hardware, initiating operations, handling resulting interrupts, and interacting with waiting processes. Device interrupts are a class of traps routed through the kernel’s trap handling logic (e.g., devintr). Driver execution is structured into two concurrent contexts:

  • Top half: Executes within a process’s kernel thread via system calls (e.g., read, write). Asks the hardware to initiate operations and yields the CPU to wait for completion.
  • Bottom half: Executes asynchronously at interrupt time. Identifies completed operations, wakes waiting processes, and issues the next pending hardware command.

Separating device management into process-driven top halves and asynchronous bottom halves provides the architectural foundation for handling unpredictable external events, such as console input.

Initialization

xv6 configures the console/UART once during boot so later console input can be handled by interrupts instead of polling.

  • Purpose: xv6 uses the UART as the console device for keyboard input and terminal output.
  • QEMU setup: In QEMU, the UART is simulated. The keyboard/display are connected to xv6 through QEMU’s emulated UART.
  • UART hardware model: xv6 talks to a 16550 UART chip, emulated by QEMU.
  • Memory-mapped I/O: UART registers are exposed at physical addresses, not normal RAM.
  • UART base address: UART0 = 0x10000000
  • Control registers: UART registers are byte-wide and accessed using offsets from UART0.
  • Receive side:
    • UART stores received input bytes in an internal FIFO.
    • LSR register tells whether input is ready.
    • RHR register is used to read received bytes.
    • Reading from RHR removes that byte from the UART FIFO.
  • Transmit side:
    • Software writes output bytes to THR.
    • UART sends those bytes to the terminal/display.
  • Initialization entry point:
    • main() calls consoleinit().
  • What consoleinit() does:
    • Initializes the console lock.
    • Connects console read/write functions to xv6’s device switch table.
    • Calls uartinit() to initialize UART hardware.
  • What uartinit() configures:
    • Enables UART receive interrupts.
    • Enables UART transmit-complete interrupts.
    • Sets up UART so xv6 is notified when input arrives or output is ready for more bytes.
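
The register-level view above can be sketched in C. To keep the example runnable in user space, a byte array stands in for the memory-mapped registers at 0x10000000; the sim_ names are hypothetical, and real initialization also configures settings such as the baud rate and FIFOs, which are omitted here.

```c
#include <stdint.h>

/* 16550 register offsets from the UART base address. */
#define RHR 0                    /* receive holding register (read) */
#define THR 0                    /* transmit holding register (write) */
#define IER 1                    /* interrupt enable register */
#define LSR 5                    /* line status register */

#define IER_RX_ENABLE (1 << 0)   /* interrupt when input arrives */
#define IER_TX_ENABLE (1 << 1)   /* interrupt when ready for more output */
#define LSR_RX_READY  (1 << 0)   /* input is waiting in RHR */

/* Simulated register file. On xv6 under QEMU, UART0 would instead be
   the physical address 0x10000000. */
static volatile uint8_t sim_uart[8];
#define UART0 ((volatile uint8_t *)sim_uart)

/* Byte-wide accesses at an offset from the base. */
#define ReadReg(reg)     (UART0[(reg)])
#define WriteReg(reg, v) (UART0[(reg)] = (uint8_t)(v))

/* Enable receive and transmit-complete interrupts. */
static void sim_uartinit(void) {
    WriteReg(IER, IER_RX_ENABLE | IER_TX_ENABLE);
}

/* Return one input byte, or -1 if none is waiting. On real hardware,
   reading RHR also removes the byte from the FIFO; the array does not. */
static int sim_uartgetc(void) {
    if (ReadReg(LSR) & LSR_RX_READY)
        return ReadReg(RHR);
    return -1;
}
```

The volatile qualifier matters on real hardware: it stops the compiler from caching or eliminating register accesses that have side effects in the device.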

Input:

Output:

Concurrency

Driver data structures face three distinct sources of concurrency, each of which requires lock protection:

  • Simultaneous execution of top-half routines by multiple processes on different CPUs.
  • Hardware interrupting a CPU while it is mid-execution in a top-half routine.
  • Hardware delivering an interrupt on a secondary CPU concurrently with top-half execution on a primary CPU.

Interrupt delivery creates a separate constraint: the process waiting for a device may not be the process running when the interrupt arrives, and there may be no current user process at all. Thus interrupt handlers cannot assume the interrupted process is the one waiting for the device or safely use its page table (e.g., copyout with the current process). Bottom halves therefore do minimal work: copy data into a kernel buffer, update device state, and wake the top half to finish the rest.
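
The division of labor between the two halves can be sketched with a circular input buffer. This is a single-threaded simulation under stated assumptions: the struct and function names are hypothetical, locking is omitted, and a counter stands in for the wakeup of a sleeping reader.

```c
#include <stddef.h>

#define INPUT_BUF_SIZE 128

/* Hypothetical console state shared by the two halves. A real driver
   would guard it with a lock. */
struct sim_console {
    char buf[INPUT_BUF_SIZE];
    unsigned r;      /* read index, advanced by the top half */
    unsigned w;      /* write index, advanced by the bottom half */
    int wakeups;     /* how many times a waiting reader would be woken */
};

/* Bottom half: runs at interrupt time. It only stores the byte in a
   kernel buffer and signals the reader; it never touches user memory. */
static void sim_consoleintr(struct sim_console *c, char ch) {
    if (c->w - c->r < INPUT_BUF_SIZE) {          /* drop input if full */
        c->buf[c->w++ % INPUT_BUF_SIZE] = ch;
        if (ch == '\n')
            c->wakeups++;                        /* a whole line is ready */
    }
}

/* Top half: runs in the reading process's kernel thread, so copying to
   that process's buffer is safe here. Returns the bytes copied; 0 means
   the real driver would sleep until the bottom half wakes it. */
static size_t sim_consoleread(struct sim_console *c, char *dst, size_t n) {
    size_t copied = 0;
    while (copied < n && c->r != c->w) {
        char ch = c->buf[c->r++ % INPUT_BUF_SIZE];
        dst[copied++] = ch;
        if (ch == '\n')
            break;                               /* return one line at a time */
    }
    return copied;
}
```

The copy to the destination buffer happens only in the top half because that is the one context where the current process is known to be the reader.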

Timer Interrupts

  • Periodic timer interrupts drive the system clock and enforce preemptive thread scheduling via yield.
  • xv6 uses them to maintain time and to switch among compute-bound processes; the yield calls in usertrap and kerneltrap perform that switching.
  • RISC-V requires timer interrupts to trap into machine mode, not supervisor mode. Machine mode runs without paging and uses separate control registers, so xv6 handles timer interrupts separately from the ordinary supervisor trap path.
  • start.c, before main, sets up timer delivery:
    • programs the CLINT to interrupt again after a fixed delay,
    • prepares a scratch area analogous to a trapframe,
    • sets mtvec to timervec, and
    • enables machine-mode timer interrupts.
  • A timer interrupt may arrive while either user or kernel code is running, so the machine-mode handler must avoid disturbing the interrupted supervisor code.
  • timervec therefore does only minimal assembly work:
    • saves a few registers in the scratch area,
    • programs the next timer interrupt,
    • raises a supervisor software interrupt,
    • restores registers, and
    • returns.
  • The forwarded software interrupt is then handled in supervisor mode through the normal trap path (devintr). There is no C code in the machine-mode timer handler.
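
Although the real handler is written in assembly, its logic can be modeled in C for illustration. The sketch below assumes a hypothetical scratch structure holding a pointer to the timer compare register and the interval; the register save and restore steps are omitted because the C compiler handles them in a simulation.

```c
#include <stdint.h>

/* Bit 1 of the sip CSR is the supervisor software interrupt pending bit. */
#define SIP_SSIP (UINT64_C(1) << 1)

/* Hypothetical scratch area prepared during boot. */
struct sim_timer_scratch {
    uint64_t *mtimecmp;   /* address of this CPU's timer compare register */
    uint64_t interval;    /* cycles between timer interrupts */
};

/* Model of the machine-mode timer handler's two essential actions.
   sip is passed as a pointer so the simulation can observe the result. */
static void sim_timervec(struct sim_timer_scratch *s, uint64_t *sip) {
    /* 1. Schedule the next timer interrupt. */
    *s->mtimecmp += s->interval;

    /* 2. Raise a supervisor software interrupt; the supervisor trap
          path picks it up once machine mode returns. */
    *sip |= SIP_SSIP;
}
```

Keeping the handler this small is what lets it run safely no matter what supervisor or user code it interrupted.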

Real-World

Xv6 allows device and timer interrupts while running kernel code as well as user code. Timer interrupts may call yield even from kernel context, which helps time-slice compute-bound kernel threads. The cost is extra complexity: kernel code must tolerate being suspended and later resumed, possibly on a different CPU. A simpler kernel could permit interrupts only while running user code, but that would reduce fairness and responsiveness for long-running kernel work.

On real systems, drivers often account for more code than the core kernel because there are many devices, many features, and often poorly documented protocols.

  • Programmed I/O: The driver moves data by explicitly reading and writing device registers. This is simple, but too slow for high-rate devices.
  • DMA: The device transfers data directly to and from RAM. The driver prepares buffers in memory and kicks off the transfer with a control-register write.
  • Interrupt mitigation and polling:
    • High-speed devices often batch work and raise one interrupt for many completions.
    • Under heavy load, drivers may disable interrupts and poll device state instead.
    • Some systems switch dynamically between polling and interrupts based on load.

For very fast devices, copying data first into a kernel buffer and then into user space can be too expensive. Real systems may use zero-copy techniques, often built around DMA, to move data more directly between devices and user buffers. When applications need device-specific controls that do not fit read and write, Unix systems expose them through ioctl.

Xv6 is also unsuitable for hard or soft real-time work. Its scheduler is too simple, and some kernel paths keep interrupts disabled for too long to guarantee bounded response times.