struct proc holds all kernel state for one process.

Field          Meaning
lock           Protects this process slot.
state          Current process state: UNUSED, USED, SLEEPING, RUNNABLE, RUNNING, or ZOMBIE.
chan           Sleep channel if the process is sleeping.
killed         Set when the process has been marked for termination.
xstate         Exit status reported to the parent.
pid            Process ID.
parent         Parent process pointer.
kstack         Virtual address of this process’s kernel stack.
sz             Size of the process’s user memory in bytes.
pagetable      User page table for this process.
trapframe      Saved user register state for trap entry and return.
context        Saved kernel context used by swtch.
ofile[NOFILE]  Open file table for this process.
cwd            Current working directory inode.
name[16]       Process name for debugging.
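
For reference, a condensed version of the declaration behind this table, following xv6-riscv’s kernel/proc.h (exact comments and field order vary slightly between versions):

```c
enum procstate { UNUSED, USED, SLEEPING, RUNNABLE, RUNNING, ZOMBIE };

struct proc {
  struct spinlock lock;

  // p->lock must be held when using these:
  enum procstate state;        // Process state
  void *chan;                  // If non-zero, sleeping on chan
  int killed;                  // If non-zero, has been killed
  int xstate;                  // Exit status returned to the parent's wait
  int pid;                     // Process ID

  // wait_lock must be held when using this:
  struct proc *parent;         // Parent process

  // These are private to the process, so p->lock need not be held:
  uint64 kstack;               // Virtual address of kernel stack
  uint64 sz;                   // Size of process memory (bytes)
  pagetable_t pagetable;       // User page table
  struct trapframe *trapframe; // Data page for trampoline.S
  struct context context;      // swtch() here to run the process
  struct file *ofile[NOFILE];  // Open files
  struct inode *cwd;           // Current directory
  char name[16];               // Process name (debugging)
};
```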

struct cpu is xv6’s per-hart bookkeeping object.

  • Records which process this CPU is currently running.
  • Holds the saved scheduler context used to resume the scheduler after a process stops.
  • Tracks enough interrupt state to support nested critical sections.
  • Must be read with interrupts off when CPU-local identity matters, so the process cannot move to another CPU mid-read.
Field    Meaning
proc     Process currently running on this CPU, or 0 if none.
context  Saved scheduler context used when switching back from a process.
noff     Nesting depth of push_off interrupt-disable sections.
intena   Whether interrupts were enabled before the first push_off.
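
The corresponding declaration, essentially as in xv6-riscv’s kernel/proc.h:

```c
// Per-CPU state.
struct cpu {
  struct proc *proc;      // The process running on this cpu, or 0.
  struct context context; // swtch() here to enter scheduler().
  int noff;               // Depth of push_off() nesting.
  int intena;             // Were interrupts enabled before push_off()?
};
```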

Process-table interfaces:

allocproc

  • Finds an UNUSED process slot.
  • Changes it to USED and assigns a PID.
  • Allocates the trapframe page.
  • Calls proc_pagetable to create the initial user page table.
  • The initial user page table contains trampoline and trapframe mappings.
  • Initializes the kernel context so the process starts at forkret.
  • Sets p->context.sp to the top of the process kernel stack.
  • Returns with p->lock still held.
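
The shape of the function, condensed from xv6-riscv’s kernel/proc.c:

```c
static struct proc*
allocproc(void)
{
  struct proc *p;

  // Find an UNUSED slot in the process table.
  for(p = proc; p < &proc[NPROC]; p++) {
    acquire(&p->lock);
    if(p->state == UNUSED)
      goto found;
    release(&p->lock);
  }
  return 0;

found:
  p->pid = allocpid();
  p->state = USED;

  // Allocate a trapframe page.
  if((p->trapframe = (struct trapframe *)kalloc()) == 0){
    freeproc(p);
    release(&p->lock);
    return 0;
  }

  // An empty user page table, with trampoline and trapframe mapped.
  p->pagetable = proc_pagetable(p);
  if(p->pagetable == 0){
    freeproc(p);
    release(&p->lock);
    return 0;
  }

  // Set up a new context to start executing at forkret,
  // which returns to user space.
  memset(&p->context, 0, sizeof(p->context));
  p->context.ra = (uint64)forkret;
  p->context.sp = p->kstack + PGSIZE;

  return p;   // p->lock is still held.
}
```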

proc_pagetable

  • Creates an empty user page table for one process.
  • Calls uvmcreate to allocate the empty root page table.
  • Maps TRAMPOLINE at the top of user virtual memory.
  • Maps the process’s TRAPFRAME just below TRAMPOLINE.
  • Uses mappages for both mappings.
  • Uses uvmunmap and uvmfree to clean up if mapping fails.
  • Leaves ordinary user memory unmapped until exec, fork, or memory growth.
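
The implementation is short; this follows kernel/proc.c:

```c
pagetable_t
proc_pagetable(struct proc *p)
{
  pagetable_t pagetable;

  // An empty page table.
  pagetable = uvmcreate();
  if(pagetable == 0)
    return 0;

  // Map the trampoline code (for system call return)
  // at the highest user virtual address. Only the kernel
  // uses it on the way to/from user space, so not PTE_U.
  if(mappages(pagetable, TRAMPOLINE, PGSIZE,
              (uint64)trampoline, PTE_R | PTE_X) < 0){
    uvmfree(pagetable, 0);
    return 0;
  }

  // Map the trapframe page just below the trampoline page.
  if(mappages(pagetable, TRAPFRAME, PGSIZE,
              (uint64)(p->trapframe), PTE_R | PTE_W) < 0){
    uvmunmap(pagetable, TRAMPOLINE, 1, 0);
    uvmfree(pagetable, 0);
    return 0;
  }

  return pagetable;
}
```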

forkret

  • First kernel function a newly scheduled process runs.
  • Releases p->lock, which the scheduler still holds after swtch.
  • Runs fsinit(ROOTDEV) only for the first process.
  • Runs /init, via kexec, for the first process only.
  • Uses prepare_return and trampoline userret to enter user mode.
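
A hedged sketch of the control flow. The version described in this text reaches user mode via kexec, prepare_return, and the trampoline’s userret; stock xv6-riscv calls usertrapret instead, so treat the tail of this sketch as approximate:

```c
void
forkret(void)
{
  static int first = 1;

  // Still holding p->lock from scheduler().
  release(&myproc()->lock);

  if(first){
    // File-system initialization must run in the context of a
    // regular process (it may sleep), so it cannot run from main().
    fsinit(ROOTDEV);
    first = 0;
    // In this text's version, the first process also loads and
    // runs /init here via kexec("/init", ...).
  }

  usertrapret();  // return to user mode through the trampoline
}
```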

First process setup path:

Step  Caller    Function         State             What happens next
1     userinit  Calls allocproc  UNUSED -> USED    A first process slot is prepared and returned locked.
2     userinit  p->cwd = /       Still USED        The kernel records this process as the special init process.
3     userinit  p->state         USED -> RUNNABLE  The scheduler can choose it.
4     userinit  release          Still RUNNABLE    Other CPUs and the scheduler can safely inspect the slot.

Child process setup path:

Step  Caller        Function                 State             What happens next
1     fork syscall  Calls allocproc          UNUSED -> USED    A child process slot is claimed and returned locked.
2     kfork         uvmcopy                  Still USED        Parent user memory is copied into the child.
3     kfork         Copies parent trapframe  Still USED        Child starts with the same saved user registers.
4     kfork         np->tf->a0 = 0           Still USED        fork() will return 0 in the child.
5     kfork         Duplicates open files    Still USED        Child shares references to the parent’s open files.
6     kfork         Duplicates cwd           Still USED        Child starts in the same current directory.
7     kfork         Copies process name      Still USED        Child gets the parent’s debug name.
8     kfork         release                  Still USED        Child resource setup is complete.
9     kfork         np->parent = p           Still USED        Parent-child relationship is recorded under wait_lock.
10    kfork         acquire                  Still USED        The child slot is locked again before state change.
11    kfork         np->state                USED -> RUNNABLE  The scheduler can choose the child.
12    kfork         release                  Still RUNNABLE    Other CPUs and the scheduler can inspect the child.

At this point, process creation has only made a process eligible to run. The transition to actually executing is the scheduler’s job:

scheduler

  • Runs forever on each CPU after boot setup.
  • Updates struct cpu: uses c->proc and c->context for this CPU.
  • Clears c->proc when the CPU is not running a process.
  • Enables interrupts at the top of each scan, so pending device interrupts can fire even when every process is asleep; this avoids deadlock.
  • Disables interrupts again before wfi, so an interrupt cannot slip in between the final scan and the wait; wfi still wakes on a pending interrupt.
  • Scans proc[] for a RUNNABLE process.
  • Locks each process slot before checking its state.
  • Changes the chosen process to RUNNING.
  • Stores the process in this CPU’s c->proc.
  • Calls swtch(&c->context, &p->context) to enter the process.
  • swtch saves the scheduler’s CPU context in c->context.
  • The process must change its own state before switching back.
  • Clears c->proc after the process returns to the scheduler.
  • Resumes scanning when the process yields, sleeps, or exits back to the scheduler.
  • Runs wfi if no process is runnable, so the CPU waits for an interrupt.
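
The loop, condensed from kernel/proc.c (the wfi idle path appears in recent xv6-riscv versions):

```c
void
scheduler(void)
{
  struct proc *p;
  struct cpu *c = mycpu();

  c->proc = 0;
  for(;;){
    // The most recent process to run may have had interrupts
    // turned off; enable them to avoid deadlock if all
    // processes are waiting for I/O.
    intr_on();

    int found = 0;
    for(p = proc; p < &proc[NPROC]; p++) {
      acquire(&p->lock);
      if(p->state == RUNNABLE) {
        // Switch to the chosen process. It is the process's job
        // to release its lock and then reacquire it
        // before jumping back to us.
        p->state = RUNNING;
        c->proc = p;
        swtch(&c->context, &p->context);

        // Process is done running for now.
        // It should have changed its p->state before coming back.
        c->proc = 0;
        found = 1;
      }
      release(&p->lock);
    }
    if(found == 0){
      // Nothing to run; wait for an interrupt.
      intr_off();
      asm volatile("wfi");
    }
  }
}
```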

swtch

  • Assembly context-switch.
  • Saves current ra, sp, and callee-saved registers into the old context.
  • Loads ra, sp, and callee-saved registers from the new context.
  • scheduler uses it to enter a process.
  • sched uses it to return to the scheduler.
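
The contexts swtch operates on hold only ra, sp, and the callee-saved registers (kernel/proc.h):

```c
// Saved registers for kernel context switches.
struct context {
  uint64 ra;
  uint64 sp;

  // callee-saved
  uint64 s0;
  uint64 s1;
  uint64 s2;
  uint64 s3;
  uint64 s4;
  uint64 s5;
  uint64 s6;
  uint64 s7;
  uint64 s8;
  uint64 s9;
  uint64 s10;
  uint64 s11;
};
```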

Running, sleeping, and ending:

Case                   Caller / event                              Main path  From      To
Give up CPU            Timer interrupt or process path             yield      RUNNING   RUNNABLE
Wait for event         Kernel code needing a condition             sleep      RUNNING   SLEEPING
Event happens          Producer or interrupt handler               wakeup     SLEEPING  RUNNABLE
Kill sleeping process  kill syscall path                           kkill      SLEEPING  RUNNABLE
Exit current process   exit syscall path or fatal user condition   kexit      RUNNING   ZOMBIE
Reap child             Parent wait syscall path                    kwait      ZOMBIE    UNUSED

sched

  • Low-level switch back to the scheduler.
  • Updates struct cpu: switches back to mycpu()->context and restores mycpu()->intena.
  • Requires the process to already hold p->lock.
  • Requires the process state to already be changed away from RUNNING.
  • Calls swtch(&p->context, &mycpu()->context).
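
sched in full, from kernel/proc.c; the panics enforce the preconditions listed above:

```c
void
sched(void)
{
  int intena;
  struct proc *p = myproc();

  if(!holding(&p->lock))
    panic("sched p->lock");
  if(mycpu()->noff != 1)
    panic("sched locks");
  if(p->state == RUNNING)
    panic("sched running");
  if(intr_get())
    panic("sched interruptible");

  // Save and restore intena: it is a property of this kernel
  // thread, not of this CPU.
  intena = mycpu()->intena;
  swtch(&p->context, &mycpu()->context);
  mycpu()->intena = intena;
}
```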

yield

  • Voluntarily gives up the CPU for one scheduling round.
  • Sets the current process back to RUNNABLE.
  • Calls sched to return to the scheduler.
  • Updates CPU state indirectly through sched.
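
yield is four lines around sched (kernel/proc.c):

```c
void
yield(void)
{
  struct proc *p = myproc();
  acquire(&p->lock);
  p->state = RUNNABLE;
  sched();
  release(&p->lock);
}
```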

sleep

  • Puts the current process to sleep on a channel.
  • Acquires p->lock, releases the condition lock, then sets p->chan and SLEEPING.
  • Calls sched, then clears p->chan after wakeup.
  • Updates CPU state indirectly through sched.
  • Reacquires the original condition lock before returning.
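
sleep, from kernel/proc.c; note the lock hand-off described above:

```c
void
sleep(void *chan, struct spinlock *lk)
{
  struct proc *p = myproc();

  // Must acquire p->lock in order to change p->state and then call
  // sched. Once we hold p->lock, we are guaranteed not to miss any
  // wakeup (wakeup locks p->lock), so it is okay to release lk.
  acquire(&p->lock);
  release(lk);

  // Go to sleep.
  p->chan = chan;
  p->state = SLEEPING;

  sched();

  // Tidy up.
  p->chan = 0;

  // Reacquire the original condition lock.
  release(&p->lock);
  acquire(lk);
}
```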

wakeup

  • Scans proc[] for processes sleeping on a channel.
  • Changes matching processes from SLEEPING to RUNNABLE.
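
wakeup, from kernel/proc.c:

```c
void
wakeup(void *chan)
{
  struct proc *p;

  for(p = proc; p < &proc[NPROC]; p++) {
    if(p != myproc()){
      acquire(&p->lock);
      if(p->state == SLEEPING && p->chan == chan)
        p->state = RUNNABLE;
      release(&p->lock);
    }
  }
}
```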

kexit

  • Closes all open files via fileclose and releases the current directory inode via iput.
  • Calls reparent to give abandoned children to initproc.
  • Calls wakeup(p->parent) in case the parent is waiting.
  • Stores exit status and changes the process to ZOMBIE.
  • Calls sched and never returns.
  • Updates CPU state indirectly through sched.

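A sketch following stock xv6-riscv’s exit(), which this text’s version renames kexit; the fileclose/iput details may differ slightly between versions:

```c
void
kexit(int status)
{
  struct proc *p = myproc();

  if(p == initproc)
    panic("init exiting");

  // Close all open files.
  for(int fd = 0; fd < NOFILE; fd++){
    if(p->ofile[fd]){
      fileclose(p->ofile[fd]);
      p->ofile[fd] = 0;
    }
  }

  // Release the current working directory.
  begin_op();
  iput(p->cwd);
  end_op();
  p->cwd = 0;

  acquire(&wait_lock);

  // Give any children to init.
  reparent(p);

  // The parent might be sleeping in kwait().
  wakeup(p->parent);

  acquire(&p->lock);
  p->xstate = status;
  p->state = ZOMBIE;

  release(&wait_lock);

  // Jump into the scheduler, never to return.
  sched();
  panic("zombie exit");
}
```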

kwait

  • Parent scans proc[] for zombie children.
  • Uses copyout to return the child’s exit status to user space when requested.
  • Calls freeproc to release the zombie child slot.
  • Calls sleep(p, &wait_lock) if children exist but none are zombies yet.

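A sketch following stock xv6-riscv’s wait(), renamed kwait in this text’s version (older versions read p->killed directly instead of calling the killed() helper):

```c
int
kwait(uint64 addr)
{
  struct proc *pp;
  int havekids, pid;
  struct proc *p = myproc();

  acquire(&wait_lock);

  for(;;){
    // Scan the table looking for exited children.
    havekids = 0;
    for(pp = proc; pp < &proc[NPROC]; pp++){
      if(pp->parent == p){
        // Make sure the child isn't still in exit() or swtch().
        acquire(&pp->lock);
        havekids = 1;
        if(pp->state == ZOMBIE){
          // Found one; copy out its exit status if requested.
          pid = pp->pid;
          if(addr != 0 && copyout(p->pagetable, addr,
                                  (char *)&pp->xstate,
                                  sizeof(pp->xstate)) < 0){
            release(&pp->lock);
            release(&wait_lock);
            return -1;
          }
          freeproc(pp);
          release(&pp->lock);
          release(&wait_lock);
          return pid;
        }
        release(&pp->lock);
      }
    }

    // No point waiting if we don't have any children.
    if(!havekids || killed(p)){
      release(&wait_lock);
      return -1;
    }

    // Wait for a child to exit.
    sleep(p, &wait_lock);
  }
}
```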

kkill

  • Finds a process by PID and sets p->killed.
  • If the victim is sleeping, changes it to RUNNABLE.
  • The victim actually exits later when it checks killed.
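
Following stock xv6-riscv’s kill(), which this text’s version calls kkill:

```c
int
kkill(int pid)
{
  struct proc *p;

  for(p = proc; p < &proc[NPROC]; p++){
    acquire(&p->lock);
    if(p->pid == pid){
      p->killed = 1;
      if(p->state == SLEEPING){
        // Wake the process from sleep(); it will notice
        // killed when it rechecks its condition.
        p->state = RUNNABLE;
      }
      release(&p->lock);
      return 0;
    }
    release(&p->lock);
  }
  return -1;
}
```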

freeproc

  • Releases process-owned kernel resources.
  • Frees the trapframe page if present.
  • Frees the user page table and user memory, if present, via proc_freepagetable, which also unmaps the trampoline and trapframe pages.
  • Clears process metadata and returns the slot to UNUSED.
  • Requires p->lock to already be held.
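
freeproc, from kernel/proc.c:

```c
// Free a proc structure and the data hanging from it,
// including user pages. p->lock must be held.
static void
freeproc(struct proc *p)
{
  if(p->trapframe)
    kfree((void*)p->trapframe);
  p->trapframe = 0;
  if(p->pagetable)
    proc_freepagetable(p->pagetable, p->sz);
  p->pagetable = 0;
  p->sz = 0;
  p->pid = 0;
  p->parent = 0;
  p->name[0] = 0;
  p->chan = 0;
  p->killed = 0;
  p->xstate = 0;
  p->state = UNUSED;
}
```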

growproc

  • Grows or shrinks the process’s user memory by n bytes.
  • Calls uvmalloc to add memory or uvmdealloc to release it.
  • Updates p->sz on success.
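
growproc, as in stock xv6-riscv:

```c
// Grow or shrink user memory by n bytes.
// Return 0 on success, -1 on failure.
int
growproc(int n)
{
  uint64 sz;
  struct proc *p = myproc();

  sz = p->sz;
  if(n > 0){
    if((sz = uvmalloc(p->pagetable, sz, sz + n, PTE_W)) == 0)
      return -1;
  } else if(n < 0){
    sz = uvmdealloc(p->pagetable, sz, sz + n);
  }
  p->sz = sz;
  return 0;
}
```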

Scheduling

Multiplexing

An operating system typically runs more processes than it has CPUs, so the CPUs must be time-shared among processes. The goal is to make this transparent: each process should have the illusion of its own virtual CPU. xv6 achieves this through multiplexing, switching each CPU between processes in two situations:

  • When a process waits for I/O, a child to exit, or calls sleep, xv6 switches to another process voluntarily.
  • When a process computes for too long without sleeping, xv6 uses periodic hardware timer interrupts to force a switch.

This creates the same kind of illusion that virtual memory creates for physical memory: each process believes it has exclusive access to a resource it actually shares.

Implementing multiplexing cleanly raises several challenges:

  • How to switch from one process to another: context switching is simple in concept but complex in implementation.
  • How to force switches transparently, which xv6 handles via hardware timer interrupts.
  • How to avoid races when all CPUs share the same process table, requiring a careful locking plan.
  • How to free a process’s memory and resources on exit, given that a process cannot free its own kernel stack while still using it.
  • How each CPU core tracks which process it is running, so system calls affect the correct kernel state.
  • How to avoid lost wakeup races in sleep and wakeup.

xv6 solves these as simply as possible, but the resulting code is still among the trickiest in the kernel.

Context Switching

A context switch from one user process to another involves four steps:

  1. A user-kernel transition on the old process.
  2. A context switch to the CPU’s scheduler thread.
  3. A context switch to the new process’s kernel thread.
  4. A trap return to user space.

Each CPU has its own dedicated scheduler thread with its own stack and saved registers. This is necessary because it is not safe for the scheduler to run on the old process’s kernel stack: another core might wake that process and run it, and two cores executing on the same stack would be a disaster.

At the hardware level, a context switch saves the current thread’s registers and restores a previously saved set. The stack pointer and return address are among the saved registers, so the CPU switches both stacks and the code it is executing. The swtch function performs this save and restore for kernel thread switches. It operates on struct context, which holds ra, sp, and the twelve callee-saved RISC-V registers s0–s11. Caller-saved registers are handled by the C compiler, which saves them on the stack before calling swtch.

Key properties of swtch:

  • It takes two arguments, a pointer to the old context and a pointer to the new context. It saves current registers into the old, loads from the new, and returns.
  • It saves the ra register (the return address) rather than the program counter directly.
  • When it returns, it resumes at the instruction pointed to by the restored ra, on the new thread’s stack.

The typical flow during a voluntary yield:

  1. usertrap calls yield.
  2. yield calls sched.
  3. sched calls swtch, saving the process’s context and switching to the per-CPU scheduler context.
  4. When swtch returns, execution resumes inside scheduler, not sched, on the scheduler’s stack.

The Scheduler

The scheduler runs as a special per-CPU thread, executing a loop that finds a runnable process, runs it until it yields, and repeats. A process that wants to give up the CPU must follow a specific sequence before calling sched:

  1. Acquire its own process lock p->lock.
  2. Release any other locks it holds.
  3. Update its state in p->state.
  4. Call sched.

sched verifies these requirements, then calls swtch to save the current context and switch to the scheduler context. The scheduler then continues its loop:

  1. Finds the next runnable process.
  2. Sets c->proc.
  3. Marks it RUNNING.
  4. Calls swtch to run it.

sched and scheduler are co-routines: sched always switches to scheduler, and scheduler almost always switches back to some thread that previously called sched. The one exception is a newly created process, whose context is set up by allocproc to jump to forkret on its first switch. forkret releases p->lock and returns to user space as if returning from fork.

xv6 holds p->lock across calls to swtch, meaning the thread that acquires the lock is not the one that releases it. This breaks the usual locking convention but is necessary because p->lock protects invariants on the process’s state and context fields that do not hold during swtch itself. Without the lock, another CPU could see the process as RUNNABLE and start running it before swtch has finished, putting two CPUs on the same kernel stack.

The invariants p->lock enforces depend on the process state:

  • RUNNING: the CPU registers hold the process’s values, and c->proc refers to it. A timer interrupt must be able to yield safely at any point.
  • RUNNABLE: p->context holds the saved registers, no CPU is executing on its kernel stack, and no c->proc points to it. An idle scheduler must be able to pick it up safely.

These invariants are the reason p->lock is often acquired in one thread and released in another. yield acquires it and scheduler releases it; scheduler acquires it and yield releases it. The lock must be held continuously across the transition to keep the invariants intact.

Per-CPU State

On a uniprocessor, a global variable pointing to the current process would suffice. On a multi-core machine, each core executes a different process, so per-core state is needed. xv6 solves this by dedicating one hardware register per CPU to help locate per-core information.

Each CPU has a struct cpu that records the process currently running on it, the saved registers for the scheduler thread, and the nesting count of spinlocks for interrupt management. mycpu returns a pointer to the current CPU’s struct cpu by reading the tp register, which holds the CPU’s hardware thread ID (hartid). xv6 uses hartid to index into an array of struct cpu to find the right one.

Keeping tp correct across all transitions requires care:

  • mstart sets tp early in the boot sequence while still in machine mode.
  • usertrapret saves tp into the trampoline page before returning to user space, since user code could modify it.
  • uservec restores the saved tp when entering the kernel from user space.

The return value of mycpu is fragile: a timer interrupt could move the thread to a different CPU, making the returned pointer stale. Callers must disable interrupts before calling mycpu and only re-enable them after they are done using the result.

myproc returns the struct proc of the process running on the current CPU. It disables interrupts, calls mycpu, reads c->proc, and re-enables interrupts. Unlike mycpu, the return value of myproc remains valid even after interrupts are re-enabled, since a process’s struct proc pointer does not change when it moves between CPUs.
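
Both helpers are small (kernel/proc.c); the push_off/pop_off pair in myproc is the interrupt discipline just described:

```c
// Return this CPU's cpu struct.
// Interrupts must be disabled.
struct cpu*
mycpu(void)
{
  int id = cpuid();       // reads the tp register
  struct cpu *c = &cpus[id];
  return c;
}

// Return the current struct proc *, or zero if none.
struct proc*
myproc(void)
{
  push_off();             // disable interrupts
  struct cpu *c = mycpu();
  struct proc *p = c->proc;
  pop_off();              // re-enable if they were on before
  return p;
}
```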

Sleep and Wakeup

Locks and scheduling hide concurrent activity from threads, but sometimes threads need to intentionally wait for each other. A pipe reader may need to wait for a writer to produce data; a parent calling wait may need to wait for a child to exit; a process reading the disk must wait for the hardware to finish. xv6 handles these cases with sleep and wakeup, also called sequence coordination or conditional synchronization mechanisms.

sleep(chan, lock) puts the calling process to sleep on an arbitrary wait channel chan. wakeup(chan) wakes all processes sleeping on that channel. The wait channel can be any mutually convenient value; xv6 typically uses the address of a kernel data structure involved in the waiting.

To show why the interface must be designed carefully, consider building a semaphore using sleep and wakeup. A semaphore maintains a count with two operations:

  • V (producer): increments the count and calls wakeup.
  • P (consumer): waits until the count is non-zero, then decrements it.

A naive P that spins checking the count wastes CPU time. Replacing the spin with sleep seems natural but introduces the lost wakeup problem: if V runs between P’s check of the count and P’s call to sleep, V finds no sleeping process and does nothing. P then sleeps waiting for a wakeup that has already happened. Having P acquire the lock before checking the count avoids this race but causes a deadlock: P holds the lock while sleeping, so V can never acquire it.

The correct fix is to pass the condition lock into sleep. The caller acquires the lock before checking the condition, then passes it to sleep, which atomically marks the process as sleeping and releases the lock. This ensures no other process can call wakeup between the check and the sleep.
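
The resulting semaphore, essentially as given in the xv6 book:

```c
struct semaphore {
  struct spinlock lock;
  int count;
};

// V: producer side. Increment and wake any sleeping consumers.
void
V(struct semaphore *s)
{
  acquire(&s->lock);
  s->count += 1;
  wakeup(s);
  release(&s->lock);
}

// P: consumer side. Sleep until count is non-zero, then decrement.
// The while loop re-checks the condition after every wakeup.
void
P(struct semaphore *s)
{
  acquire(&s->lock);
  while(s->count == 0)
    sleep(s, &s->lock);   // atomically sleeps and releases s->lock
  s->count -= 1;
  release(&s->lock);
}
```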

How Sleep and Wakeup Work

When sleep is called, it acquires p->lock while still holding the condition lock. At this point both locks are held. sleep can then safely release the condition lock: any concurrent call to wakeup will block on p->lock and wait until the process is fully marked SLEEPING. Once p->lock is held alone, sleep records the channel, sets the state to SLEEPING, and calls sched to yield the CPU. p->lock is not released until after the process is marked SLEEPING, which is why the ordering is critical.

wakeup loops over the process table, acquires p->lock for each process it inspects, and sets any SLEEPING process with a matching channel to RUNNABLE. The next scheduler pass will pick it up. wakeup should be called while holding the condition lock so that the waker and the sleeper cannot miss each other.

The no-lost-wakeup guarantee holds because the sleeping process holds either the condition lock or p->lock or both from before it checks the condition until after it is marked SLEEPING. The waker holds both locks in its loop, so it either makes the condition true before the sleeper checks it, or it finds the process already marked SLEEPING and wakes it.

Two additional points:

  • Multiple processes may sleep on the same channel. A single wakeup wakes them all, but only one will find the condition true. The rest will find no data and must sleep again. For this reason, sleep is always called inside a while loop that re-checks the condition.
  • Two unrelated uses of sleep/wakeup may accidentally share a channel, producing spurious wakeups. The while loop handles this safely.

Pipes

A pipe is a practical example of sleep/wakeup coordinating producers and consumers. Each pipe is a struct pipe containing a lock, a circular data buffer, and two counters: nread (total bytes read) and nwrite (total bytes written). The buffer wraps around, but the counters do not, which lets the implementation distinguish a full buffer (nwrite == nread + PIPESIZE) from an empty one (nwrite == nread). Indexing into the buffer therefore uses data[nread % PIPESIZE] rather than data[nread] directly.
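
The structure, from kernel/pipe.c:

```c
#define PIPESIZE 512

struct pipe {
  struct spinlock lock;
  char data[PIPESIZE];
  uint nread;     // number of bytes read
  uint nwrite;    // number of bytes written
  int readopen;   // read fd is still open
  int writeopen;  // write fd is still open
};
```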

Consider pipewrite and piperead running simultaneously on two CPUs:

  1. pipewrite acquires the pipe lock and starts copying bytes into the buffer.
  2. piperead tries to acquire the lock, finds it held, and spins.
  3. If the buffer fills before pipewrite finishes, it calls wakeup on &pi->nread to alert any sleeping readers, then sleeps on &pi->nwrite, releasing the lock in the process.
  4. With the lock now free, piperead acquires it, finds data available, copies bytes out of the buffer, and increments nread.
  5. piperead calls wakeup on &pi->nwrite to wake any sleeping writers, then returns.
  6. The wakeup finds pipewrite sleeping on &pi->nwrite and marks it RUNNABLE. The scheduler will resume it shortly.

Pipes use separate sleep channels for readers (pi->nread) and writers (pi->nwrite). This avoids waking the wrong side unnecessarily when many readers and writers are waiting on the same pipe. Both piperead and pipewrite sleep inside a loop that re-checks the condition, so any process woken spuriously will simply sleep again.
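
The reader side, condensed from kernel/pipe.c (newer versions use a killed(pr) helper; older ones read pr->killed directly):

```c
int
piperead(struct pipe *pi, uint64 addr, int n)
{
  int i;
  struct proc *pr = myproc();
  char ch;

  acquire(&pi->lock);
  while(pi->nread == pi->nwrite && pi->writeopen){  // pipe empty
    if(killed(pr)){
      release(&pi->lock);
      return -1;
    }
    sleep(&pi->nread, &pi->lock);   // sleep on the reader channel
  }
  for(i = 0; i < n; i++){           // copy out what is available
    if(pi->nread == pi->nwrite)
      break;
    ch = pi->data[pi->nread++ % PIPESIZE];
    if(copyout(pr->pagetable, addr + i, &ch, 1) == -1)
      break;
  }
  wakeup(&pi->nwrite);              // wake any sleeping writers
  release(&pi->lock);
  return i;
}
```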

Wait, Exit, and Kill

A parent and child can reach their wait/exit interaction in any order: the parent may already be sleeping in wait when the child exits, or it may call wait long after. xv6 handles this by having exit put the process into a ZOMBIE state. The child stays there until the parent’s wait notices it, frees the child’s resources, copies the exit status, and returns the child’s PID. If the parent exits before the child, the child is reparented to init, which perpetually calls wait, ensuring every process has a parent to clean up after it.

wait acquires wait_lock first, which acts as the condition lock to prevent the parent from missing a wakeup from an exiting child. It then scans the process table:

  • If a child is in ZOMBIE state, wait frees its resources and returns its PID.
  • If children exist but none have exited, wait calls sleep and scans again when woken.

When holding multiple locks, wait always acquires wait_lock before any process’s p->lock to avoid deadlock.

exit performs several steps under both wait_lock and p->lock. It records the exit status, frees resources, reparents its children to init, wakes the parent, marks itself ZOMBIE, and permanently yields the CPU. wait_lock is held to prevent the parent from missing the wakeup. p->lock is held to prevent the parent from observing the ZOMBIE state before the child has finished calling swtch. Although exit wakes the parent before setting the state to ZOMBIE, this is safe: the parent’s loop in wait cannot examine the child until the scheduler releases p->lock, which happens only after the state is set.

kill does not destroy a process directly, since the victim may be executing on another CPU in the middle of sensitive kernel work. Instead, kill sets the victim’s p->killed flag and wakes it up if it is sleeping. The wakeup may cause sleep to return even though the condition is not yet satisfied, but since all sleep calls are inside while loops that re-check the condition, this is handled safely. When the victim next enters or exits the kernel, usertrap checks p->killed and calls exit.

Some sleep loops also check p->killed directly and abandon the current operation if it is set. However, not all do: loops in the middle of a multi-step atomic operation, such as a disk write sequence, do not check p->killed midway, since abandoning partway would leave the file system in an inconsistent state. A process killed during such an operation will not exit until the current system call completes.

Process Locking

p->lock is the most complex lock in xv6. At a basic level it must be held when reading or writing p->state, p->chan, p->killed, p->xstate, and p->pid, since other processes and scheduler threads on other cores can access these fields. Beyond that, p->lock protects higher-level invariants across xv6’s process data structures and algorithms. Its full set of responsibilities:

  • Prevents races when allocating proc[] slots for new processes.
  • Conceals a process from view while it is being created or destroyed.
  • Prevents a parent’s wait from collecting a process that has set its state to ZOMBIE but has not yet yielded the CPU.
  • Prevents another core’s scheduler from running a yielding process after it sets its state to RUNNABLE but before it finishes swtch.
  • Ensures only one core’s scheduler runs a RUNNABLE process.
  • Prevents a timer interrupt from causing a process to yield while it is inside swtch.
  • Together with the condition lock, prevents wakeup from overlooking a process that is calling sleep but has not finished yielding the CPU.
  • Prevents the victim of kill from exiting and being reallocated between kill’s check of p->pid and its write of p->killed.
  • Makes kill’s check and write of p->state atomic.

p->parent is not protected by p->lock. It is protected by the global wait_lock, since only a process’s parent writes it, though other processes searching for their children may read it. wait_lock also acts as the condition lock when wait sleeps waiting for a child to exit: an exiting child holds either wait_lock or p->lock until after it has set its state to ZOMBIE, woken up its parent, and yielded the CPU. wait_lock serializes concurrent exits by a parent and child, guaranteeing init is woken up when it inherits an orphan. It is a global lock rather than a per-process lock because a process cannot know who its parent is until it acquires the lock.

Real World

xv6 uses round-robin scheduling, running each process in turn. Real operating systems implement priority-based policies that prefer high-priority runnable processes over low-priority ones. These policies grow complex quickly because competing goals such as fairness, throughput, and responsiveness must be balanced simultaneously. Complex policies can also produce unintended side effects:

  • Priority inversion: a low-priority process holds a lock needed by a high-priority process, stalling it.
  • Convoys: many high-priority processes queue behind a low-priority one that holds a shared lock; once formed, a convoy can persist for a long time.

The lost wakeup problem is a fundamental challenge for any sleep/wakeup mechanism. Different systems have tackled it differently:

  • Early Unix simply disabled interrupts during sleep, which worked because Unix ran on a single CPU.
  • xv6 and FreeBSD use an explicit lock passed into sleep.
  • Plan 9’s sleep accepts a callback function that runs under the scheduling lock just before sleeping, as a last-minute condition check.
  • Linux uses a wait queue with its own internal lock instead of a plain channel.

All of these share the same pattern: the sleep condition is protected by a lock that is dropped atomically as the process goes to sleep.

A related problem is the thundering herd: a single wakeup wakes all processes waiting on a channel, and they all race to check the condition. Most condition variable implementations address this with two primitives: signal, which wakes one process, and broadcast, which wakes all. Semaphores avoid this entirely by maintaining an explicit count of wakeups, which also naturally eliminates lost wakeups and spurious wakeups.

Process termination is complex in most operating systems. A process being killed may be deep inside the kernel with multiple stack frames that each require cleanup. Some languages provide exception mechanisms to help unwind the stack; C does not. Unix handles this partly through signals: another process can send a signal to a sleeping process, causing it to return from the system call with -1 and EINTR, letting the application decide what to do. xv6 does not support signals, so this complexity does not arise.

xv6’s kill support has known limitations. Some sleep loops do not check p->killed, so a killed process may not notice the flag until the event it is waiting for eventually occurs. There is also a window between a sleep loop’s check of p->killed and its call to sleep: if kill sets the flag and issues its wakeup in that window, the victim still goes to sleep, and the kill is delayed until the condition occurs. A real operating system would also maintain an explicit free list for proc structures to allocate a free slot in constant time; xv6 uses a linear scan for simplicity.

At the point userinit finishes, the first process does not contain user code yet.

  • The scheduler will later choose this RUNNABLE process
  • The first context switch enters forkret
  • forkret runs in kernel mode on this process’s kernel stack
  • The first forkret call runs fsinit(ROOTDEV), because filesystem initialization may sleep and therefore cannot run during boot before process context exists
  • After the filesystem is ready, forkret calls kexec("/init", ...)
  • kexec loads /init from the filesystem into the process’s user address space
  • The trampoline return path then enters user mode

swtch only switches kernel contexts.

  • It saves ra, sp, and callee-saved registers into the old struct context
  • It loads ra, sp, and callee-saved registers from the new struct context
  • It does not switch user registers directly
  • User registers are handled separately by the trapframe and trampoline path

The first process reaches user mode through this path.

  • userinit created a RUNNABLE process whose context starts at forkret
  • scheduler picks that process and switches into p->context
  • forkret runs fsinit(ROOTDEV) the first time
  • forkret calls kexec("/init", ...)
  • prepare_return and trampoline code return into user mode
  • /init starts running as the first user program

If no process is runnable, the CPU executes wfi and waits for an interrupt.

When a runnable process is found:

  • p->state changes from RUNNABLE to RUNNING
  • c->proc is set to that process
  • swtch(&c->context, &p->context) saves the scheduler context and loads the process context
  • The process starts or resumes in kernel mode
  • When the process later gives up the CPU, it switches back to c->context
  • c->proc is cleared after control returns to the scheduler