struct proc holds all kernel state for one process.

| Field | Meaning |
|---|---|
| lock | Protects this process slot. |
| state | Current process state: UNUSED, USED, SLEEPING, RUNNABLE, RUNNING, or ZOMBIE. |
| chan | Sleep channel if the process is sleeping. |
| killed | Set when the process has been marked for termination. |
| xstate | Exit status reported to the parent. |
| pid | Process ID. |
| parent | Parent process pointer. |
| kstack | Virtual address of this process's kernel stack. |
| sz | Size of the process's user memory in bytes. |
| pagetable | User page table for this process. |
| trapframe | Saved user register state for trap entry and return. |
| context | Saved kernel context used by swtch. |
| ofile[NOFILE] | Open file table for this process. |
| cwd | Current working directory inode. |
| name[16] | Process name for debugging. |
struct cpu is xv6's per-hart bookkeeping object. It:

- records which process this CPU is currently running;
- holds the saved context needed to resume the scheduler after a process stops;
- tracks enough interrupt state for nested critical sections.

It must be read with interrupts off when CPU-local identity matters, so the process cannot move to another CPU mid-read.

| Field | Meaning |
|---|---|
| proc | Process currently running on this CPU, or 0 if none. |
| context | Saved scheduler context used when switching back from a process. |
| noff | Nesting depth of push_off interrupt-disable sections. |
| intena | Whether interrupts were enabled before the first push_off. |
Process-table interfaces:
allocproc
- Finds an `UNUSED` process slot.
- Changes it to `USED` and assigns a PID.
- Allocates the trapframe page.
- Calls `proc_pagetable` to create the initial user page table, which contains only the trampoline and trapframe mappings.
- Initializes the kernel context so the process starts at `forkret`.
- Sets `p->context.sp` to the top of the process kernel stack.
- Returns with `p->lock` still held.
proc_pagetable
- Creates an empty user page table for one process.
- Calls `uvmcreate` to allocate the empty root page table.
- Maps `TRAMPOLINE` at the top of user virtual memory.
- Maps the process's `TRAPFRAME` just below `TRAMPOLINE`.
- Uses `mappages` for both mappings.
- Uses `uvmunmap` and `uvmfree` to clean up if mapping fails.
- Leaves ordinary user memory unmapped until `exec`, `fork`, or memory growth.
forkret
- First kernel function a newly scheduled process runs.
- Releases `p->lock`, which the scheduler still holds after `swtch`.
- Runs `fsinit(ROOTDEV)` only for the first process.
- Executes `/init` only for the first process, through `kexec`.
- Uses `prepare_return` and the trampoline `userret` to enter user mode.
First process setup path:
| Step | Caller | Function | State | What happens next |
|---|---|---|---|---|
| 1 | userinit | Calls allocproc | UNUSED -> USED | A first process slot is prepared and returned locked. |
| 2 | userinit | p->cwd = namei("/"); initproc = p | Still USED | The kernel records this process as the special init process, with root as its working directory. |
| 3 | userinit | p->state = RUNNABLE | USED -> RUNNABLE | The scheduler can choose it. |
| 4 | userinit | release(&p->lock) | Still RUNNABLE | Other CPUs and the scheduler can safely inspect the slot. |
Child process setup path:
| Step | Caller | Function | State | What happens next |
|---|---|---|---|---|
| 1 | fork syscall | Calls allocproc | UNUSED -> USED | A child process slot is claimed and returned locked. |
| 2 | kfork | uvmcopy | Still USED | Parent user memory is copied into the child. |
| 3 | kfork | Copies parent trapframe | Still USED | Child starts with the same saved user registers. |
| 4 | kfork | np->trapframe->a0 = 0 | Still USED | fork() will return 0 in the child. |
| 5 | kfork | Duplicates open files | Still USED | Child shares references to the parent's open files. |
| 6 | kfork | Duplicates cwd | Still USED | Child starts in the same current directory. |
| 7 | kfork | Copies process name | Still USED | Child gets the parent's debug name. |
| 8 | kfork | release(&np->lock) | Still USED | Child resource setup is complete. |
| 9 | kfork | np->parent = p | Still USED | Parent-child relationship is recorded under wait_lock. |
| 10 | kfork | acquire(&np->lock) | Still USED | The child slot is locked again before the state change. |
| 11 | kfork | np->state = RUNNABLE | USED -> RUNNABLE | The scheduler can choose the child. |
| 12 | kfork | release(&np->lock) | Still RUNNABLE | Other CPUs and the scheduler can inspect the child. |
At this point, process creation has only made a process eligible to run. The transition to actually executing is the scheduler’s job:
scheduler
- Runs forever on each CPU after boot setup.
- Updates `struct cpu`: uses `c->proc` and `c->context` for this CPU, and clears `c->proc` when the CPU is not running a process.
- Briefly enables then disables interrupts before scanning. This avoids deadlock when all processes sleep, while still avoiding a race with `wfi`.
- Scans `proc[]` for a `RUNNABLE` process, locking each process slot before checking its state.
- Changes the chosen process to `RUNNING`.
- Stores the process in this CPU's `c->proc`.
- Calls `swtch(&c->context, &p->context)` to enter the process. `swtch` saves the scheduler's CPU context in `c->context`.
- The process must change its own state before switching back.
- Clears `c->proc` after the process returns to the scheduler.
- Resumes scanning when the process yields, sleeps, or exits back to the scheduler.
- Runs `wfi` if no process is runnable, so the CPU waits for an interrupt.
swtch
- Assembly context switch.
- Saves current `ra`, `sp`, and callee-saved registers into the old context.
- Loads `ra`, `sp`, and callee-saved registers from the new context.
- `scheduler` uses it to enter a process; `sched` uses it to return to the scheduler.
Running, sleeping, and ending:
| Case | Caller / event | Main path | From | To |
|---|---|---|---|---|
| Give up CPU | Timer interrupt or process path | yield | RUNNING | RUNNABLE |
| Wait for event | Kernel code needing a condition | sleep | RUNNING | SLEEPING |
| Event happens | Producer or interrupt handler | wakeup | SLEEPING | RUNNABLE |
| Kill sleeping process | kill syscall path | kkill | SLEEPING | RUNNABLE |
| Exit current process | exit syscall path or fatal user condition | kexit | RUNNING | ZOMBIE |
| Reap child | Parent wait syscall path | kwait | ZOMBIE | UNUSED |
sched
- Low-level switch back to the scheduler.
- Updates `struct cpu`: switches back to `mycpu()->context` and restores `mycpu()->intena`.
- Requires the process to already hold `p->lock`.
- Requires the process state to already be changed away from `RUNNING`.
- Calls `swtch(&p->context, &mycpu()->context)`.
yield
- Voluntarily gives up the CPU for one scheduling round.
- Sets the current process back to `RUNNABLE`.
- Calls `sched` to return to the scheduler.
- Updates CPU state indirectly through `sched`.
sleep
- Puts the current process to sleep on a channel.
- Acquires `p->lock`, releases the condition lock, then sets `p->chan` and `SLEEPING`.
- Calls `sched`, then clears `p->chan` after wakeup.
- Updates CPU state indirectly through `sched`.
- Reacquires the original condition lock before returning.
wakeup
- Scans `proc[]` for processes sleeping on a channel.
- Changes matching processes from `SLEEPING` to `RUNNABLE`.
kexit
- Closes open files and releases the current directory.
- Calls `reparent` to give abandoned children to `initproc`.
- Calls `wakeup(p->parent)` in case the parent is waiting.
- Stores the exit status and changes the process to `ZOMBIE`.
- Calls `sched` and never returns.
- Updates CPU state indirectly through `sched`.
- (Calls all of these: `fileclose`, `iput`, `reparent`, `wakeup`, `sched`.)
kwait
- Parent scans `proc[]` for zombie children.
- Uses `copyout` to return the child's exit status to user space when requested.
- Calls `freeproc` to release the zombie child slot.
- Calls `sleep(p, &wait_lock)` if children exist but none are zombies yet.
- (Calls `copyout`, `freeproc`, `sleep`.)
kkill
- Finds a process by PID and sets `p->killed`.
- If the victim is sleeping, changes it to `RUNNABLE`.
- The victim actually exits later when it checks `killed`.
freeproc
- Releases process-owned kernel resources.
- Frees the trapframe page if present.
- Calls `proc_freepagetable` to free the user page table and user memory if present.
- Clears process metadata and returns the slot to `UNUSED`.
- Requires `p->lock` to already be held.

growproc

- Grows or shrinks the process's user memory by `n` bytes, on behalf of the `sbrk` system call.
- Calls `uvmalloc` to grow and `uvmdealloc` to shrink.
- Updates `p->sz` with the new size.
Scheduling
Multiplexing
Any operating system runs more processes than it has CPUs, so the CPUs must be time-shared among processes. The goal is to make this transparent: each process should have the illusion of its own virtual CPU. xv6 achieves this through multiplexing, switching each CPU between processes in two situations:
- When a process waits for I/O, waits for a child to exit, or calls `sleep`, xv6 switches to another process voluntarily.
- When a process computes for too long without sleeping, xv6 uses periodic hardware timer interrupts to force a switch.
This creates the same kind of illusion that virtual memory creates for physical memory: each process believes it has exclusive access to a resource it actually shares.
Implementing multiplexing cleanly raises several challenges:
- How to switch from one process to another, context switching being simple in concept but complex in implementation.
- How to force switches transparently, which xv6 handles via hardware timer interrupts.
- How to avoid races when all CPUs share the same process table, requiring a careful locking plan.
- How to free a process’s memory and resources on exit, given that a process cannot free its own kernel stack while still using it.
- How each CPU core tracks which process it is running, so system calls affect the correct kernel state.
- How to avoid lost wakeup races in sleep and wakeup.
xv6 solves these as simply as possible, but the resulting code is still among the trickiest in the kernel.
Context Switching
A context switch from one user process to another involves four steps:
- A user-kernel transition on the old process.
- A context switch to the CPU’s scheduler thread.
- A context switch to the new process’s kernel thread.
- A trap return to user space.
Each CPU has its own dedicated scheduler thread with its own stack and saved registers. This is necessary because it is not safe for the scheduler to run on the old process’s kernel stack: another core might wake that process and run it, and two cores executing on the same stack would be a disaster.
At the hardware level, a context switch saves the current thread's registers and restores a previously saved set. The stack pointer and return address are among the saved registers, so the CPU switches both stacks and the code it is executing. The swtch function performs this save and restore for kernel thread switches. It operates on struct context, which holds ra, sp, and the twelve callee-saved RISC-V registers s0–s11. Caller-saved registers are handled by the C compiler, which saves them on the stack before calling swtch.
Key properties of swtch:
- It takes two arguments, a pointer to the old context and a pointer to the new context. It saves current registers into the old, loads from the new, and returns.
- It saves the `ra` register (the return address) rather than the program counter directly.
- When it returns, it resumes at the instruction pointed to by the restored `ra`, on the new thread's stack.
The typical flow during a voluntary yield:
- `usertrap` calls `yield`.
- `yield` calls `sched`.
- `sched` calls `swtch`, saving the process's context and switching to the per-CPU scheduler context.
- When `swtch` returns, execution resumes inside `scheduler`, not `sched`, on the scheduler's stack.

The Scheduler
The scheduler runs as a special per-CPU thread, executing a loop that finds a runnable process, runs it until it yields, and repeats. A process that wants to give up the CPU must follow a specific sequence before calling sched:
- Acquire its own process lock `p->lock`.
- Release any other locks it holds.
- Update its state in `p->state`.
- Call `sched`.
sched verifies these requirements, then calls swtch to save the current context and switch to the scheduler context. The scheduler then continues its loop:

- Finds the next runnable process.
- Sets `c->proc`.
- Marks it `RUNNING`.
- Calls `swtch` to run it.
sched and scheduler are coroutines: sched always switches to scheduler, and scheduler almost always switches back to some thread that previously called sched. The one exception is a newly created process, whose context is set up by allocproc to jump to forkret on its first switch. forkret releases p->lock and returns to user space as if returning from fork.
xv6 holds p->lock across calls to swtch, meaning the thread that acquires the lock is not the one that releases it. This breaks the usual locking convention but is necessary because p->lock protects invariants on the process’s state and context fields that do not hold during swtch itself. Without the lock, another CPU could see the process as RUNNABLE and start running it before swtch has finished, putting two CPUs on the same kernel stack.
The invariants p->lock enforces depend on the process state:
- RUNNING: the CPU registers hold the process's values, and `c->proc` refers to it. A timer interrupt must be able to yield safely at any point.
- RUNNABLE: `p->context` holds the saved registers, no CPU is executing on its kernel stack, and no `c->proc` points to it. An idle scheduler must be able to pick it up safely.
These invariants are the reason p->lock is often acquired in one thread and released in another. yield acquires it and scheduler releases it; scheduler acquires it and yield releases it. The lock must be held continuously across the transition to keep the invariants intact.
Per-CPU State
On a uniprocessor, a global variable pointing to the current process would suffice. On a multi-core machine, each core executes a different process, so per-core state is needed. xv6 solves this by dedicating one hardware register per CPU to help locate per-core information.
Each CPU has a struct cpu that records the process currently running on it, the saved registers for the scheduler thread, and the nesting count of spinlocks for interrupt management. mycpu returns a pointer to the current CPU’s struct cpu by reading the tp register, which holds the CPU’s hardware thread ID (hartid). xv6 uses hartid to index into an array of struct cpu to find the right one.
Keeping tp correct across all transitions requires care:
- `mstart` sets `tp` early in the boot sequence, while still in machine mode.
- `usertrapret` saves `tp` into the trapframe before returning to user space, since user code could modify it.
- `uservec` restores the saved `tp` when entering the kernel from user space.
The return value of mycpu is fragile: a timer interrupt could move the thread to a different CPU, making the returned pointer stale. Callers must disable interrupts before calling mycpu and only re-enable them after they are done using the result.
myproc returns the struct proc of the process running on the current CPU. It disables interrupts, calls mycpu, reads c->proc, and re-enables interrupts. Unlike mycpu, the return value of myproc remains valid even after interrupts are re-enabled, since a process's struct proc pointer does not change when it moves between CPUs.
Sleep and Wakeup
Locks and scheduling hide concurrent activity from threads, but sometimes threads need to intentionally wait for each other. A pipe reader may need to wait for a writer to produce data; a parent calling wait may need to wait for a child to exit; a process reading the disk must wait for the hardware to finish. xv6 handles these cases with sleep and wakeup, also called sequence coordination or conditional synchronization mechanisms.
sleep(chan, lock) puts the calling process to sleep on an arbitrary wait channel chan. wakeup(chan) wakes all processes sleeping on that channel. The wait channel can be any mutually convenient value; xv6 typically uses the address of a kernel data structure involved in the waiting.
To show why the interface must be designed carefully, consider building a semaphore using sleep and wakeup. A semaphore maintains a count with two operations:
- V (producer): increments the count and calls `wakeup`.
- P (consumer): waits until the count is non-zero, then decrements it.
A naive P that spins checking the count wastes CPU time. Replacing the spin with sleep seems natural but introduces the lost wakeup problem: if V runs between P’s check of the count and P’s call to sleep, V finds no sleeping process and does nothing. P then sleeps waiting for a wakeup that has already happened. Having P acquire the lock before checking the count avoids this race but causes a deadlock: P holds the lock while sleeping, so V can never acquire it.
The correct fix is to pass the condition lock into sleep. The caller acquires the lock before checking the condition, then passes it to sleep, which atomically marks the process as sleeping and releases the lock. This ensures no other process can call wakeup between the check and the sleep.
How Sleep and Wakeup Work
When sleep is called, it acquires p->lock while still holding the condition lock. At this point both locks are held. sleep can then safely release the condition lock: any concurrent call to wakeup will block on p->lock and wait until the process is fully marked SLEEPING. Once p->lock is held alone, sleep records the channel, sets the state to SLEEPING, and calls sched to yield the CPU. p->lock is not released until after the process is marked SLEEPING, which is why the ordering is critical.
wakeup loops over the process table, acquires p->lock for each process it inspects, and sets any SLEEPING process with a matching channel to RUNNABLE. The next scheduler pass will pick it up. wakeup should be called while holding the condition lock so that the waker and the sleeper cannot miss each other.
The no-lost-wakeup guarantee holds because the sleeping process holds either the condition lock or p->lock or both from before it checks the condition until after it is marked SLEEPING. The waker holds both locks in its loop, so it either makes the condition true before the sleeper checks it, or it finds the process already marked SLEEPING and wakes it.
Two additional points:
- Multiple processes may sleep on the same channel. A single `wakeup` wakes them all, but only one will find the condition true. The rest will find no data and must sleep again. For this reason, `sleep` is always called inside a `while` loop that re-checks the condition.
- Two unrelated uses of sleep/wakeup may accidentally share a channel, producing spurious wakeups. The `while` loop handles this safely.
Pipes
A pipe is a practical example of sleep/wakeup coordinating producers and consumers. Each pipe is a struct pipe containing a lock, a circular data buffer, and two counters: nread (total bytes read) and nwrite (total bytes written). The buffer wraps around, but the counters do not, which lets the implementation distinguish a full buffer (nwrite == nread + PIPESIZE) from an empty one (nwrite == nread). Indexing into the buffer uses buf[nread % PIPESIZE] rather than buf[nread] directly.
Consider pipewrite and piperead running simultaneously on two CPUs:
- `pipewrite` acquires the pipe lock and starts copying bytes into the buffer.
- `piperead` tries to acquire the lock, finds it held, and spins.
- If the buffer fills before `pipewrite` finishes, it calls `wakeup` on `&pi->nread` to alert any sleeping readers, then sleeps on `&pi->nwrite`, releasing the lock in the process.
- With the lock now free, `piperead` acquires it, finds data available, copies bytes out of the buffer, and increments `nread`.
- `piperead` calls `wakeup` on `&pi->nwrite` to wake any sleeping writers, then returns.
- The wakeup finds `pipewrite` sleeping on `&pi->nwrite` and marks it `RUNNABLE`. The scheduler will resume it shortly.
Pipes use separate sleep channels for readers (pi->nread) and writers (pi->nwrite). This avoids waking the wrong side unnecessarily when many readers and writers are waiting on the same pipe. Both piperead and pipewrite sleep inside a loop that re-checks the condition, so any process woken spuriously will simply sleep again.
Wait, Exit, and Kill
A parent and child can reach their wait/exit interaction in any order: the parent may already be sleeping in wait when the child exits, or it may call wait long after. xv6 handles this by having exit put the process into a ZOMBIE state. The child stays there until the parent’s wait notices it, frees the child’s resources, copies the exit status, and returns the child’s PID. If the parent exits before the child, the child is reparented to init, which perpetually calls wait, ensuring every process has a parent to clean up after it.
wait acquires wait_lock first, which acts as the condition lock to prevent the parent from missing a wakeup from an exiting child. It then scans the process table:
- If a child is in `ZOMBIE` state, `wait` frees its resources and returns its PID.
- If children exist but none have exited, `wait` calls `sleep` and scans again when woken.
When holding multiple locks, wait always acquires wait_lock before any process’s p->lock to avoid deadlock.
exit performs several steps under both wait_lock and p->lock. It records the exit status, frees resources, reparents its children to init, wakes the parent, marks itself ZOMBIE, and permanently yields the CPU. wait_lock is held to prevent the parent from missing the wakeup. p->lock is held to prevent the parent from observing the ZOMBIE state before the child has finished calling swtch. Although exit wakes the parent before setting the state to ZOMBIE, this is safe: the parent’s loop in wait cannot examine the child until the scheduler releases p->lock, which happens only after the state is set.
kill does not destroy a process directly, since the victim may be executing on another CPU in the middle of sensitive kernel work. Instead, kill sets the victim’s p->killed flag and wakes it up if it is sleeping. The wakeup may cause sleep to return even though the condition is not yet satisfied, but since all sleep calls are inside while loops that re-check the condition, this is handled safely. When the victim next enters or exits the kernel, usertrap checks p->killed and calls exit.
Some sleep loops also check p->killed directly and abandon the current operation if it is set. However, not all do: loops in the middle of a multi-step atomic operation, such as a disk write sequence, do not check p->killed midway, since abandoning partway would leave the file system in an inconsistent state. A process killed during such an operation will not exit until the current system call completes.
Process Locking
p->lock is the most complex lock in xv6. At a basic level it must be held when reading or writing p->state, p->chan, p->killed, p->xstate, and p->pid, since other processes and scheduler threads on other cores can access these fields. Beyond that, p->lock protects higher-level invariants across xv6’s process data structures and algorithms. Its full set of responsibilities:
- Prevents races when allocating `proc[]` slots for new processes.
- Conceals a process from view while it is being created or destroyed.
- Prevents a parent's `wait` from collecting a process that has set its state to `ZOMBIE` but has not yet yielded the CPU.
- Prevents another core's scheduler from running a yielding process after it sets its state to `RUNNABLE` but before it finishes `swtch`.
- Ensures only one core's scheduler runs a `RUNNABLE` process.
- Prevents a timer interrupt from causing a process to yield while it is inside `swtch`.
- Together with the condition lock, prevents `wakeup` from overlooking a process that is calling `sleep` but has not finished yielding the CPU.
- Prevents the victim of `kill` from exiting and being reallocated between `kill`'s check of `p->pid` and its write of `p->killed`.
- Makes `kill`'s check and write of `p->state` atomic.
p->parent is not protected by p->lock. It is protected by the global wait_lock, since only a process’s parent writes it, though other processes searching for their children may read it. wait_lock also acts as the condition lock when wait sleeps waiting for a child to exit: an exiting child holds either wait_lock or p->lock until after it has set its state to ZOMBIE, woken up its parent, and yielded the CPU. wait_lock serializes concurrent exits by a parent and child, guaranteeing init is woken up when it inherits an orphan. It is a global lock rather than a per-process lock because a process cannot know who its parent is until it acquires the lock.
Real World
xv6 uses round-robin scheduling, running each process in turn. Real operating systems implement priority-based policies that prefer high-priority runnable processes over low-priority ones. These policies grow complex quickly because competing goals such as fairness, throughput, and responsiveness must be balanced simultaneously. Complex policies can also produce unintended side effects:
- Priority inversion: a low-priority process holds a lock needed by a high-priority process, stalling it.
- Convoys: many high-priority processes queue behind a low-priority one that holds a shared lock; once formed, a convoy can persist for a long time.
The lost wakeup problem is a fundamental challenge for any sleep/wakeup mechanism. Different systems have tackled it differently:
- Early Unix simply disabled interrupts during sleep, which worked because Unix ran on a single CPU.
- xv6 and FreeBSD use an explicit lock passed into sleep.
- Plan 9’s sleep accepts a callback function that runs under the scheduling lock just before sleeping, as a last-minute condition check.
- Linux uses a wait queue with its own internal lock instead of a plain channel.
All of these share the same pattern: the sleep condition is protected by a lock that is dropped atomically as the process goes to sleep.
A related problem is the thundering herd: a single wakeup wakes all processes waiting on a channel, and they all race to check the condition. Most condition variable implementations address this with two primitives: signal, which wakes one process, and broadcast, which wakes all. Semaphores avoid this entirely by maintaining an explicit count of wakeups, which also naturally eliminates lost wakeups and spurious wakeups.
Process termination is complex in most operating systems. A process being killed may be deep inside the kernel with multiple stack frames that each require cleanup. Some languages provide exception mechanisms to help unwind the stack; C does not. Unix handles this partly through signals: another process can send a signal to a sleeping process, causing it to return from the system call with -1 and EINTR, letting the application decide what to do. xv6 does not support signals, so this complexity does not arise.
xv6’s kill support has known limitations. Some sleep loops do not check p->killed, so a killed process may not notice until the condition it is waiting for occurs somewhere down the line. There is also a race between sleep and kill: kill may set p->killed and attempt a wakeup just after the victim checks the flag but before it calls sleep, causing the kill to be missed entirely. A real operating system would also maintain an explicit free list for proc structures to allocate a free slot in constant time; xv6 uses a linear scan for simplicity.
This first process does not contain user code yet.
- The scheduler will later choose this `RUNNABLE` process.
- The first context switch enters `forkret`.
- `forkret` runs in kernel mode on this process's kernel stack.
- The first `forkret` call runs `fsinit(ROOTDEV)` because filesystem initialization may sleep.
- After the filesystem is ready, `forkret` calls `kexec("/init", ...)`.
- `kexec` loads `/init` from the filesystem into the process's user address space.
- The trampoline return path then enters user mode.
swtch only switches kernel contexts.
- It saves `ra`, `sp`, and callee-saved registers into the old `struct context`.
- It loads `ra`, `sp`, and callee-saved registers from the new `struct context`.
- It does not switch user registers directly.
- User registers are handled separately by the trapframe and trampoline path.
The first process reaches user mode through this path.
- `userinit` created a `RUNNABLE` process whose context starts at `forkret`.
- `scheduler` picks that process and switches into `p->context`.
- `forkret` runs `fsinit(ROOTDEV)` the first time.
- `forkret` calls `kexec("/init", ...)`.
- `prepare_return` and trampoline code return into user mode.
- `/init` starts running as the first user program.
If no process is runnable, the CPU executes wfi and waits for an interrupt.
When a runnable process is found:
- `p->state` changes from `RUNNABLE` to `RUNNING`.
- `c->proc` is set to that process.
- `swtch(&c->context, &p->context)` saves the scheduler context and loads the process context.
- The process starts or resumes in kernel mode.
- When the process later gives up the CPU, it switches back to `c->context`.
- `c->proc` is cleared after control returns to the scheduler.