A process acts as a program (object code stored on some media) in the midst of execution, encompassing object code alongside resources such as;

  • open files,
  • pending signals,
  • internal kernel data,
  • processor state,
  • a memory address space with one or more memory mappings,
  • one or more threads of execution,
  • and a data section containing global variables

Threads: The objects of activity within a process, containing a unique program counter, a process stack, and a set of processor registers.

  • Kernel schedules individual threads, not processes
  • Multithreaded programs consist of more than one thread
  • Linux does not differentiate between threads and processes

On modern operating systems, processes provide two virtualizations:

  • a virtualized processor (an illusion that it alone monopolizes the system despite possibly sharing the processor among hundreds of other processes)
  • a virtual memory (allocates and manages memory as if it alone owned all the memory in the system)

Note: Threads share the virtual memory abstraction whereas, each receives its own virtualized processor.

A Process Lifecycle:

  • A process begins its life via the fork() system call, duplicating an existing parent process into a new child process.
  • The fork() system call returns from the kernel twice: once in the parent process and again in the newborn child.
  • The exec() family of function calls creates a new address space and loads a new program into it. (via the clone() system calls)
  • A program exists via the exit() system call, terminating the process, freeing its resources, and placing it into a zombie state until the parent queries its status using wait4().

Another name for a process is a task. The Linux kernel internally refers to processes as tasks.

The Task Structure

The kernel stores the list of processes in a circular doubly linked list called the task list where each element is a process descriptor of the type struct task_struct which is defined in <linux/sched.h>.

  • contains all kernel-required information about a process, including open files, address space, pending signals, and process state.
  • consumes approximately 1.7 kilobytes on a 32-bit architecture.

Allocation: The task_struct is dynamically created via the slab allocator to facilitate object reuse and cache coloring.

The thread_info structure is defined on x86 in <asm/thread_info.h> which contains a direct pointer (task) to the actual task_struct:

struct thread_info {
	struct task_struct    *task;
	struct exec_domain    *exec_domain;
	__u32                  flags;
	__u32                  status;
	__u32                  cpu;
	int                    preempt_count;
	mm_segment_t           addr_limit;
	struct restart_block   restart_block;
	void                  *sysenter_return;
	int                    uaccess_err;
};

Prior to kernel 3.2 task_struct was stored at the end of the kernel stack of each process which allowed architectures with few registers, such as x86, to calculate the location of the process descriptor via the stack pointer without using an extra register to store the location.

A new structure, struct thread_info, was created that again lives at the bottom of the stack (for stacks that grow down) and at the top of the stack (for stacks that grow up). The new structure also makes it rather easy to calculate offsets of its values for use in assembly code.

Currently, the thread_info exists as a low level task data that entry.S needs immediate access to.

Process Identification (PID): A numerical value of opaque type pid_t (typically an int) identifying tasks. The default maximum is (short int limit) for backward compatibility, but optionally expandable to (controlled in <linux/threads.h>) via /proc/sys/kernel/pid_max.

The current Macro: Quickly looks up the process descriptor of the actively executing task. This macro must be independently implemented by each architecture.

TopicModern RISC-VModern x86
Where current livesIn the tp registerIn a per-CPU variable named current_task
Basic mental modelcurrent == tp*(cpu_base+current_task_offset)
Is it a register read?YesNo
Is it still fast?Yes, very fastYes, optimized CPU-local memory access
Main access costRegister readOne per-CPU memory load, usually cache-hot
get_current() doesReturns the value in tpReads this CPU’s current_task
Pointer storedstruct task_struct *struct task_struct *
Storage locationCPU registerKernel per-CPU memory area
One copy per CPU?Effectively yes, each CPU has its own tp valueYes, each CPU has its own current_task slot
How context switch updates itWrites next task pointer into tpWrites next task pointer into this CPU’s current_task
Needs CPU-number lookup?NoNo
Needs array indexing?NoNo
Relationship to thread_infothread_info is at offset 0 inside task_struct;thread_info is at offset 0 inside task_struct;
current_thread_info()Casts current to struct thread_info *Casts current to struct thread_info *
Why it is fasttp is a hardware registerPer-CPU base is already known to the CPU/kernel; access can be compiled into a direct CPU-local load
Typical low-level formRead tpRead from per-CPU area, often via gs:/CPU-local base mechanism
Cache behaviorNo memory access just to get currentMemory access, but current_task is cache-hot
Design styleKeep current task pointer in a dedicated registerKeep current task pointer in per-CPU storage

Process State

The state field of the process descriptor dictates the current condition via exactly one of five flags:

Task stateMeaningCan run?Signal behaviorTypical cause
TASK_RUNNINGTask is runnable: either currently executing or waiting on a runqueueYesNormal signal handlingActively running or ready to be scheduled
TASK_INTERRUPTIBLETask is sleeping while waiting for a condition/eventNoCan wake early if a signal is pendingWaiting for I/O, timers, locks, resources, etc.
TASK_UNINTERRUPTIBLETask is sleeping while waiting for a condition, but should not be interrupted by signalsNoIgnores signals for wakeup purposesShort critical waits, device I/O, waits that must complete
__TASK_TRACEDTask is stopped because it is being traced/debuggedNoControlled by debuggerptrace
__TASK_STOPPEDTask execution is stopped and it is not eligible to runNoStopped due to stop signalsSIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU

State Manipulation: Kernel changes a process’s state via set_current_state(state), which sets the state and forces ordering memory barriers on SMP systems.

Process Context: When a program executes a system call or triggers an exception (remember xv6?)

  • enters kernel-space and the kernel is executing on behalf of the process
  • and is in process context where the current macro is valid
  • upon exiting the kernel, the process resumes execution in user-space, unless a higher-priority process has become runnable in the interim, in which case the scheduler is invoked to select the higher priority process.
  • System calls and exception handlers are well-defined interfaces into the kernel. Other than process context there is interrupt context where the system is not running on behalf of a process but is executing an interrupt handler.

The Process Family Tree

All processes are descendants of the init process (PID 1), which is spawned by the kernel during the final boot steps.

  • starts init in the last step of the boot process
  • reads the system initscripts and executes more programs, eventually completing the boot process Every process on the system has exactly one parent. Likewise, every process has zero or more children.

The relationship pointers are stored directly in task_struct:

  • parent: A pointer to the parent’s task_struct.
  • children: A list of child processes.
  • sibling: A list element connecting direct children of the same parent.

Thus it is possible to obtain the process descriptor of its parent with the following code:

struct task_struct *my_parent = current->parent;

Similarly, it is possible to iterate over a process’s children with:

struct task_struct *task;
struct list_head *list;
list_for_each(list, &current->children) {
	task = list_entry(list, struct task_struct, sibling);
	/* task now points to one of current’s children */
}

The init task’s process descriptor is statically allocated as init_task so this code will always succeed:

struct task_struct *task;
for (task = current; task != &init_task; task = task->parent)
	;
/* task now points to init */

Oftentimes, however, it is desirable simply to iterate over all processes in the system. This is easy because the task list is a circular, doubly linked list.

  • To obtain the next task in the list, given any valid task, use
list_entry(task->tasks.next, struct task_struct, tasks)
  • Obtaining the previous task works the same way:
list_entry(task->tasks.prev, struct task_struct, tasks)

These two routines are provided by the macros

  • next_task(task) and
  • prev_task(task), respectively.

Finally, the macro which iterates over the entire task list. On each iteration, task points to the next task in the list:

struct task_struct *task;
for_each_process(task) {
	/* this pointlessly prints the name and PID of each task */
	printk(“%s[%d]\n”, task->comm, task->pid);
}

Note: It is expensive to iterate over every task in a system with many processes; code should have good reason (and no alternative) before doing so.

Process Creation

Unix process creation separates the spawn mechanism into

The first, fork(), creates a child process that is a copy of the current task.

  • differs from the parent only in its PID (which is unique),
  • its PPID (parent’s PID),
  • which is set to the original process), and
  • certain resources and statistics, such as pending signals, which are not inherited. The second function, exec(), loads a new executable into the address space and begins executing it.

Copy-on-Write (COW): A technique delaying or preventing the copying of data upon fork().

  • parent and child share a single, read-only copy of the address space.
  • duplication occurs strictly when data is written to.
  • reduces fork() overhead to only duplicating the parent’s page tables and generating a unique process descriptor.

Consequently, the duplication of resources occurs only when they are written; until then, they are shared read-only

Forking

Linux implements fork() via the clone() system call which takes a series of flags that specify which resources, if any, the parent and child process should share.

The fork(), vfork(), and __clone() library calls all invoke the clone() system call with the requisite flags.

The only benefit to vfork() is not copying the parent page tables entries. The vfork() system call is implemented via a special flag to the clone() system call.

  1. These enter the common creation path through kernel_clone() in kernel/fork.c. Older explanations often call this path do_fork().
  2. kernel_clone() calls copy_process(), which does most of the child-creation work as it copies registers and the needed parts of the process environment according to the clone flags; actually starting the child is left to the caller.
  3. copy_process() calls dup_task_struct(current, node).
  4. dup_task_struct() allocates a new task_struct, duplicates the architecture task state, and allocates a new kernel stack for the child.
  5. At this point, the child’s task_struct is mostly a copy of the parent’s, but it is not ready to run yet.
  6. copy_process() checks resource limits, including the per-user process limit and the global thread limit.
  7. copy_process() clears or reinitializes fields that should not be inherited directly.
  8. It clears flags such as PF_SUPERPRIV, PF_WQ_WORKER, PF_IDLE, and PF_NO_SETAFFINITY
  9. Then sets PF_FORKNOEXEC.
  10. It initializes child-specific lists and fields such as
    • children,
    • sibling,
    • pending signals,
    • CPU-time accounting, and
    • vfork_done.
  11. By default, p->vfork_done is set to NULL during copy_process().
  12. copy_process() then duplicates or shares major process resources depending on the clone_flags.
  13. For memory, copy_mm() handles the address space.
  14. If CLONE_VM is set, the child shares the parent’s mm_struct.
  15. If CLONE_VM is not set, copy_mm() calls dup_mm(), which duplicates the parent’s memory descriptor and calls dup_mmap() to duplicate the page-table state.
  16. For normal fork(), the page tables are duplicated enough to map the same physical pages, but the actual user pages are shared copy-on-write.
  17. For open files, filesystem state, signal handlers, namespaces, and related structures, the kernel either shares or copies them depending on related clone flags.
  18. alloc_pid() assigns the child a PID unless the task is a special init/idle-style task.
  19. copy_process() attaches the new task into kernel bookkeeping structures such as PID lists, process lists, and parent/child relationships.
  20. copy_process() returns a pointer to the new child task_struct.
  21. Back in kernel_clone(), the kernel obtains the visible PID value for the new task.
  22. If this is a normal fork/clone, kernel_clone() calls wake_up_new_task(p) to make the child runnable.
  23. After that, the parent returns from the syscall, and the child begins execution when scheduled.
  24. The scheduler may run the child first. This can reduce COW overhead if the child immediately calls exec().
  25. vfork() is a special case built using clone flags, mainly CLONE_VFORK and usually CLONE_VM.
  26. With vfork(), the child shares the parent’s address space instead of getting a duplicated mm_struct.
  27. Because the address space is shared, the parent must not continue executing concurrently with the child.
  28. In the CLONE_VFORK path, kernel_clone() sets before waking the child.
p->vfork_done = &vfork;
init_completion(&vfork);
  1. Then kernel_clone() calls wake_up_new_task(p) to start the child.
  2. After starting the child, the parent waits in wait_for_vfork_done() instead of immediately returning to user space.
  3. mm_release() is not part of creating the child.
  4. mm_release() matters later, when the child exits or calls exec() and stops using the shared address space.
  5. exit_mm_release() and exec_mm_release() call mm_release().
  6. Inside mm_release(), if tsk->vfork_done is non-NULL, the kernel calls complete_vfork_done(tsk).
  7. complete_vfork_done() completes the parent’s wait and clears tsk->vfork_done.
  8. The parent then wakes up and returns from vfork().
  9. At that point, the child has either exited or moved into a new address space through exec().

Threads of Execution

A thread is a normal task that shares selected resources with another task where each thread receives its own task_struct with its own execution state, such as registers and kernel stack. This differs from operating systems such as Microsoft Windows, where the kernel has explicit thread objects which are often describe threads as lightweight processes because the process owns the shared resources and the threads own the execution state.

Creation: Threads are created like normal tasks, but clone() receives flags that specify which resources the parent and child share.

  • CLONE_VM: Parent and child share the same address space.
  • CLONE_FS: Parent and child share filesystem information.
  • CLONE_FILES: Parent and child share open files.
  • CLONE_SIGHAND: Parent and child share signal handlers and blocked signals.

With these flags, the new task behaves like a normal child except that key resources are shared with the parent.

The same clone() mechanism also expresses normal process creation variants:

  • fork() behaves like clone(SIGCHLD, 0)
  • vfork() behaves like clone(CLONE_VFORK | CLONE_VM | SIGCHLD, 0)

The clone flags are defined in <linux/sched.h> and describe the resource sharing and setup behavior of the new task:

FlagMeaning
CLONE_FILESParent and child share open files.
CLONE_FSParent and child share filesystem information.
CLONE_IDLETASKSet PID to zero, used only by idle tasks.
CLONE_NEWNSCreate a new namespace for the child.
CLONE_PARENTChild has the same parent as its parent.
CLONE_PTRACEContinue tracing the child.
CLONE_SETTIDWrite the TID back to user space.
CLONE_SETTLSCreate a new TLS for the child.
CLONE_SIGHANDParent and child share signal handlers and blocked signals.
CLONE_SYSVSEMParent and child share System V SEM_UNDO semantics.
CLONE_THREADParent and child are in the same thread group.
CLONE_VFORKvfork() was used and the parent sleeps until the child wakes it.
CLONE_UNTRACEDDo not let the tracing process force CLONE_PTRACE on the child.
CLONE_STOPStart the process in the TASK_STOPPED state.
CLONE_CHILD_CLEARTIDClear the TID in the child.
CLONE_CHILD_SETTIDSet the TID in the child.
CLONE_PARENT_SETTIDSet the TID in the parent.
CLONE_VMParent and child share address space.

Kernel Threads

Kernel performs background work without associating that work with a user-space process via kernel threads which are standard tasks that exist only in kernel space.

  • possess no user address space, so their mm pointer is NULL
  • operate only in kernel space and never context switch into user space
  • remain schedulable and preemptable like normal tasks
  • handle background kernel work such as flush tasks and ksoftirqd

During boot, the kernel creates kthreadd, and new kernel threads are forked from this process.

The interface for creating a kernel thread is declared in <linux/kthread.h>:

struct task_struct *kthread_create(
	int (*threadfn)(void *data),
	void *data,
	const char namefmt[], ...
)

The new task is created via the clone() system call by the kthread kernel process in an unrunnable state, so the caller must explicitly wake it with wake_up_process().

The macro kthread_run() combines creation and wakng up:

  • creates the kernel thread with kthread_create()
  • wakes it with wake_up_process()

Once started, a kernel thread continues running until it exits itself or another part of the kernel stops it.

  • do_exit() terminates the current kernel thread.
  • kthread_stop() requests termination of a kernel thread using the task_struct returned by kthread_create().
int kthread_stop(struct task_struct *k)

Whether a user thread, kernel thread, or standard process, all tasks eventually reach the end of their lifecycle and must be dismantled.

Process Termination

When a process terminates, the kernel releases the resources owned by the process and notifies the parent of the process’s death.

Triggers: Generally a process destruction is self-induced while can also terminate involuntarily.

  • intentional termination occurs through the exit() system call
  • returning from main() implicitly calls exit() through compiler/runtime support
  • involuntary termination occurs when the process receives a signal or exception it cannot handle or ignore

Regardless of the cause, most process destruction is handled by do_exit() in kernel/exit.c.

  1. Sets the PF_EXITING flag in task_struct.
  2. Calls del_timer_sync() to remove kernel timers and ensure no timer handler is still running.
  3. Calls acct_update_integrals() if BSD process accounting is enabled to write out information.
  4. Calls exit_mm() to release the task’s mm_struct; if the address space is not shared, the kernel destroys it.
  5. Calls exit_sem() to dequeue the task from IPC semaphores if needed.
  6. Calls exit_files() and exit_fs() to decrement usage counts for file descriptor and filesystem objects.
  7. Stores the exit status in exit_code so the parent can retrieve it later.
  8. Calls exit_notify() to signal the parent, reparent children, and set the task’s exit state to EXIT_ZOMBIE.
  9. Calls schedule() to switch to another task.
  10. The do_exit() never returns After this point, the task is no longer runnable and no longer has an address space in which to execute.

The task still remains as a zombie so the parent can collect its termination information. At this stage, most owned resources are already gone, but the kernel still keeps enough task state to report the exit result.

  • the kernel stack remains
  • thread_info
  • task_struct

The descriptor removal is a separate step. The kernel cleans up most resources during do_exit(), but the process descriptor is not freed until the parent collects the child’s status or tells the kernel it does not care.

The wait() family of functions is implemented through wait4() whose usual behavior is to suspend the calling task until one of its children exits.

Removing the Process Descriptor:

When a child exits, wait4() returns the child’s PID and can also copy the child’s exit status back to the parent and to finally deallocate the process descriptor release_task() performs the final removal after the parent has consumed the child’s exit information:

  1. Calls __exit_signal(), which calls detach_pid() to remove the process from PID bookkeeping and the task list.
  2. The kernel finalizes statistics and releases remaining resources.
  3. If the task was the last member of a thread group, and the group leader is a zombie, release_task() notifies the zombie leader’s parent.
  4. put_task_struct() frees the kernel stack, thread_info, and the task_struct slab object.

The Dilemma of the Parentless Task:

If a parent exits before its children, those children need a new parent. Otherwise, parentless tasks could become permanent zombies and waste memory.

The solution is reparenting. During exit, do_exit() calls exit_notify(), which eventually calls forget_original_parent() and find_new_reaper().

  • the kernel first tries to reparent children to another task in the same thread group
  • if no suitable thread-group task exists, the children are reparented to the init process
  • init routinely calls wait() on its children, which cleans up zombies assigned to it

The kernel keeps ptraced children on a separate list because traced tasks are temporarily reparented to the debugger. This lets parent-exit cleanup handle them without scanning every process in the system.

Process Scheduling

The scheduler is the kernel subsystem that decides which runnable task executes next, keeping processors busy while making multiple processes appear to run at the same time.

The basic rule is simple: if at least one task is runnable, a processor should not sit idle. When there are more runnable tasks than available processors, some tasks must wait on a runqueue until the scheduler chooses them. The scheduler’s core responsibility is deciding which runnable task should run next.

Multitasking

A multitasking operating system interleaves the execution of multiple tasks so the system can make progress on more than one program over time.

  • On a single-processor machine, multitasking creates the illusion that multiple tasks are running concurrently.
  • On a multiprocessor machine, tasks can actually run in parallel on different processors.

Many tasks in memory are not runnable at a given moment because they are sleeping while waiting for work, I/O, a timer, a lock, or another event.

The scheduler only chooses from runnable tasks, so a system can have many tasks in memory but only a small number eligible to execute. Multitasking is commonly divided into two models:

  • Preemptive multitasking: The scheduler can involuntarily stop the currently running task and switch to another runnable task.
  • Cooperative multitasking: A task keeps running until it voluntarily yields or blocks, which makes system responsiveness depend on task behavior.

Preemptive multitasking requires scheduling policy because the kernel must balance competing goals: low latency for interactive tasks and high throughput for processor-bound tasks.

Policy

Scheduling policy determines how the scheduler balances responsiveness, fairness, and processor utilization.

Tasks usually fall into two broad behavior patterns:

  • I/O-bound tasks: Spend most of their time sleeping while waiting for I/O, user input, timers, locks, or other events. They run briefly after waking and need low latency.
  • Processor-bound tasks: Spend most of their time executing code and block infrequently. They benefit from longer uninterrupted execution because it improves throughput and cache locality.

These categories are not strict since a task can behave like an I/O-bound task at one moment and a processor-bound task at another.

Scheduling policy must balance two competing goals:

  • Low latency: Interactive and wakeup-heavy tasks should run quickly after becoming runnable.
  • High throughput: Processor-bound tasks should get enough continuous CPU time to do useful work efficiently.

Linux generally favors good interactive response while still preventing processor-bound tasks from being starved.

Process Priority: The priority expresses how much processor time or scheduling preference a task should receive.

Timeslices: A timeslice is the amount of processor time a task can run before the scheduler considers preempting it. Traditional schedulers often assigned fixed timeslices based on priority.

Fixed timeslices create a policy tradeoff:

  • A long timeslice improves throughput and cache locality, especially for processor-bound tasks.
  • A short timeslice improves responsiveness, but too much switching wastes CPU time on context-switch overhead.
  • I/O-bound tasks usually do not need long continuous execution, but they need to run quickly when they wake up.
  • Processor-bound tasks benefit from longer continuous execution because they can keep the processor and caches busy.

The O(1) Scheduler

The scheduler was called O(1) because selecting the next task took constant time, regardless of the number of runnable tasks in the system.

Key ideas:

  • The scheduler used priority-based runqueues.
  • Each CPU had its own runqueue, which improved scalability on multiprocessor systems.
  • Runnable tasks were placed into priority arrays.
  • The scheduler could find the highest-priority runnable task without scanning every task.
  • Tasks received timeslices, and those timeslices were tied to priority.
  • Higher-priority tasks generally ran sooner and could receive more favorable CPU treatment.

The O(1) scheduler scaled well for large server workloads because it avoided expensive global scans and kept scheduling decisions fast.

Its main weakness was interactive latency because its heuristics for detecting and rewarding wakeup-heavy interactive tasks could behave poorly in some workloads.

The Completely Fair Scheduler (CFS)

CFS is the scheduler class used for normal, non-real-time tasks which replaced fixed-timeslice scheduling with a fair-share model based on how much processor time each runnable task has already received.

Linux’s scheduler is modular through scheduler classes. Each class implements its own scheduling algorithm and has a priority relative to the other classes.

  • The generic scheduler code in kernel/sched.c checks scheduler classes in priority order.
  • The highest-priority scheduler class with a runnable task chooses the next task.
  • CFS handles the SCHED_NORMAL class for non-real-time tasks.
  • CFS is implemented in kernel/fair.c.

Linux uses two separate priority spaces:

  • Nice values: Normal tasks use nice values from to , with a default of . Lower nice values mean higher priority, while higher nice values mean lower priority.
  • Real-time priorities: Real-time tasks use priorities from to , where higher values mean higher priority. Real-time tasks have strict priority over normal tasks.

Timeslices: The core problem is that absolute timeslices produce a constant switching rate but variable fairness. CFS takes the opposite approach where it does not assign fixed timeslices to normal tasks in the traditional sense; instead, it assigns each runnable task a fair proportion of processor time. This gives CFS constant fairness with a variable switching rate.

Fair Scheduling Model

CFS is based on:

Perfect Multitasking: each process would receive of the processor’s time, whee is the number of runnable processes, and they are scheduled for infinitely small durations..

A real single processor cannot literally run multiple tasks at once as context switches consume time and disturb cache state, so CFS cannot run every task for infinitesimally small intervals.

CFS approximates the ideal model by running one task for a small amount of time, then choosing the runnable task that has received the least fair processor time.

The effective run duration for a task is based on its weight relative to the total weight of all runnable tasks:

  • A lower nice value gives the task a larger weight and a larger processor share.
  • A higher nice value gives the task a smaller weight and a smaller processor share. A task’s effective slice is proportional to task weight / total run weight.

Targeted Latency: CFS uses a scheduling window called targeted latency to approximate ‘infinitely small’ in ideal multitasking. Smaller targeted latency:

  • improves interactivity and fairness approximation.
  • at the expense of context-switch overhead and
  • can reduce throughput.

For example, with a targeted latency of 20ms:

  • Two equal-priority runnable tasks each run for about 10ms.
  • Twenty equal-priority runnable tasks would each approach 1ms.

Minimum Granularity: As the number of runnable tasks grows, each task’s calculated share approaches zero. CFS prevents excessive context switching by enforcing a minimum granularity, which acts as a floor on the calculated slice. The default is 1ms value.

Nice Value: CFS uses nice values as geometric weights, not additive millisecond values. The absolute nice value is less important than the relative difference between runnable tasks.

CFS is called fair because each task receives a fair proportion of processor time. It is not perfectly fair because it only approximates the ideal model, but it bounds unfairness by controlling scheduling latency among runnable tasks.

Wakeups:

CFS handles interactive tasks through fairness instead of special interactive heuristics. When a sleeping task wakes up, CFS compares how much processor time it has received relative to the currently running task and the rest of the runnable set. If the woken task has received less than its fair share, it can preempt the current task.

CFS Implementation

CFS must meticulously track process runtimes to guarantee each receives its fair proportion. Specially, four components of CFS:

  • Time Accounting
  • Process Selection
  • The Scheduler Entry Point
  • Sleeping and Waking Up

Time Accounting

All schedulers must account for how long a task runs: traditional Unix schedulers often decremented a task’s timeslice on every timer tick, while CFS tracks precise runtime and converts it into weighted fair time.

The accounting is stored in struct sched_entity, which is embedded in struct task_struct.

Field or conceptPurpose
loadStores the task’s scheduling weight derived from its nice value.
run_nodeLinks the task into the CFS red-black tree while it is runnable.
group_nodeLinks the entity into scheduling group structures.
on_rqRecords whether the entity is currently on a runqueue.
exec_startTracks when the entity most recently started executing.
sum_exec_runtimeAccumulates the task’s total actual runtime.
vruntimeStores the task’s virtual runtime, which is actual runtime normalized by weight.
prev_sum_exec_runtimeStores the previous total runtime value used for runtime deltas.
CONFIG_SCHEDSTATS fieldsOptional statistics such as wakeup timing, overlap, migrations, and start runtime.

Virtual Runtime (vruntime): CFS uses vruntime to compare how much fair processor time each runnable task has received.

  • It is measured in nanoseconds, so it is not tied to the system timer tick.
  • It is weighted by task priority, so higher-priority tasks accumulate vruntime more slowly than lower-priority tasks.
  • On an ideal perfectly fair processor, tasks at the same priority would have identical vruntime values.

Updating Accounting: update_curr() calculates delta_exec, which is the actual execution time since the last accounting update. It then passes that value to __update_curr(), which applies load weighting and adds the weighted result to vruntime.

The scheduler continuously compares vruntime values so it can choose the runnable task that has received the least fair processor time.

The Red-Black Tree

CFS stores runnable tasks in a red-black tree, called an rbtree in Linux, where nodes are keyed by vruntime so the scheduler can efficiently find the task with the smallest vruntime.

  • Task Selection: The __pick_next_entity() function executes process selection. It does not traverse the tree; instead, it immediately accesses the cached leftmost node (rb_leftmost), executing in time. If the tree is empty, CFS schedules the idle task.
  • Enqueuing: When tasks wake up or first created via fork(), enqueue_entity() inserts nodes from the rbtree and aggressively update the rb_leftmost cache variable to ensure structural integrity.
  • Dequeuing: When a process blocks or terminates, dequeue_entity() removes nodes from the rbtree and aggressively update the rb_leftmost cache variable to ensure structural integrity.

Scheduler Entry Point

The schedule() (kernel/sched.c) function serves as the generic main entry point into the process scheduler.

  • The schedule() calls pick_next_task(), which iterates through all scheduler classes in strict priority order. The highest-priority class with a runnable process invokes its specific selection algorithm (e.g., CFS’s __pick_next_entity()).
  • If all runnable tasks belong to CFS, the scheduler skips the full scheduler-class scan and quickly selects the next CFS task.

Sleep and Wake Up

A task sleeps when it is waiting for an event, such as a timer expiration, file I/O data, a lock, or a hardware event.

Sleeping: A task marks itself nonrunnable TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, removes itself from the rbtree, and waits on a wait queue.

Wait queues: A wait queue (wait_queue_head_t) is a list of tasks waiting for a specific condition.

  • Wait queues are created statically with DECLARE_WAITQUEUE() or dynamically with init_waitqueue_head().
  • A task creates a wait queue entry with macro DEFINE_WAIT().
  • The task adds itself to the wait queue with add_wait_queue().
  • Code elsewhere must call wake_up() on the wait queue when the event occurs.

Typical sleep loop:

  1. The task calls prepare_to_wait() to set its state to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE and add itself to the wait queue if needed.
  2. The task checks the condition after joining the wait queue but before calling schedule(), which prevents it from missing an event that occurs just before it actually sleeps.
  3. If the condition is false, the task releases any locks that must not be held while sleeping, calls schedule() to yield the processor, and reacquires needed locks after wakeup.
  4. When the task wakes, it checks the condition again.
  5. If the condition is still false, the loop repeats.
  6. Once the condition is true, the task sets back to TASK_RUNNING and calls finish_wait() to remove itself from the wait queue.

The TASK_INTERRUPTIBLE sleep can end because of a signal rather than the target event. This is a spurious wakeup, so sleeping code must always recheck the condition and handle signals if needed.

Waking up: The code that causes the event usually calls wake_up() on the wait queue. For example, when disk data arrives, the VFS can call wake_up() on the queue containing tasks waiting for that data.

wake_up() wakes tasks on the wait queue by calling try_to_wake_up():

  • It sets the task state to TASK_RUNNING.
  • It calls enqueue_task() to add the task back to the rbtree.
  • It sets need_resched if the awakened task should preempt the currently running task.

The state transition flow can be summarized as a task leaving the rbtree to sleep, waiting for an event, and returning when woken:

Context Switching

The context switching is the low-level transition from one runnable task to another and is handled by context_switch() defined in kernel/sched.c after schedule() selects the next task.

  • Calls switch_mm():
    • Switches the virtual memory mapping from the previous task to the next task.
    • Defined through architecture MMU context code such as <asm/mmu_context.h>.
  • Calls switch_to():
    • Switches the processor state, saving and restoring stack state, processor registers, and architecture-specific per-task state.
    • Defined through architecture switch code such as <asm/system.h> in older kernels.

The need_resched Flag: is a per-task flag that tells the kernel a reschedule is needed. Set by scheduler_tick()

  • when a process should be preempted, and
  • by try_to_wake_up() when a process that has a higher priority than the currently running process is awakened.
  • upon returning to user-space or from an interrupt, the kernel checks this flag and invokes the scheduler if it is set. The flag is stored per task instead of globally because the current task’s process descriptor is fast to access and likely cache-hot.

User Preemption

User preemption occurs when the kernel is about to return to user-space, finds need_resched set, and safely calls the scheduler before resuming user execution.

User preemption can occur:

  • When returning to user-space after a system call.
  • When returning to user-space after an interrupt handler.

These return paths are architecture-dependent and are usually implemented in assembly entry/exit code such as entry.S.

Kernel Preemption

Linux is a fully preemptive kernel, so a task running in kernel-space can be preempted at a safe point, but not while holding a lock because locks mark nonpreemptible regions around shared-data updates.

Preemption safety is tracked with preempt_count in each task low-level state:

  • It starts at 0.
  • It increments when the task enters a nonpreemptible region, such as acquiring a lock.
  • It decrements when the task leaves that region, such as releasing a lock.
  • If preempt_count is 0 and need_resched is set, kernel preemption is allowed.

Kernel preemption can occur:

  • When returning from an interrupt handler to kernel-space.
  • When a lock is released and preempt_count drops to 0.
  • When kernel code explicitly blocks by calling schedule().

Real-Time Scheduling

Linux manages real-time tasks via a specialized real-time scheduler class, bypassing CFS entirely. Linux real-time scheduling provides soft real-time behavior, attempting to meet deadlines without guaranteeing them.

  • SCHED_FIFO: A first-in, first-out algorithm devoid of timeslices. A runnable SCHED_FIFO task preempts any SCHED_NORMAL task and runs continuously until it blocks, explicitly yields, or is preempted by a higher-priority real-time task. Same-priority SCHED_FIFO tasks execute round-robin.
  • SCHED_RR: A round-robin real-time algorithm identical to SCHED_FIFO except it enforces a predetermined timeslice for tasks at the identical priority level.

The kernel does not calculate dynamic priorities for real-time tasks; higher-priority real-time tasks categorically preempt lower-priority tasks, and user-space manages scheduler parameters through dedicated system calls.

Linux provides a family of C-library wrapped system calls to explicitly manage scheduler parameters.

System callCategoryPurpose
nice()Priority configurationAdjusts the static priority of normal tasks.
sched_setparam()Priority configurationSets rt_priority parameters for real-time tasks.
sched_getparam()Priority configurationGets rt_priority parameters for real-time tasks.
sched_setscheduler()Policy assignmentSets a task’s scheduling policy, such as SCHED_NORMAL, SCHED_FIFO, or SCHED_RR.
sched_getscheduler()Policy assignmentGets a task’s current scheduling policy.
sched_setaffinity()Processor affinitySets the cpus_allowed bitmask so the task runs only on designated processors.
sched_getaffinity()Processor affinityGets the task’s current cpus_allowed bitmask.
sched_yield()YieldingTemporarily yields processor control; CFS tasks are preempted, while real-time tasks move to the back of their priority list.

Conclusion

This diagram shows the top-level process management flow in the Linux kernel, from user-visible actions through task state, scheduling, sleep/wakeup, context switching, and exit.



flowchart TB

  subgraph S0["0. User-visible entry points"]
    direction LR
    USER(["user-visible actions<br/>fork()<br/>exec()<br/>block<br/>exit / wait<br/>sched_* / nice"])
  end

  subgraph S1["1. Task object model"]
    direction TB
    TASK[("struct task_struct<br/>pid, tgid, comm<br/>__state, exit_state<br/>policy, prio, sched_class<br/>on_rq, on_cpu<br/>mm, files, fs, signal, sighand<br/>parent, children, sibling")]
    TI[("thread_info<br/>low-level task flags<br/>TIF_NEED_RESCHED")]
    SE[("struct sched_entity<br/>load weight<br/>run_node<br/>vruntime<br/>deadline / slice<br/>sum_exec_runtime")]
    STATE(["task state model<br/>TASK_RUNNING<br/>TASK_INTERRUPTIBLE<br/>TASK_UNINTERRUPTIBLE<br/>TASK_STOPPED / TASK_TRACED<br/>EXIT_ZOMBIE<br/>TASK_DEAD"])
  end

  subgraph S2["2. Process creation and exec"]
    direction TB
    CREATE[["fork / clone path<br/>kernel_clone<br/>copy_process<br/>dup_task_struct<br/>copy/share mm files fs signals<br/>sched_fork<br/>wake_up_new_task"]]
    CHILD(["newborn child<br/>task_struct exists<br/>not yet running<br/>becomes runnable after wake_up_new_task"])
    EXEC[["execve path<br/>same task_struct<br/>replace program image<br/>replace or rebuild address space"]]
  end

  subgraph S3["3. Scheduler architecture"]
    direction TB
    RQ[("per-CPU struct rq<br/>curr: running task<br/>idle: idle task<br/>cfs: fair runqueue<br/>rt: RT runqueue<br/>dl: deadline runqueue")]
    CORE[["scheduler core<br/>schedule<br/>__schedule<br/>pick_next_task<br/>class order:<br/>deadline → RT → fair → idle"]]
    NEED[["reschedule request<br/>scheduler_tick or wakeup_preempt<br/>resched_curr<br/>set TIF_NEED_RESCHED<br/>schedule at safe point"]]
  end

  subgraph S4["4. CFS scheduler details"]
    direction TB
    CFSRQ[("struct cfs_rq<br/>fair runqueue state<br/>current sched_entity<br/>vruntime accounting<br/>runnable load")]
    TREE[("tasks_timeline rb-tree<br/>stores runnable sched_entity nodes<br/>uses run_node<br/>fair / EEVDF ordering data")]
    CFSOPS[["fair scheduler operations<br/>enqueue_task_fair / enqueue_entity<br/>dequeue_task_fair / dequeue_entity<br/>pick_next_task_fair / pick_eevdf<br/>update_curr"]]
  end

  subgraph S5["5. Sleep and wakeup"]
    direction TB
    SLEEP(["sleeping<br/>task is not runnable<br/>waiting for condition/event"])
    WAITQ[("wait queue<br/>wait_queue_head: lock + list<br/>wait_queue_entry: task/private + wake function")]
    SLEEPPATH[["sleep path<br/>set_current_state<br/>prepare_to_wait<br/>condition check<br/>schedule"]]
    WAKEPATH[["wake path<br/>wake_up / __wake_up_common<br/>try_to_wake_up<br/>activate_task<br/>enqueue_task"]]
  end

  subgraph S6["6. Context switch"]
    direction TB
    SWITCH[["context_switch(prev, next)<br/>switch_mm: address space<br/>switch_to: CPU registers / stack<br/>finish_task_switch<br/>rq-&gt;curr = next"]]
    RUNNING(["actually running<br/>selected task owns CPU<br/>current points to it"])
  end

  subgraph S7["7. Exit and reaping"]
    direction TB
    EXITING(["exiting / zombie<br/>task stopped executing<br/>parent may still need status"])
    EXITPATH[["exit path<br/>do_exit<br/>exit_mm / exit_files / exit_fs<br/>exit_notify<br/>set EXIT_ZOMBIE"]]
    REAPPATH[["reap path<br/>wait4 / waitpid<br/>release_task<br/>put_task_struct<br/>final release"]]
  end

  subgraph LIFE["8. Lifecycle strip"]
    direction LR
    L1(["created"])
    L2(["runnable"])
    L3(["running"])
    L4(["sleeping"])
    L5(["woken"])
    L6(["zombie"])
    L7(["dead"])
  end

  TASK -- contains or reaches --> TI
  TASK -- embeds fair scheduling state --> SE
  TASK -- stores state in __state / exit_state --> STATE

  USER -- fork / clone request --> CREATE
  CREATE -- allocates and initializes --> TASK
  CREATE -- produces --> CHILD
  CHILD -- first runnable transition updates --> STATE

  USER -- execve request --> EXEC
  EXEC -- mutates existing task --> TASK

  STATE -- TASK_RUNNING tasks may be queued on --> RQ
  RQ -- is searched by --> CORE
  CORE -- may be entered because of --> NEED
  NEED -- sets flag inside --> TI

  RQ -- contains fair runqueue --> CFSRQ
  SE -- run_node is stored in --> TREE
  CFSRQ -- owns --> TREE
  CORE -- delegates normal tasks to --> CFSOPS
  CFSOPS -- updates / queries --> CFSRQ
  CFSOPS -- returns selected normal task to --> CORE

  USER -- blocking syscall begins --> SLEEPPATH
  SLEEPPATH -- sets task state to --> SLEEP
  SLEEPPATH -- links waiter into --> WAITQ
  SLEEPPATH -- enters scheduler through --> CORE

  WAITQ -- event notification walks --> WAKEPATH
  WAKEPATH -- sets TASK_RUNNING and queues task on --> RQ
  WAKEPATH -- updates task state to --> STATE

  CORE -- selects next task and calls --> SWITCH
  SWITCH -- updates curr field in --> RQ
  SWITCH -- makes next task --> RUNNING
  RUNNING -- is represented by --> TASK

  USER -- exit or fatal signal begins --> EXITPATH
  EXITPATH -- releases resources from --> TASK
  EXITPATH -- produces --> EXITING
  EXITING -- parent wait consumes status through --> REAPPATH
  REAPPATH -- final reference releases --> TASK
  REAPPATH -- updates state to TASK_DEAD in --> STATE

  L1 -- wake_up_new_task --> L2
  L2 -- pick_next_task --> L3
  L3 -- set_current_state + schedule --> L4
  L4 -- try_to_wake_up --> L5
  L5 -- enqueue_task --> L2
  L3 -- do_exit --> L6
  L6 -- wait + release_task --> L7

  CHILD -- corresponds to --> L1
  RQ -- contains runnable tasks for --> L2
  RUNNING -- corresponds to --> L3
  SLEEP -- corresponds to --> L4
  WAKEPATH -- corresponds to --> L5
  EXITING -- corresponds to --> L6
  REAPPATH -- corresponds to --> L7

  classDef concept fill:#dbeafe,stroke:#2563eb,stroke-width:2px,color:#0f172a
  classDef data fill:#fef3c7,stroke:#d97706,stroke-width:2px,color:#0f172a
  classDef proc fill:#dcfce7,stroke:#16a34a,stroke-width:2px,color:#0f172a

  class USER,STATE,CHILD,SLEEP,RUNNING,EXITING,L1,L2,L3,L4,L5,L6,L7 concept
  class TASK,TI,SE,RQ,CFSRQ,TREE,WAITQ data
  class CREATE,EXEC,CORE,NEED,CFSOPS,SLEEPPATH,WAKEPATH,SWITCH,EXITPATH,REAPPATH proc

  style S0 fill:#eff6ff,stroke:#93c5fd,stroke-width:2px
  style S1 fill:#fffbeb,stroke:#f59e0b,stroke-width:2px
  style S2 fill:#f0fdf4,stroke:#86efac,stroke-width:2px
  style S3 fill:#f5f3ff,stroke:#c4b5fd,stroke-width:2px
  style S4 fill:#faf5ff,stroke:#c084fc,stroke-width:2px
  style S5 fill:#f0f9ff,stroke:#7dd3fc,stroke-width:2px
  style S6 fill:#fefce8,stroke:#fde047,stroke-width:2px
  style S7 fill:#fff1f2,stroke:#fb7185,stroke-width:2px
  style LIFE fill:#f8fafc,stroke:#94a3b8,stroke-width:2px