Process Scheduling
Multitasking Fundamentals
A multitasking operating system simultaneously interleaves the execution of multiple processes.
- Preemptive Multitasking: The scheduler dictates when a process ceases running and another begins, involuntarily suspending the active process (preemption).
- Timeslice: The predetermined duration a process runs before it is preempted.
- Cooperative Multitasking: A process continues executing until it voluntarily suspends itself, an act known as yielding. To implement preemptive multitasking, a system requires defined policies to balance varying and often conflicting workload demands.
Scheduling Policy and Workload Classifications
Scheduling policy determines system behavior, overall feel, and optimal processor utilization. Processes classify into two primary behaviors:
- I/O-Bound Processes: Spend the majority of time waiting on I/O requests, running for short durations before blocking. They require fast response times (low latency).
- Processor-Bound Processes: Spend the majority of time executing code with infrequent blocking. They require longer timeslices to maximize system utilization (high throughput).
Linux optimizes for process response, implicitly favoring I/O-bound processes while employing creative mechanics to avoid neglecting processor-bound tasks.
Process Priority Priority-based scheduling algorithms rank processes by need, running higher-priority processes before lower-priority ones. Linux employs two disjoint priority spaces:
- Nice Values: Range from to , with a default of . Larger values indicate lower priority. In Linux, nice values control the proportion of timeslice allocated to a process, rather than an absolute millisecond value.
- Real-Time Priority: Range from to . Higher values indicate higher priority. All real-time processes possess strict priority over normal processes operating in the nice value space.
Traditional absolute timeslice algorithms map nice values directly to milliseconds, which yields constant switching rates but variable fairness. This limitation prompts a paradigm shift to fairer, proportion-based distribution models.
The Completely Fair Scheduler (CFS)
The Completely Fair Scheduler (CFS) is the algorithm assigned to the SCHED_NORMAL scheduler class, operating on all non-real-time processes in Linux.
- Perfect Multitasking Model: CFS models process scheduling against an ideal, perfectly multitasking processor. In this model, runnable processes run simultaneously, each receiving exactly of the processor’s power.
- Targeted Latency: CFS approximates perfect multitasking by assigning each process a dynamic timeslice proportional to its weight, dividing a global approximation window called the targeted latency.
- Minimum Granularity: As , the allocated proportion approaches zero, which would cause unacceptable context switching overhead. CFS enforces a minimum granularity (default 1 millisecond) as an absolute floor on any calculated timeslice.
- Nice Value Weighting: Absolute nice values yield geometric differences in timeslices. A process’s processor proportion is derived purely from the relative difference in niceness between it and other currently runnable processes. To enforce this theoretical fair distribution, CFS relies on precise, tick-decoupled time accounting data structures.
CFS Implementation and Time Accounting
CFS must meticulously track process runtimes to guarantee each receives its fair proportion.
- Scheduler Entity Structure: Process accounting is maintained in
struct sched_entity, embedded within eachstruct task_struct. - Virtual Runtime (
vruntime): Represents the actual execution time normalized (weighted) by the total number of runnable processes.- It is measured in nanoseconds, decoupling it completely from the system timer tick.
- On an ideal perfect processor, the
vruntimeof all processes at the same priority would be identical.
- Updating Accounting: The
update_curr()function calculatesdelta_exec(execution time since the last update), passes it to__update_curr()to apply load weighting, and adds the result tovruntime. The scheduler must continuously evaluatevruntimeacross all tasks to dynamically select the most appropriate next process.
Process Selection and the Red-Black Tree
The core CFS algorithm selects the process with the smallest vruntime.
- The Red-Black Tree (rbtree): CFS manages the list of all runnable processes using a self-balancing binary search tree. The tree nodes are keyed by the process’s
vruntime. - Task Selection: The
__pick_next_entity()function executes process selection. It does not traverse the tree; instead, it immediately accesses the cached leftmost node (rb_leftmost), executing in time. If the tree is empty, CFS schedules the idle task. - Enqueuing/Dequeuing: When tasks wake up or block,
enqueue_entity()anddequeue_entity()insert or remove nodes from the rbtree and aggressively update therb_leftmostcache variable to ensure structural integrity. Process selection algorithms execute at a central entry point responsible for triggering context switches or managing sleep states.
Scheduler Entry Point and Sleep/Wake States
The schedule() function serves as the generic main entry point into the process scheduler.
- Selecting the Next Task:
schedule()callspick_next_task(), which iterates through all scheduler classes in strict priority order. The highest-priority class with a runnable process invokes its specific selection algorithm (e.g., CFS’s__pick_next_entity()). - Sleeping (Blocking): Tasks waiting for specific events mark themselves as nonrunnable (
TASK_INTERRUPTIBLEorTASK_UNINTERRUPTIBLE), remove themselves from the rbtree, and join a wait queue. - Wait Queue Mechanics: Wait queues (
wait_queue_head_t) are lists of waiting processes.- The task calls
prepare_to_wait()to update its state and join the queue. - It checks the target condition to prevent race conditions.
- It calls
schedule()to hand over the processor.
- The task calls
- Waking Up: The
wake_up()function awakens tasks on a wait queue by invokingtry_to_wake_up(). This marks the taskTASK_RUNNING, inserts it back into the rbtree, and setsneed_reschedif the newly awakened task has a higher priority than the currently running task. Whenschedule()successfully identifies a new task, the system must seamlessly swap execution contexts.
Preemption and Context Switching
Context switching—the transition from one runnable task to another—is executed by context_switch(). It performs two primary routines:
switch_mm(): Switches the virtual memory mapping from the previous process to the new process.switch_to(): Switches the processor state, saving and restoring stack information, processor registers, and architecture-specific states.
The need_resched Flag: A per-process flag indicating a reschedule is required. It is set by scheduler_tick() when a process exhausts its processor proportion, or by try_to_wake_up() when a higher-priority process awakens.
- User Preemption: Occurs when the kernel prepares to return to user-space (from a system call or an interrupt) and detects that
need_reschedis set. - Kernel Preemption: Linux is a fully preemptive kernel, allowing task preemption at any safe point in kernel-space.
- Preemption safety is tracked via the
preempt_countvariable; if , no locks are held, and preemption is safe. - Kernel preemption fires when exiting an interrupt handler to kernel-space, when a lock is released dropping
preempt_countto zero, or upon an explicit blocking call toschedule(). Standard context switching relies on proportional fairness, but real-time systems demand absolute deterministic scheduling policies.
- Preemption safety is tracked via the
Real-Time Scheduling Policies
Linux manages real-time tasks via a specialized real-time scheduler class, bypassing CFS entirely. Linux real-time scheduling provides soft real-time behavior, attempting to meet deadlines without guaranteeing them.
SCHED_FIFO: A first-in, first-out algorithm devoid of timeslices. A runnableSCHED_FIFOtask preempts anySCHED_NORMALtask and runs continuously until it blocks, explicitly yields, or is preempted by a higher-priority real-time task. Same-prioritySCHED_FIFOtasks execute round-robin.SCHED_RR: A round-robin real-time algorithm identical toSCHED_FIFOexcept it enforces a predetermined timeslice for tasks at the identical priority level.- Static Priorities: The kernel does not calculate dynamic priority values for real-time tasks; a higher-priority real-time task categorically preempts any lower-priority task. To interface with both standard and real-time scheduling mechanisms, user-space applications utilize dedicated system calls.
Scheduler-Related System Calls
Linux provides a family of C-library wrapped system calls to explicitly manage scheduler parameters.
- Priority Configuration:
nice()increments the static priority of normal tasks.sched_setparam()andsched_getparam()manipulate thert_priorityparameters for real-time tasks. - Policy Assignment:
sched_setscheduler()andsched_getscheduler()manipulate the scheduling policy (e.g., switching betweenSCHED_NORMAL,SCHED_FIFO, orSCHED_RR). - Processor Affinity:
sched_setaffinity()andsched_getaffinity()manipulate thecpus_allowedbitmask. The kernel enforces this hard affinity, ensuring the task only executes on the designated processors via migration threads and load balancing. - Yielding:
sched_yield()temporarily yields processor control. For CFS tasks, it preempts the process and inserts it into the expired array; for real-time tasks, it moves the task to the back of its specific priority list.