Process Scheduling

Multitasking Fundamentals

A multitasking operating system simultaneously interleaves the execution of multiple processes.

Preemptive Multitasking: The scheduler dictates when a process ceases running and another begins, involuntarily suspending the active process (preemption).
Timeslice: The predetermined duration a process runs before it is preempted.
Cooperative Multitasking: A process continues executing until it voluntarily suspends itself, an act known as yielding. To implement preemptive multitasking, a system requires defined policies to balance varying and often conflicting workload demands.

Scheduling Policy and Workload Classifications

Scheduling policy determines system behavior, overall feel, and optimal processor utilization. Processes classify into two primary behaviors:

I/O-Bound Processes: Spend the majority of time waiting on I/O requests, running for short durations before blocking. They require fast response times (low latency).
Processor-Bound Processes: Spend the majority of time executing code with infrequent blocking. They require longer timeslices to maximize system utilization (high throughput).

Linux optimizes for process response, implicitly favoring I/O-bound processes while employing creative mechanics to avoid neglecting processor-bound tasks.

Process Priority Priority-based scheduling algorithms rank processes by need, running higher-priority processes before lower-priority ones. Linux employs two disjoint priority spaces:

Nice Values: Range from $- 20$ to $+ 19$ , with a default of $0$ . Larger values indicate lower priority. In Linux, nice values control the proportion of timeslice allocated to a process, rather than an absolute millisecond value.
Real-Time Priority: Range from $0$ to $99$ . Higher values indicate higher priority. All real-time processes possess strict priority over normal processes operating in the nice value space.

Traditional absolute timeslice algorithms map nice values directly to milliseconds, which yields constant switching rates but variable fairness. This limitation prompts a paradigm shift to fairer, proportion-based distribution models.

The Completely Fair Scheduler (CFS)

The Completely Fair Scheduler (CFS) is the algorithm assigned to the SCHED_NORMAL scheduler class, operating on all non-real-time processes in Linux.

Perfect Multitasking Model: CFS models process scheduling against an ideal, perfectly multitasking processor. In this model, $n$ runnable processes run simultaneously, each receiving exactly $1/ n$ of the processor’s power.
Targeted Latency: CFS approximates perfect multitasking by assigning each process a dynamic timeslice proportional to its weight, dividing a global approximation window called the targeted latency.
Minimum Granularity: As $n \to \infty$ , the allocated proportion approaches zero, which would cause unacceptable context switching overhead. CFS enforces a minimum granularity (default 1 millisecond) as an absolute floor on any calculated timeslice.
Nice Value Weighting: Absolute nice values yield geometric differences in timeslices. A process’s processor proportion is derived purely from the relative difference in niceness between it and other currently runnable processes. To enforce this theoretical fair distribution, CFS relies on precise, tick-decoupled time accounting data structures.

CFS Implementation and Time Accounting

CFS must meticulously track process runtimes to guarantee each receives its fair proportion.

Scheduler Entity Structure: Process accounting is maintained in struct sched_entity, embedded within each struct task_struct.
Virtual Runtime (vruntime): Represents the actual execution time normalized (weighted) by the total number of runnable processes.
- It is measured in nanoseconds, decoupling it completely from the system timer tick.
- On an ideal perfect processor, the vruntime of all processes at the same priority would be identical.
Updating Accounting: The update_curr() function calculates delta_exec (execution time since the last update), passes it to __update_curr() to apply load weighting, and adds the result to vruntime. The scheduler must continuously evaluate vruntime across all tasks to dynamically select the most appropriate next process.

Process Selection and the Red-Black Tree

The core CFS algorithm selects the process with the smallest vruntime.

The Red-Black Tree (rbtree): CFS manages the list of all runnable processes using a self-balancing binary search tree. The tree nodes are keyed by the process’s vruntime.
Task Selection: The __pick_next_entity() function executes process selection. It does not traverse the tree; instead, it immediately accesses the cached leftmost node (rb_leftmost), executing in $O (1)$ time. If the tree is empty, CFS schedules the idle task.
Enqueuing/Dequeuing: When tasks wake up or block, enqueue_entity() and dequeue_entity() insert or remove nodes from the rbtree and aggressively update the rb_leftmost cache variable to ensure structural integrity. Process selection algorithms execute at a central entry point responsible for triggering context switches or managing sleep states.

Scheduler Entry Point and Sleep/Wake States

The schedule() function serves as the generic main entry point into the process scheduler.

Selecting the Next Task: schedule() calls pick_next_task(), which iterates through all scheduler classes in strict priority order. The highest-priority class with a runnable process invokes its specific selection algorithm (e.g., CFS’s __pick_next_entity()).
Sleeping (Blocking): Tasks waiting for specific events mark themselves as nonrunnable (TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE), remove themselves from the rbtree, and join a wait queue.
Wait Queue Mechanics: Wait queues (wait_queue_head_t) are lists of waiting processes.
- The task calls prepare_to_wait() to update its state and join the queue.
- It checks the target condition to prevent race conditions.
- It calls schedule() to hand over the processor.
Waking Up: The wake_up() function awakens tasks on a wait queue by invoking try_to_wake_up(). This marks the task TASK_RUNNING, inserts it back into the rbtree, and sets need_resched if the newly awakened task has a higher priority than the currently running task. When schedule() successfully identifies a new task, the system must seamlessly swap execution contexts.

Preemption and Context Switching

Context switching—the transition from one runnable task to another—is executed by context_switch(). It performs two primary routines:

switch_mm(): Switches the virtual memory mapping from the previous process to the new process.
switch_to(): Switches the processor state, saving and restoring stack information, processor registers, and architecture-specific states.

The need_resched Flag: A per-process flag indicating a reschedule is required. It is set by scheduler_tick() when a process exhausts its processor proportion, or by try_to_wake_up() when a higher-priority process awakens.

User Preemption: Occurs when the kernel prepares to return to user-space (from a system call or an interrupt) and detects that need_resched is set.
Kernel Preemption: Linux is a fully preemptive kernel, allowing task preemption at any safe point in kernel-space.
- Preemption safety is tracked via the preempt_count variable; if $0$ , no locks are held, and preemption is safe.
- Kernel preemption fires when exiting an interrupt handler to kernel-space, when a lock is released dropping preempt_count to zero, or upon an explicit blocking call to schedule(). Standard context switching relies on proportional fairness, but real-time systems demand absolute deterministic scheduling policies.

Real-Time Scheduling Policies

Linux manages real-time tasks via a specialized real-time scheduler class, bypassing CFS entirely. Linux real-time scheduling provides soft real-time behavior, attempting to meet deadlines without guaranteeing them.

SCHED_FIFO: A first-in, first-out algorithm devoid of timeslices. A runnable SCHED_FIFO task preempts any SCHED_NORMAL task and runs continuously until it blocks, explicitly yields, or is preempted by a higher-priority real-time task. Same-priority SCHED_FIFO tasks execute round-robin.
SCHED_RR: A round-robin real-time algorithm identical to SCHED_FIFO except it enforces a predetermined timeslice for tasks at the identical priority level.
Static Priorities: The kernel does not calculate dynamic priority values for real-time tasks; a higher-priority real-time task categorically preempts any lower-priority task. To interface with both standard and real-time scheduling mechanisms, user-space applications utilize dedicated system calls.

Linux provides a family of C-library wrapped system calls to explicitly manage scheduler parameters.

Priority Configuration: nice() increments the static priority of normal tasks. sched_setparam() and sched_getparam() manipulate the rt_priority parameters for real-time tasks.
Policy Assignment: sched_setscheduler() and sched_getscheduler() manipulate the scheduling policy (e.g., switching between SCHED_NORMAL, SCHED_FIFO, or SCHED_RR).
Processor Affinity: sched_setaffinity() and sched_getaffinity() manipulate the cpus_allowed bitmask. The kernel enforces this hard affinity, ensuring the task only executes on the designated processors via migration threads and load balancing.
Yielding: sched_yield() temporarily yields processor control. For CFS tasks, it preempts the process and inserts it into the expired array; for real-time tasks, it moves the task to the back of its specific priority list.

My Knowledge Base

Explorer

04 Process Scheduling

Process Scheduling

Multitasking Fundamentals

Scheduling Policy and Workload Classifications

The Completely Fair Scheduler (CFS)

CFS Implementation and Time Accounting

Process Selection and the Red-Black Tree

Scheduler Entry Point and Sleep/Wake States

Preemption and Context Switching

Real-Time Scheduling Policies

My Knowledge Base

Explorer

04 Process Scheduling

Process Scheduling

Multitasking Fundamentals

Scheduling Policy and Workload Classifications

The Completely Fair Scheduler (CFS)

CFS Implementation and Time Accounting

Process Selection and the Red-Black Tree

Scheduler Entry Point and Sleep/Wake States

Preemption and Context Switching

Real-Time Scheduling Policies

Scheduler-Related System Calls