The Process Abstraction
A process is the fundamental operating system abstraction of a running program. It consists of the machine state required for execution: memory (the address space where code, static data, heap, and stack reside), hardware registers (program counter, stack pointer, frame pointer), and I/O state such as open file descriptors.
- Process Creation: The operating system initializes a process by eagerly or lazily loading program code and static data from disk into memory. It allocates memory for the run-time stack (initializing it with parameters such as `argc` and `argv`) and sets aside space for the heap to manage dynamically allocated data. Finally, it initializes I/O file descriptors and jumps to the `main()` entry point.
- Process States: A process transitions between three primary states:
- Running: Actively executing instructions on the processor.
- Ready: Ready to execute, but the OS scheduler has currently assigned the CPU to another process.
- Blocked: Paused and waiting for an external event (e.g., an I/O request) to complete before it can resume execution.
- Process Control Block (PCB): The OS tracks processes in a process list; each entry, a PCB, stores the register context and other metadata needed to stop and restart a process. A process may also enter a zombie state upon exiting, allowing its parent to read the return code before the OS cleans up the data structures.
To expose these process lifecycle states and structures to developers, operating systems provide specialized system interfaces.
Process Control API
UNIX systems manage process creation and control through a uniquely separated set of system calls, providing extensive flexibility.
- `fork()`: Creates a new child process that is a nearly exact copy of the calling parent. The child possesses its own private memory and registers, but `fork()` returns `0` to the child and the new Process ID (PID) to the parent. The order of execution between parent and child is non-deterministic and depends on the CPU scheduler.
- `wait()`: Pauses the execution of the parent process until the child process finishes executing, enabling deterministic sequencing.
- `exec()`: Loads a new executable from disk, completely overwriting the current process’s code, data, heap, and stack segments. A successful `exec()` never returns to the original calling program.
- API Design Rationale: Separating process creation (`fork()`) from program execution (`exec()`) enables the shell to modify the execution environment between the two calls: intervening code can close standard output and open files or pipes to implement I/O redirection before the new program executes.
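The redirection idiom described above can be sketched with Python's thin wrappers over the UNIX system calls (the filename `out.txt` and the `echo` command are illustrative choices, not part of the source text):

```python
import os

pid = os.fork()
if pid == 0:
    # Child: between fork() and exec(), rewire standard output.
    fd = os.open("out.txt", os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    os.dup2(fd, 1)       # stdout (fd 1) now refers to out.txt
    os.close(fd)
    os.execvp("echo", ["echo", "hello"])  # never returns on success
else:
    # Parent: wait() gives deterministic sequencing.
    os.wait()
    print(open("out.txt").read().strip())  # prints: hello
```

Because the child inherits a copy of the parent's file-descriptor table, nothing in `echo` needs to know its output was redirected, which is exactly the flexibility the fork/exec split buys.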
In order to execute these newly spawned processes safely and efficiently, the operating system relies on hardware-assisted execution mechanisms.
Mechanism: Limited Direct Execution (LDE)
To virtualize the CPU efficiently, the OS runs programs directly on the hardware processor, but establishes limits to maintain control over the system.
- Restricted Operations: Hardware supports execution in two distinct privilege levels.
- User Mode: Applications run with restricted access to hardware resources. Attempting privileged operations (e.g., I/O requests) raises a hardware exception.
- Kernel Mode: The OS runs with full access to all machine resources.
- System Calls: Applications request privileged operations via a `trap` instruction, which simultaneously raises the privilege level to kernel mode and jumps to a pre-specified handler address. Upon completion, the OS executes a `return-from-trap` instruction to revert to user mode and resume the application.
- Trap Tables: At boot time, the OS operates in kernel mode and configures the hardware trap table, defining the exact memory locations of handlers for system calls, interrupts, and illegal operations.
- Regaining Control: The OS must be able to switch between running processes to implement time-sharing.
- Cooperative Approach: The OS waits for applications to yield control voluntarily via system calls or illegal operation traps.
- Non-Cooperative Approach: The OS initializes a hardware timer at boot. A timer interrupt fires periodically, halting the current process and forcing a trap into the OS, guaranteeing the OS regains control.
- Context Switching: Once control is regained, the scheduler decides whether to switch processes. The hardware implicitly saves the running process’s registers onto its kernel stack. The OS software then executes a context switch routine to explicitly save the kernel registers to the current process’s PCB, restore the kernel registers from the target process’s PCB, and switch the active stack pointer.
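The non-cooperative timer mechanism has a user-level analogy in POSIX interval timers. The sketch below (illustrative only, not kernel code) arms a periodic `SIGALRM`: a busy loop that never yields is still "interrupted" three times, just as a timer interrupt forcibly traps a running process into the OS:

```python
import signal

ticks = 0

def on_tick(signum, frame):
    # Plays the role of the OS's timer-interrupt handler.
    global ticks
    ticks += 1

signal.signal(signal.SIGALRM, on_tick)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)  # fire every 10 ms

while ticks < 3:   # the "process" never yields voluntarily,
    pass           # yet the handler still runs periodically

signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
```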
With the mechanisms to safely pause and resume execution established, the operating system must employ specific policies to determine which process runs next.
Basic Scheduling Policies
Schedulers resolve resource contention by applying algorithms evaluated against specific performance metrics.
- Scheduling Metrics:
- Turnaround Time: The time from a job’s arrival to its completion; the key performance metric.
- Response Time: The time from a job’s arrival to the first time it is scheduled; the key interactivity metric.
- First In, First Out (FIFO): Processes run to completion in arrival order. If a long-running job arrives before short jobs, FIFO suffers from the convoy effect, drastically inflating average turnaround time.
- Shortest Job First (SJF): Runs the shortest job first. While optimal for turnaround time if all jobs arrive simultaneously, it still suffers from convoy effects if short jobs arrive after a long job has already begun execution.
- Shortest Time-to-Completion First (STCF): Adds preemption to SJF. If a new job arrives with a shorter remaining time than the currently running job, the OS context-switches to the new job, optimizing turnaround time for varied arrivals.
- Round Robin (RR): Optimizes for response time by running a job for a fixed time slice (scheduling quantum) before preempting it to run the next job in the queue. RR effectively minimizes response time but performs pessimally for turnaround time. The time slice length presents a trade-off: short slices improve responsiveness but increase context-switching overhead, which must be amortized by making the slice long enough.
- I/O Overlap: To maximize CPU utilization, schedulers treat continuous CPU bursts as independent jobs. When a process blocks for an I/O request, the OS schedules another ready process to overlap CPU execution with the slow I/O operation.
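The convoy effect can be quantified with a small sketch. Assuming three jobs arriving at time 0, one long (length 100) and two short (length 10), average turnaround under FIFO versus SJF works out as follows:

```python
def avg_turnaround(lengths):
    """Run jobs to completion in the given order; arrivals are all at t=0."""
    t, total = 0, 0
    for n in lengths:
        t += n        # completion time of this job
        total += t    # turnaround = completion - arrival (arrival = 0)
    return total / len(lengths)

fifo = avg_turnaround([100, 10, 10])           # long job first
sjf = avg_turnaround(sorted([100, 10, 10]))    # shortest jobs first
print(fifo, sjf)  # -> 110.0 50.0
```

Merely reordering the same work cuts average turnaround from 110 to 50, which is why SJF is provably optimal when all jobs arrive together.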
Because basic scheduling policies rely on unrealistic assumptions like a priori knowledge of job lengths, modern operating systems employ history-based dynamic policies.
The Multi-Level Feedback Queue (MLFQ)
The MLFQ scheduler uses historical execution data to predict future behavior, simultaneously optimizing turnaround time for short jobs and response time for interactive jobs without knowing job lengths in advance.
- Structure: MLFQ consists of multiple distinct queues, each assigned a different priority level. Processes at higher priority levels run before those at lower levels.
- Rules of MLFQ:
- If Priority(A) > Priority(B), A runs (B does not).
- If Priority(A) == Priority(B), A and B run in Round Robin.
- When a job enters the system, it is placed at the highest priority (the topmost queue).
- Once a job uses up its time allotment at a given level (regardless of how many times it has given up the CPU), its priority is reduced (moves down one queue).
- After some time period, move all jobs in the system to the topmost queue.
- Vulnerabilities and Mitigations:
- Gaming the Scheduler: With naive accounting that resets a job’s allotment whenever it releases the CPU, a process could exploit MLFQ by issuing an I/O request just before its time slice expired, retaining high priority indefinitely. Rule 4’s accounting of total CPU time used per priority level defeats this.
- Starvation and Phase Changes: A continuous stream of interactive jobs can starve long-running CPU-bound jobs. Rule 5 (Priority Boost) ensures all jobs eventually receive CPU time and allows the scheduler to adapt if a CPU-bound job transitions to an interactive phase.
- Parameter Tuning: Higher priority queues typically utilize very short time slices for rapid interactive responses, while lower priority queues utilize longer time slices suited for batch processing.
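The five rules can be exercised in a minimal simulation. This sketch makes simplifying assumptions not in the source text: unit time slices, an allotment of one slice per level, no I/O, and illustrative job names and parameters:

```python
from collections import deque

def mlfq(jobs, levels=3, boost_every=50):
    queues = [deque() for _ in range(levels)]
    for name, length in jobs:
        queues[0].append([name, length])          # Rule 3: enter at the top
    finish, t = {}, 0
    while any(queues):
        if t and t % boost_every == 0:            # Rule 5: priority boost
            for q in queues[1:]:
                while q:
                    queues[0].append(q.popleft())
        lvl = next(i for i, q in enumerate(queues) if q)  # Rules 1-2
        job = queues[lvl].popleft()
        job[1] -= 1                               # run one time slice
        t += 1
        if job[1] == 0:
            finish[job[0]] = t
        else:                                     # Rule 4: allotment spent, demote
            queues[min(lvl + 1, levels - 1)].append(job)
    return finish

done = mlfq([("long", 40), ("short", 4)])
print(done)  # the short job finishes long before the CPU-bound job
```

Even without knowing job lengths in advance, the short job completes at t=8 while the long job runs to t=44: MLFQ approximates SJF from observed behavior alone.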
While multi-level queues optimize for turnaround and response time through heuristic observation, an alternative scheduling philosophy focuses strictly on guaranteeing exact proportional CPU allocation.
Proportional Share Scheduling
Proportional-share (or fair-share) schedulers guarantee that each process receives a specific percentage of CPU time rather than optimizing for turnaround or response time.
- Lottery Scheduling: Utilizes tickets to represent a process’s share of the CPU. The scheduler generates a random number, and the process holding the winning ticket executes.
- Advantages of Randomness: Random algorithms avoid cyclic edge cases, require minimal per-process accounting state, and are exceptionally fast.
- Ticket Currency: Users allocate tickets in a custom local currency, which the OS automatically converts into a global ticket currency.
- Ticket Transfer: A client process can temporarily transfer its tickets to a server process, accelerating the server’s execution while handling the client’s request.
- Ticket Inflation: In a trusted environment, a process can dynamically alter the number of tickets it owns to reflect changing resource demands.
- Stride Scheduling: A deterministic alternative to lottery scheduling. Each job computes a `stride` by dividing a large constant by its ticket count. The scheduler tracks each job’s progress with a per-job `pass` counter, selecting the process with the lowest pass value and incrementing that counter by the process’s stride. While exactly proportional even over short intervals, Stride Scheduling requires global state management when new jobs enter the system (a newcomer’s pass value must be set sensibly relative to existing jobs), an issue Lottery Scheduling inherently avoids.
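A stride scheduler fits in a few lines; the ticket assignments below (A:100, B:50, C:250) and the large constant are illustrative assumptions:

```python
import heapq

def stride_schedule(tickets, slices=80):
    big = 10000
    # Heap of (pass, name); all jobs start with pass 0.
    heap = [(0, name) for name in tickets]
    heapq.heapify(heap)
    counts = {name: 0 for name in tickets}
    for _ in range(slices):
        p, name = heapq.heappop(heap)     # lowest pass value runs next
        counts[name] += 1
        # Advance the job's pass by its stride = big / tickets.
        heapq.heappush(heap, (p + big // tickets[name], name))
    return counts

runs = stride_schedule({"A": 100, "B": 50, "C": 250})
print(runs)  # counts roughly proportional to tickets, A:B:C ≈ 2:1:5
```

Unlike the lottery, the allocation here is exact rather than probabilistic, but notice that a job added mid-run would need a sensible starting pass value or it would monopolize the CPU.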
As single-CPU scheduling principles matured, the proliferation of multicore hardware demanded new architectures to distribute these workloads concurrently.
Multiprocessor Scheduling (Advanced)
Adapting single-CPU schedulers to multicore environments requires mitigating hardware-level caching complexities and managing physical execution locations.
- Hardware Architecture Constraints:
- Cache Coherence: If a process modifies a value in CPU 1’s hardware cache, and the OS migrates the process to CPU 2, CPU 2 might read stale data from main memory. Hardware resolves this via bus snooping, where caches monitor memory buses to invalidate or update shared data.
- Synchronization: OS scheduling structures accessed across multiple CPUs must be protected by mutual exclusion primitives (locks) to ensure atomic updates and prevent data corruption.
- Cache Affinity: Processes build up extensive state in a CPU’s hardware cache and Translation Lookaside Buffer (TLB). A scheduler should prefer scheduling a process on the same CPU it previously executed on to avoid the performance penalty of reloading cache state.
- Single-Queue Multiprocessor Scheduling (SQMS): Reuses a single global queue for all CPUs. It inherently struggles with scalability due to lock contention on the global queue, and naturally violates cache affinity as processes randomly bounce between available CPUs.
- Multi-Queue Multiprocessor Scheduling (MQMS): Maintains a separate scheduling queue per CPU. MQMS inherently scales well and preserves cache affinity since jobs are restricted to specific queues. However, MQMS suffers from load imbalance if jobs exit unevenly or have varying durations.
- Work Stealing: MQMS resolves load imbalances via migration. A CPU with an underutilized queue occasionally peeks at a target queue; if the target queue is significantly fuller, the underutilized CPU “steals” jobs to balance the global load.
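A toy sketch of that stealing step, with illustrative queue contents and threshold: each "CPU" owns a queue, and an empty queue peeks at a random peer, stealing half its jobs if the peer is sufficiently full:

```python
import random
from collections import deque

def balance(queues, threshold=2):
    for i, q in enumerate(queues):
        if q:
            continue                        # only idle CPUs go stealing
        # Peek at one randomly chosen peer queue.
        victim = random.choice([j for j in range(len(queues)) if j != i])
        if len(queues[victim]) >= threshold:
            # Steal half of the victim's jobs to balance the load.
            for _ in range(len(queues[victim]) // 2):
                queues[i].append(queues[victim].popleft())

qs = [deque(), deque(["A", "B", "C", "D"])]
balance(qs)
print([list(q) for q in qs])  # CPU 0 now holds half of CPU 1's jobs
```

Checking only occasionally, and only when underutilized, keeps the cross-queue locking and migration costs from erasing the scalability MQMS was chosen for.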