Threading
Binaries, Processes, and Threads
- Binaries: Dormant compiled programs residing on storage, ready for execution.
- Processes: The operating system abstraction of a running binary. Contains virtualized memory, kernel resources, an associated user, and one or more threads.
- Threads: The smallest schedulable unit of execution. Consists of a virtualized processor, a stack, and program state.
- Resource Virtualization Split:
- Memory: Virtualized per-process. All threads within a single process share the exact same memory address space.
- Processor: Virtualized per-thread. Each thread acts as an independent execution entity.
- Threaded Architectures: A single-threaded process contains exactly one unit of execution. A multithreaded process contains multiple units of execution operating concurrently within the same memory space.
The separation of processor virtualization from memory virtualization establishes the fundamental architecture of threads, driving the distinct advantages and costs of adopting a multithreaded design.
Motivations and Costs of Multithreading
- Parallelism: Distributes units of execution across multiple processors simultaneously to increase system throughput.
- Responsiveness: Delegates long-running or blocking operations to background worker threads, allowing the primary thread to respond to user input.
- Blocking I/O Isolation: Prevents a blocking I/O operation in one thread from halting the entire process.
- Context Switching Efficiency: Intraprocess thread switching is significantly faster than interprocess switching. Thread switching preserves the virtual address space, avoiding costly Translation Lookaside Buffer (TLB) and processor cache flushes.
- Memory Savings: Shares data naturally without the overhead of interprocess communication (IPC) or duplicate memory allocations.
- Costs: Shared memory introduces severe concurrency bugs. Designing and debugging synchronized access to shared data is highly complex.
- Alternatives: Event-driven architectures using multiplexed I/O, asynchronous I/O, or state machines can achieve responsiveness and I/O isolation without thread synchronization overhead.
The method by which an operating system implements threads determines how effectively these performance benefits are realized at the hardware level.
Threading Models
- 1:1 Threading (Kernel-level): Provides a one-to-one mapping between a user-space thread and a native kernel thread. The kernel natively schedules each thread, enabling true parallelism. Linux implements this model via the `clone()` system call.
- N:1 Threading (User-level): Maps N user threads to a single kernel process. User-space code (a user-level scheduler) manages thread switching without kernel involvement. Yields near-zero context switch costs but cannot utilize multiple processors for true parallelism.
- N:M Threading (Hybrid): Maps N user threads onto M kernel threads. Attempts to combine the cheap context switching of N:1 with the parallelism of 1:1. Extremely complex to implement. Historically optimized using Scheduler Activations (e.g., in NetBSD/FreeBSD) before being largely abandoned for 1:1 models.
- Coroutines and Fibers: Highly lightweight, user-space constructs scheduled cooperatively rather than preemptively. Execution requires explicit yields. Primarily used for controlling program flow rather than hardware concurrency.
Once the underlying threading model is established by the operating system, applications must implement a structural pattern to assign workloads to these execution units.
Threading Patterns
- Thread-per-Connection:
- Assigns a dedicated thread to a single unit of work (e.g., a network connection) for the duration of its execution.
- Executes using a “run until completion” paradigm.
- Permits synchronous, blocking I/O because a blocked thread only stalls its specific connection.
- Constrained by scalability limits: allocating separate user and kernel stacks for thousands of threads degrades performance.
- Event-Driven Threading:
- Decouples waiting from execution to maximize efficiency.
- Utilizes asynchronous I/O, callbacks, and an event loop via multiplexed I/O.
- Operates efficiently with a small thread pool (typically one thread per processor) to achieve true parallelism without the overhead of massive thread counts.
Regardless of the selected structural pattern, the concurrent execution of multiple threads sharing a single address space introduces severe data integrity risks.
Concurrency, Parallelism, and Race Conditions
- Concurrency: The execution of multiple threads over overlapping time periods. Forward progress is made, but execution is not necessarily simultaneous (e.g., multitasking on a single processor).
- Parallelism: A specific subset of concurrency where threads execute simultaneously on multiple physical processors.
- Race Conditions: Erroneous program behavior caused by the unsynchronized, concurrent access of shared resources by multiple threads. The outcome depends unpredictably on the exact order of thread execution.
- Data Races: The most common race condition, occurring when threads simultaneously read and write shared memory.
- Critical Regions: The specific window of code that accesses shared resources. Correct behavior requires that execution through this region does not interleave with other threads.
- Atomicity: An operation is atomic if it is indivisible and appears instantaneous to the rest of the system. Critical regions must be made atomic to prevent races. Even a simple C expression like `i++` is non-atomic (requiring distinct load, add, and store machine instructions) and will fail under concurrent execution.
To enforce atomicity within critical regions and eliminate race conditions, developers must implement strict synchronization mechanisms.
Synchronization
- Mutexes (Locks): The primary mechanism for enforcing mutual exclusion. A mutex guarantees that only one thread can execute within a critical region at a time. Threads attempting to acquire a held mutex will block until it is released.
- Lock Data, Not Code: Synchronization must be associated with the shared data structures themselves, not the functions operating on them. Accessing the shared data mandates holding its specific lock.
- Deadlocks: A fatal state where two or more threads block indefinitely, each waiting for a mutex held by the other.
- ABBA Deadlock (Deadly Embrace): Thread 1 acquires Mutex A, Thread 2 acquires Mutex B. Thread 1 blocks waiting for B, and Thread 2 blocks waiting for A.
- Prevention: Define and strictly enforce a hierarchical order for mutex acquisition.
- Priority Inversion: A scheduling failure where a high-priority thread blocks waiting for a mutex held by a low-priority thread. If a medium-priority thread preempts the low-priority thread, the high-priority thread is indirectly starved by the medium-priority thread.
- Priority Inheritance: Solves priority inversion by temporarily elevating the priority of the lock-holding thread to match the highest-priority thread waiting for the lock.
Linux supplies these exact synchronization primitives and threading capabilities through a standardized library implementation.
Pthreads (POSIX Threads)
- API Standard: IEEE Std 1003.1c-1995 (POSIX.1c) defines the standard Unix threading API, utilizing the `pthread_` prefix.
- Linux Implementations:
- LinuxThreads: The legacy 1:1 implementation. Scaled poorly and exhibited nonconformance due to reliance on signals for inter-thread communication and a lack of kernel support.
- Native POSIX Thread Library (NPTL): The modern standard. Resolves conformance issues and scales to thousands of threads. Leverages Linux 2.6 kernel additions like the `futex()` system call (for fast user-space locking), `exit_group()`, and kernel thread-local storage (TLS).
- Compilation: Linking Pthreads requires the `-pthread` GCC flag, which also sets necessary preprocessor defines for thread safety.
The Pthreads API provides a comprehensive suite of functions for managing the lifecycle of these native threads.
Thread Management
- Thread Creation: `pthread_create()` initializes a new thread.
  - Executes a designated `start_routine` accepting and returning a `void *`.
  - The new thread shares the address space, signal handlers, and open files of the creator.
- Thread IDs (TID): Represented by the opaque `pthread_t` type.
  - Assigned by the Pthread library, distinct from the kernel's internal PID.
  - `pthread_self()` retrieves the calling thread's TID.
  - `pthread_equal()` compares two TIDs for equality (required since `pthread_t` is not guaranteed to be an arithmetic type).
- Thread Termination:
  - Returning natively from the `start_routine`.
  - `pthread_exit()`: Terminates the calling thread and passes a return value to any joining thread.
  - Process-wide termination occurs if `exit()`, `execve()`, or a return from `main()` is executed by any thread.
- Thread Cancellation: `pthread_cancel()` requests the termination of a peer thread. Execution depends on:
  - Cancellation State: Enabled (default) or Disabled. Managed via `pthread_setcancelstate()`.
  - Cancellation Type: Asynchronous (terminates immediately, highly dangerous if holding locks) or Deferred (terminates only at safe, POSIX-defined cancellation points). Managed via `pthread_setcanceltype()`.
- Joining and Detaching:
  - `pthread_join()`: Blocks the caller until the target thread terminates. Synchronizes execution and retrieves the target's exit value.
  - `pthread_detach()`: Marks a thread to automatically release its system resources upon termination without requiring a join.
With threads actively created, running, and joining, the Pthread API provides the requisite synchronization objects to protect their shared memory.
Pthread Mutexes
- Initialization: Mutexes are defined by the opaque `pthread_mutex_t` type. Statically allocated mutexes are initialized using `PTHREAD_MUTEX_INITIALIZER`.
- Locking: `pthread_mutex_lock()` blocks the caller until the mutex is acquired.
- Unlocking: `pthread_mutex_unlock()` releases the mutex, allowing a thread blocked on the lock to acquire it.
- Scoped Locks (RAII): In C++, Resource Acquisition Is Initialization is heavily used to manage Pthread mutexes safely. A wrapper object acquires the lock in its constructor and releases it via `pthread_mutex_unlock()` in its destructor, guaranteeing lock release even if exceptions are thrown or functions exit early.