Concurrency
Introduction to Concurrency
A multi-threaded process contains multiple points of execution, functioning similarly to a traditional single-threaded process but sharing the same address space.
- Thread State: Each thread maintains its own program counter (PC) and private set of registers for computation.
- Thread Control Block (TCB): Context switches between threads require saving and restoring register states into one or more TCBs, though the address space remains identical.
- Thread-Local Storage: Multi-threaded address spaces contain multiple stacks (one per thread) to hold local variables, parameters, and return values.
- Concurrency Hazards:
- Race Condition: Arises when multiple threads enter a critical section simultaneously, causing output to depend on exact instruction execution timing.
- Indeterminate State: Output varies across runs due to uncontrollable context switches.
- Critical Section: A segment of code accessing a shared resource or variable that must not be concurrently executed by multiple threads.
- Mutual Exclusion: A property guaranteeing that if one thread executes within a critical section, all other threads are prevented from doing so.
- Atomicity: Executing a series of actions as a single, uninterruptible unit (“all or nothing”).
To execute critical sections atomically and enforce mutual exclusion, the system must provide specialized synchronization primitives.
Thread API
Multi-threaded programming requires an API for thread creation, completion, and synchronization.
- Thread Creation: `pthread_create` spawns a new thread that executes a specified function pointer. It uses `void *` pointers to pass arbitrary argument types and return arbitrary results.
- Thread Completion: `pthread_join` forces the calling thread to wait until a specified thread terminates.
- Memory Hazards: Threads must never return pointers referencing locally allocated variables on their call stacks, as the stack is deallocated upon function return.
- Mutual Exclusion (Locks): `pthread_mutex_lock` and `pthread_mutex_unlock` wrap critical sections. Locks must be explicitly initialized, either dynamically via `pthread_mutex_init` or statically via `PTHREAD_MUTEX_INITIALIZER`.
- Condition Variables: `pthread_cond_wait` and `pthread_cond_signal` facilitate signaling between threads.
The most fundamental synchronization primitive within this API is the lock, which enforces the required mutual exclusion around shared memory updates.
Locks
A lock is a variable storing a state (available/free or acquired/held) that restricts critical section access to a single owner thread.
- Evaluation Criteria:
- Correctness: Does the lock successfully provide mutual exclusion?
- Fairness: Do threads have equal opportunity to acquire the lock, or does contention cause starvation?
- Performance: What is the CPU time overhead under no contention, single-CPU contention, and multi-CPU contention?
- Interrupt Disabling: The earliest mutual exclusion mechanism involved turning off hardware interrupts during critical sections.
- Drawbacks: Requires trusting user applications with privileged operations, fails entirely on multiprocessor systems, risks losing critical hardware interrupts, and executes slowly.
- Hardware Synchronization Primitives:
- Test-and-Set (Atomic Exchange): Atomically returns the old value at a memory location while updating it to a new value. Used to build spin locks: a thread spins in a `while` loop until the returned old value indicates the lock was free (0).
- Compare-and-Swap (CAS): Atomically updates a memory location to a new value only if the current value matches an expected value.
- Load-Linked and Store-Conditional (LL/SC): `Load-Linked` fetches a value; `Store-Conditional` updates the value only if no intervening stores to that address have occurred, returning 1 on success and 0 on failure.
- Fetch-and-Add: Atomically increments a value and returns the old value. Used to build ticket locks, which assign a ticket number to each thread to guarantee fairness and prevent starvation.
- OS Support to Prevent Spinning: Pure spin locks waste entire CPU time slices when contending for a locked resource.
- Yielding: A thread failing to acquire a lock calls `yield()` to proactively move itself from the running state to the ready state, surrendering the CPU.
- Queues and Sleeping: To prevent starvation and control lock acquisition order, threads add themselves to a queue and sleep (e.g., Solaris `park()`) when a lock is held. The releasing thread wakes the next queued thread (e.g., `unpark()`).
- Futex: Linux provides `futex_wait` to sleep if a memory value matches an expected state, and `futex_wake` to rouse sleeping threads.
- Two-Phase Locks: A hybrid approach where a thread spins for a fixed duration hoping the lock is released quickly, then transitions to sleeping if the lock remains held.
With robust locks established, these mechanisms can be applied to standard data structures to ensure thread-safe concurrent access and scalable performance.
Lock-Based Concurrent Data Structures
Adding locks to data structures makes them thread-safe, but naive locking limits concurrent performance.
- Concurrent Counters:
- Precise Counter: Uses a single lock for all increments. Scales poorly across multiple CPUs.
- Sloppy Counter: Uses per-CPU local counters with local locks, plus a single global counter with a global lock. Threads increment their local counters to eliminate contention. When a local counter reaches a threshold S, its value is transferred to the global counter. A high S yields near-perfect scaling but sacrifices global count accuracy.
- Concurrent Linked Lists:
- Single Lock: Acquires one lock over critical list manipulations. Requires careful control flow to release the lock on error paths (e.g., `malloc` failure).
- Hand-over-Hand Locking: Associates one lock per node. Traversing threads acquire the next node's lock before releasing the current node's lock. Enables high concurrency, but the overhead of continuous lock acquisition typically degrades absolute performance.
- Concurrent Queues: Uses independent locks for the head and the tail to allow concurrent enqueues and dequeues. Requires a dummy node allocated at queue initialization to securely separate head and tail operations.
- Concurrent Hash Tables: Uses an array of linked lists (buckets), allocating one lock per hash bucket rather than a single lock for the entire structure. Scales excellently across multiple threads.
While locks successfully manage mutual exclusion for data structures, threads frequently must halt execution and wait for specific application states to change, requiring a distinct synchronization construct.
Condition Variables
Condition variables manage explicit queues of threads waiting for a specific program state to become true.
- Routines:
- `wait(cond, mutex)`: Assumes the mutex is currently held. Atomically releases the lock and puts the calling thread to sleep. Upon waking, the thread re-acquires the lock before returning.
- `signal(cond)`: Wakes a single thread sleeping on the condition.
- Mesa vs. Hoare Semantics: Most operating systems use Mesa semantics, where waking a thread is merely a hint that the state changed. Because another thread could run and alter the state before the woken thread does, the condition must always be re-checked inside a `while` loop, not an `if` statement. `while` loops also safely handle spurious wakeups.
- Producer/Consumer (Bounded Buffer) Problem:
- Producers generate data for a fixed-size buffer; consumers extract it.
- Requires a `mutex` lock around buffer updates.
- Must use two distinct condition variables (`empty` and `fill`). With a single condition variable, a consumer can improperly wake another consumer (instead of a producer), leading to all threads sleeping indefinitely.
- Covering Conditions: When a single event can satisfy the waiting conditions of multiple threads with disparate needs (e.g., a memory allocator freeing a block of memory), `pthread_cond_broadcast()` is used to wake all waiting threads. This conservatively guarantees the correct thread proceeds, at the cost of unnecessary context switches.
Locks and condition variables serve distinct but complementary roles, and they can be unified into a single, historically significant synchronization abstraction known as a semaphore.
Semaphores
A semaphore is a stateful object containing an integer value manipulated by two routines: sem_wait() and sem_post().
- Operations:
- `wait()`: Decrements the semaphore's value by one. If the value becomes negative, the calling thread sleeps.
- `post()`: Increments the semaphore's value by one. If threads are waiting, it wakes one up.
- Invariant: When a semaphore's value is negative, its absolute value equals the number of queued, sleeping threads.
- Binary Semaphores (Locks): Initializing a semaphore to 1 allows it to function as a mutual exclusion lock. The first thread decrements it to 0 and proceeds; subsequent threads decrement it to negative values and sleep.
- Semaphores as Condition Variables: Initializing a semaphore to 0 allows a thread to wait for another thread to complete a task (e.g., a parent waiting for a child). The parent calls `wait()` and sleeps; the child finishes, calls `post()`, and wakes the parent.
- Producer/Consumer with Semaphores: Requires three semaphores: `empty` (initialized to the buffer size), `full` (initialized to 0), and `mutex` (initialized to 1). The `mutex` must strictly surround only the buffer insertion/extraction code. If a thread acquires the `mutex` before waiting on the `empty` or `full` semaphores, deadlock will occur.
- Reader-Writer Locks: Facilitates concurrent read operations but exclusive write operations. The first reading thread acquires the write lock on behalf of all readers; the last reading thread releases it. This design can lead to writer starvation.
- Dining Philosophers Problem: Five philosophers compete for five shared forks. If every philosopher grabs the left fork first, a cyclic dependency causes deadlock. Breaking the dependency—such as having the last philosopher acquire the right fork before the left fork—prevents deadlock.
- Zemaphores: Semaphores can be manually implemented using a lock, a condition variable, and a state integer. Conversely, implementing condition variables using only semaphores is highly error-prone.
Designing systems with locks, condition variables, and semaphores frequently introduces logic and dependency errors, necessitating formal classification and prevention of concurrency bugs.
Common Concurrency Problems
Concurrency bugs are common in mature code bases and broadly divide into non-deadlock and deadlock categories.
- Non-Deadlock Bugs (97% of occurrences):
- Atomicity Violations: A code region intended to execute atomically is interrupted, violating serializability (e.g., checking a pointer and then dereferencing it, while another thread nullifies it in between). Fix: Wrap the shared memory accesses in mutual exclusion locks.
- Order Violations: The strict chronological execution sequence required between two memory accesses is flipped (e.g., accessing an uninitialized variable because the initializing thread hasn’t run). Fix: Enforce ordering via condition variables or semaphores.
- Deadlock Bugs: Arise when threads hold locks while requesting additional locks held by other threads, creating a cycle. Deadlocks require four conditions to manifest:
- Mutual Exclusion: Threads claim exclusive control of resources.
- Hold-and-Wait: Threads hold allocated resources while waiting for new ones.
- No Preemption: Resources cannot be forcibly removed from holding threads.
- Circular Wait: A cyclical chain of threads exists where each requests a resource held by the next.
- Deadlock Prevention:
- Eliminate Circular Wait: Establish a total or partial ordering on lock acquisition (e.g., always acquire locks in memory address order).
- Eliminate Hold-and-Wait: Acquire all necessary locks simultaneously by first wrapping the acquisitions in a master global lock.
- Eliminate No Preemption: Use `trylock()`. If a lock is unavailable, the thread releases the locks it already holds and jumps back to the beginning to try again. This prevents deadlock but introduces the risk of livelock, where threads repeatedly loop and fail without making progress.
- Eliminate Mutual Exclusion: Use wait-free data structures constructed entirely from hardware atomic instructions (e.g., Compare-and-Swap).
- Deadlock Avoidance: Schedulers equipped with total prior knowledge of lock dependencies (e.g., Dijkstra’s Banker’s Algorithm) can statically restrict specific threads from running concurrently to guarantee cycles never form.
- Detection and Recovery: Periodically build resource graphs to detect cycles, recovering by killing processes or rebooting the system.
To sidestep the deadlocks, data races, and implicit complexities inherent in threaded synchronization, some systems adopt an entirely different architectural paradigm.
Event-Based Concurrency
Event-based concurrency bypasses threaded execution by structuring a server around a single-threaded event loop.
- The Event Loop: The application waits for events to occur and dispatches them sequentially to designated event handlers. Because execution is single-threaded, explicit locks are not required, eliminating traditional concurrency bugs.
- `select()` and `poll()` APIs: These system calls receive descriptor sets and report which file/socket descriptors are ready for reading or writing, allowing a server to monitor thousands of network connections simultaneously.
- The Blocking I/O Problem: Because the entire architecture relies on a single thread, invoking any blocking system call halts the entire application.
- Asynchronous I/O: To access disks without blocking, systems use asynchronous interfaces (e.g., `aio_read`). The application populates an I/O control block, issues the request, and continues executing. The OS reports I/O completion either via polling (e.g., `aio_error`) or through UNIX signals (software interrupts).
- State Management (Continuations): In threaded code, local variables are automatically preserved on the stack across blocking I/O. Event-based handlers must perform manual stack management by packaging required state into a continuation data structure (e.g., a hash table). When the asynchronous I/O completes, the subsequent handler extracts the continuation to resume the logical sequence.
- Limitations: Event-based applications cannot easily utilize multiple CPUs without reintroducing threads and locks. Additionally, implicit blocking from page faults remains unavoidable and can severely impact event loop throughput.