Distributed Systems and File Systems
Distributed Communication Fundamentals
Network communication is fundamentally unreliable. Packets are routinely lost or corrupted due to bit errors, severed links, and buffer capacity exhaustion within network switches and endpoint routers.
To guarantee message delivery over unreliable datagram layers (e.g., UDP/IP), systems must implement reliable communication primitives:
- Acknowledgments (ACKs): The receiver transmits a short confirmation message upon receiving data.
- Timeout and Retry: The sender retains a copy of the transmitted message and initializes a timer. If the timer expires before an ACK is received, the sender assumes packet loss and re-transmits the message.
- Sequence Counters: If an ACK is lost, a retry will cause the receiver to process the same message twice. To enforce exactly-once semantics, senders attach a monotonically increasing sequence number to each message. Receivers track the highest ID seen so far, acknowledging duplicates without passing the payload to the application layer.
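The three mechanisms above can be sketched together in a few lines. This is a minimal in-process simulation, not a real network stack: the `Receiver` suppresses duplicates by tracking the highest sequence number it has delivered, and the `Sender` retransmits when an ACK is "lost" (the `drop_acks` parameter is a hypothetical hook that stands in for a timeout firing).

```python
# Minimal sketch of ACK + timeout/retry + sequence numbers.
# Simulated in-process; drop_acks stands in for real timer expirations.

class Receiver:
    def __init__(self):
        self.highest_seen = -1
        self.delivered = []          # payloads passed up to the "application"

    def on_message(self, seq, payload):
        if seq <= self.highest_seen:
            return ("ACK", seq)      # duplicate: re-ACK, but do not redeliver
        self.highest_seen = seq
        self.delivered.append(payload)
        return ("ACK", seq)

class Sender:
    def __init__(self, receiver, drop_acks=()):
        self.receiver = receiver
        self.next_seq = 0
        self.drop_acks = set(drop_acks)  # (seq, attempt) pairs whose ACK is "lost"

    def send(self, payload, max_retries=5):
        seq = self.next_seq
        for attempt in range(max_retries):
            reply = self.receiver.on_message(seq, payload)
            if (seq, attempt) in self.drop_acks:
                continue             # ACK lost: timer expires, retransmit
            assert reply == ("ACK", seq)
            self.next_seq += 1
            return True
        raise TimeoutError("no ACK after retries")

rx = Receiver()
tx = Sender(rx, drop_acks={(1, 0)})  # drop the first ACK for message 1
tx.send("hello")
tx.send("world")                     # retransmitted once, delivered only once
print(rx.delivered)                  # ['hello', 'world'] -- no duplicate
```

Note that the retransmission of message 1 reaches the receiver twice, but the sequence-number check ensures the payload is handed to the application exactly once.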
The implementation of these low-level mechanisms is governed by the End-to-End Argument, which dictates that critical functionality (like reliable file transfer) must ultimately be validated at the highest application level, as lower-level protocols cannot guarantee absolute integrity across the entire hardware/software stack.
To hide the complexity of timeouts, retries, and sequencing from applications, systems encapsulate these mechanisms within higher-level programming abstractions.
Communication Abstractions
Distributed systems require structural abstractions to manage execution across multiple physical machines.
- Distributed Shared Memory (DSM): Extends a single virtual address space across networked machines via page faults. DSM suffers from severe performance penalties and catastrophic failure models; if one machine crashes, arbitrary data structures lose components, corrupting the entire address space.
- Remote Procedure Call (RPC): Abstracts remote execution to mimic local function calls. An RPC framework consists of two primary components:
- Stub Generator (Protocol Compiler): Automates the packaging of data.
- Marshaling (Serialization): The client stub packs the function identifier and memory arguments into a contiguous byte buffer.
- Unmarshaling (Deserialization): The server unpacks the message, executes the function, and marshals the return values.
- Run-Time Library: Manages the underlying network transport.
- Frequently built on UDP rather than TCP. Because RPC is natively a request/response protocol, the reply itself acts as an acknowledgment, making TCP's built-in ACK traffic redundant and a source of performance inefficiency.
- Handles byte ordering (endianness) across distinct architectures using standards like XDR (eXternal Data Representation).
- Maintains server-side thread pools to process multiple RPC invocations concurrently.
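Marshaling and dispatch can be illustrated with a toy RPC layer. This is a hedged sketch, not a real RPC framework: the function-ID table and the three-integer wire format are invented for illustration, but the big-endian packing mirrors the network byte order that XDR mandates.

```python
import struct

# Hypothetical marshaling sketch: pack a function ID and two uint32
# arguments into a big-endian buffer (network byte order, as in XDR),
# then unpack and dispatch on the "server" side.

FUNCS = {1: lambda a, b: a + b}       # function-ID -> implementation table

def marshal_call(func_id, a, b):
    # ">III" = three unsigned 32-bit integers, big-endian
    return struct.pack(">III", func_id, a, b)

def server_dispatch(buf):
    func_id, a, b = struct.unpack(">III", buf)
    result = FUNCS[func_id](a, b)
    return struct.pack(">I", result)  # marshal the return value

reply = server_dispatch(marshal_call(1, 20, 22))
(value,) = struct.unpack(">I", reply)
print(value)  # 42
```

In a real system, the stub generator emits the `marshal_call`/`server_dispatch` pair automatically from an interface description, so neither client nor server code touches byte buffers directly.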
RPC mechanisms provide the reliable execution foundation necessary for building robust distributed services, most notably distributed file systems optimized for disparate operational parameters.
Sun’s Network File System (NFS)
NFS operates as a client/server distributed file system with a primary design mandate: simple and fast server crash recovery.
To achieve instantaneous recovery, NFS utilizes a stateless protocol. The server tracks zero client state—it maintains no record of open files, file pointer offsets, or client cache contents. If the server crashes and reboots, it simply begins processing new requests without requiring a complex state-reconciliation handshake.
- File Handles: Because the server does not track open file descriptors, clients must explicitly identify target files in every request. A file handle consists of:
- Volume Identifier: Identifies the specific file system partition.
- Inode Number: The low-level identifier for the file/directory.
- Generation Number: Incremented upon inode reallocation to prevent clients from interacting with deleted and reassigned inodes.
- Idempotent Operations: Because network requests may fail or timeout, clients must be able to safely retry operations. NFS structures operations to be idempotent, meaning executing them multiple times yields the identical system state as executing them once.
LOOKUP, READ, and WRITE (which specifies an exact byte offset rather than relying on an implicit file pointer) are inherently idempotent.
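The value of carrying an explicit offset is easy to demonstrate. The sketch below is an assumed toy model (a dict of in-memory files keyed by a made-up handle `"fh1"`), not the NFS wire protocol: retrying an offset-based WRITE leaves the file in exactly the same state, whereas an append-style write would not.

```python
# Sketch: an NFS-style WRITE carries an explicit byte offset, so a client
# retry after a lost reply leaves the file in exactly the same state.

class ToyServer:
    def __init__(self):
        self.files = {"fh1": bytearray()}   # file-handle -> contents

    def write(self, fh, offset, data):      # idempotent: explicit offset
        f = self.files[fh]
        if len(f) < offset + len(data):
            f.extend(b"\x00" * (offset + len(data) - len(f)))
        f[offset:offset + len(data)] = data

s = ToyServer()
s.write("fh1", 0, b"hello")
s.write("fh1", 0, b"hello")   # retry after a lost reply: identical state
print(bytes(s.files["fh1"]))  # b'hello'

# Contrast: an append-style write (implicit file pointer) is NOT idempotent;
# retrying it would leave b'hellohello' behind.
```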
NFS Caching and Consistency
NFS clients cache data and metadata in local memory to reduce network latency, which introduces the cache consistency problem:
- Update Visibility: To ensure updates from one client eventually reach the server, NFS enforces flush-on-close (close-to-open) semantics. Dirty cached pages are forcibly transmitted to the server when the client application issues a close() syscall.
- Cache Staleness: To prevent clients from reading outdated cached data, clients issue GETATTR requests to verify the file's server-side modification time.
- Attribute Cache: To prevent GETATTR requests from overwhelming the server, clients cache attributes locally with a strict timeout (e.g., 3 seconds), eliminating network traffic for consecutive accesses within the time window.
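The attribute cache described above can be sketched as follows. This is a simplified model, not the real NFS client: the 3-second timeout matches the example in the text, and `server_getattr` is a hypothetical callable standing in for a network round trip.

```python
import time

# Sketch of an NFS-style attribute cache: GETATTR results are reused for a
# short window (3 seconds here) before the server is consulted again.

ATTR_TIMEOUT = 3.0

class Client:
    def __init__(self, server_getattr):
        self.server_getattr = server_getattr   # callable that hits the server
        self.cache = {}                        # fh -> (attrs, fetch_time)
        self.network_calls = 0

    def getattr(self, fh, now=None):
        now = time.monotonic() if now is None else now
        if fh in self.cache:
            attrs, fetched = self.cache[fh]
            if now - fetched < ATTR_TIMEOUT:
                return attrs                   # fresh enough: no network traffic
        self.network_calls += 1                # cache miss or stale: ask server
        attrs = self.server_getattr(fh)
        self.cache[fh] = (attrs, now)
        return attrs

c = Client(lambda fh: {"mtime": 100})
c.getattr("fh1", now=0.0)     # miss: goes to the server
c.getattr("fh1", now=1.0)     # hit: within the 3-second window
c.getattr("fh1", now=5.0)     # stale: re-fetches from the server
print(c.network_calls)        # 2
```

The trade-off is visible in the last line: two network calls instead of three, at the cost of a 3-second window during which a client may read attributes that the server has already changed.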
Server-Side Constraints
NFS servers must commit WRITE requests to stable, persistent storage before returning a success signal to the client. If a server buffers writes in volatile memory and crashes, the client falsely assumes success and never retries; as subsequent writes land around the lost block, the file ends up with new data intermingled with old, stale data.
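The required ordering is write, then force to stable storage, then reply. A minimal sketch of a server-side write handler, assuming a POSIX system (the path and handler name are invented for illustration):

```python
import os
import tempfile

# Sketch: an NFS server must force a WRITE to stable storage (fsync) before
# replying; otherwise a crash after the reply silently loses the data.

def handle_write(path, offset, data):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, data, offset)   # write at the client-specified offset
        os.fsync(fd)                  # commit to persistent storage FIRST
    finally:
        os.close(fd)
    return "OK"                       # only now is it safe to signal success

path = os.path.join(tempfile.mkdtemp(), "f")
handle_write(path, 0, b"data")
```

Reordering the `fsync` after the reply (or dropping it) is exactly the volatile-buffering bug described above; it is also why NFS server write performance is so sensitive to disk latency.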
While NFS optimizes for stateless recovery, its attribute polling and block-based transfers limit scalability, driving the need for highly scalable, stateful architectures.
The Andrew File System (AFS)
AFS was engineered to maximize server scalability, allowing a single server to support vastly more clients than NFS.
AFS replaces block-level memory caching with whole-file caching. When an application invokes open(), the AFS client fetches the entire file from the server and stores it on the client’s local hard disk. All subsequent read() and write() calls process entirely locally. Modified files are pushed back to the server in their entirety upon close().
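The whole-file protocol can be sketched as a small state machine. This is an illustrative model, not the AFS implementation: a dict stands in for the file server, and the local-disk cache is just an in-memory dict keyed by file name.

```python
# Sketch of AFS-style whole-file caching: open() fetches the full file into
# a local cache, read()/write() touch only the local copy, and close()
# pushes the whole (possibly modified) file back to the server.

class AFSClient:
    def __init__(self, server_files):
        self.server = server_files      # dict acting as the file server
        self.local = {}                 # local-disk cache: name -> bytearray
        self.dirty = set()

    def open(self, name):
        if name not in self.local:      # fetch the ENTIRE file, once
            self.local[name] = bytearray(self.server[name])
        return name

    def write(self, name, offset, data):
        self.local[name][offset:offset + len(data)] = data   # purely local
        self.dirty.add(name)

    def close(self, name):
        if name in self.dirty:          # push the whole file back
            self.server[name] = bytes(self.local[name])
            self.dirty.discard(name)

server = {"f": b"aaaa"}
c = AFSClient(server)
c.open("f")
c.write("f", 0, b"bb")      # no network traffic here
c.close("f")                # entire file transmitted now
print(server["f"])          # b'bbaa'
```

Note that between `open()` and `close()` the server sees no traffic at all, which is precisely where the scalability win over NFS's per-block requests comes from.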
AFSv1 Inefficiencies and AFSv2 Optimizations
The first iteration of AFS suffered from two bottlenecks that overloaded the server CPU:
- Path-Traversal Overheads: Clients passed full pathnames (e.g., /home/user/file.txt), so the server had to recursively traverse the directory tree for every request.
  - AFSv2 Solution: File Identifiers (FIDs). Similar to the NFS file handle, FIDs eliminate server-side directory walking. Clients iteratively fetch directories, cache them, and supply exact FIDs to the server.
- Polling Overheads: Clients flooded the server with TestAuth messages to verify whether cached files were still valid (analogous to the NFS GETATTR flood).
  - AFSv2 Solution: Callbacks. The server records state, promising to notify the client if a cached file is modified. Clients assume cached data is valid until the server explicitly breaks the callback.
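The callback protocol can be sketched in miniature. This is an assumed simplification, not the AFS implementation: the server keeps a per-file set of clients holding callbacks and "breaks" them on a store, while clients trust their cache with no polling until notified.

```python
# Sketch of AFSv2 callbacks: the server records which clients cache each
# file and breaks their callbacks when the file changes; until then,
# clients serve reads from cache with zero network traffic.

class CallbackServer:
    def __init__(self):
        self.files = {}
        self.callbacks = {}              # fid -> set of clients holding a callback

    def fetch(self, fid, client):
        self.callbacks.setdefault(fid, set()).add(client)
        return self.files[fid]

    def store(self, fid, data, writer):
        self.files[fid] = data
        for c in self.callbacks.pop(fid, set()) - {writer}:
            c.break_callback(fid)        # notify: cached copy is now invalid

class CallbackClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}                  # fid -> data, trusted until broken

    def read(self, fid):
        if fid not in self.cache:        # no TestAuth-style polling needed
            self.cache[fid] = self.server.fetch(fid, self)
        return self.cache[fid]

    def break_callback(self, fid):
        self.cache.pop(fid, None)

srv = CallbackServer()
srv.files["fid1"] = b"v1"
a, b = CallbackClient(srv), CallbackClient(srv)
a.read("fid1")                   # a caches v1 under a server-side callback
srv.store("fid1", b"v2", writer=b)
print(a.read("fid1"))            # b'v2': the broken callback forced a re-fetch
```

The cost is visible in the `callbacks` dict: the server now carries per-client state, which is exactly what complicates crash recovery in the section below.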
AFS Cache Consistency
Because AFS relies on whole-file caching and callbacks, its consistency model diverges from NFS:
- Cross-Machine: Employs last-writer-wins semantics. If two distinct clients modify a file concurrently, the client that calls close() last overwrites the entire file on the server.
- Local Machine: Concurrent updates by separate processes on the same client machine are immediately visible to one another, preserving standard POSIX semantics.
Crash Recovery and Performance Trade-Offs
Because callbacks inject state into the server, AFS crash recovery is complex. If the server reboots, it loses all callback state. Consequently, clients must treat all locally cached files as suspect upon reconnection and manually re-verify them via TestAuth.
AFS vs. NFS Performance:
- Large-File Sequential Re-reads: AFS drastically outperforms NFS. AFS serves the file from the local disk cache, whereas NFS must re-fetch the file from the network if it exceeds the client’s memory cache limit.
- Small Random Accesses in Large Files: NFS outperforms AFS. AFS forces a highly inefficient whole-file transfer to execute a minute read/write.
- Sequential Overwrites: NFS outperforms AFS. AFS requires fetching the entire old file from the server before executing the overwrite, whereas NFS safely overwrites discrete blocks directly to the server.
The architectural divergence between NFS’s stateless block-based model and AFS’s stateful whole-file model exemplifies the inherent system trade-offs between crash recovery simplicity and massive client scalability.