Consistency
Cache coherence guarantees that all processors eventually agree on the value of a single memory location. Memory consistency is a separate question: when exactly does a write to one location become visible relative to writes and reads of other locations?
Consider two processors, with locations A and B initialized to 0 and cached by both:
| P1 | P2 |
|---|---|
| A = 1 | B = 1 |
| L1: if (B == 0) … | L2: if (A == 0) … |
If writes take effect immediately and are visible to all processors at once, L1 and L2 cannot both evaluate true: reaching either if statement means the other processor's write must already have happened. But if write invalidations are delayed and each processor keeps executing before the other's invalidation arrives, P1 may not yet see B = 1 when it evaluates L1, and P2 may not yet see A = 1 when it evaluates L2, so both conditions can be true at once. Whether this outcome is allowed, and under what conditions, is exactly what a memory consistency model defines.
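A minimal sketch of this litmus test using C++11 atomics (the thread structure and variable names are illustrative, not from the original):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> A{0}, B{0};   // shared locations, both initially 0
int r1 = -1, r2 = -1;          // what each processor observed

void p1() {
    A.store(1, std::memory_order_seq_cst);   // A = 1
    r1 = B.load(std::memory_order_seq_cst);  // L1: read B
}

void p2() {
    B.store(1, std::memory_order_seq_cst);   // B = 1
    r2 = A.load(std::memory_order_seq_cst);  // L2: read A
}

int main() {
    std::thread t1(p1), t2(p2);
    t1.join();
    t2.join();
    // With seq_cst (sequential consistency), r1 == 0 && r2 == 0 is
    // impossible. Replace both orderings with memory_order_relaxed and
    // the hardware may delay visibility, allowing both reads to return 0.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```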
Sequential consistency requires every execution to appear as if each processor’s accesses run in program order, with all processors’ accesses arbitrarily interleaved into one global order. A processor must stall on any memory access until all invalidations from that access are acknowledged. The performance cost motivates a programmer-side contract instead: assume programs are synchronized.
- Synchronized program: every write of shared data by one processor and every subsequent access by another are separated by a sync pair: an unlock after the write and a lock before the read (see the sketch after this list).
- Data race: shared accesses not ordered by such synchronization; the outcome depends on relative processor speed and is unpredictable. A program with no data races is called data-race-free.
- Most programs are synchronized in practice — unsynchronized shared access is too unpredictable to reason about.
- Use standard sync libraries, not custom primitives — custom schemes are brittle and may not hold across hardware generations.
- A data-race-free program behaves as sequentially consistent even on relaxed hardware.
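As a concrete sketch (assuming a C++ program; the names are illustrative), a synchronized version of sharing one variable looks like this; the unlock after the write and the lock before the read form the required sync pair:

```cpp
#include <mutex>

std::mutex m;          // standard library primitive, not a custom scheme
int shared_data = 0;   // ordinary shared variable, never accessed unlocked

void writer() {
    std::lock_guard<std::mutex> guard(m);
    shared_data = 42;
}   // unlock after the write

void reader() {
    std::lock_guard<std::mutex> guard(m);   // lock before the read
    int observed = shared_data;             // every access pair is ordered
    (void)observed;
}
```

Because every write and subsequent read are separated by an unlock/lock pair, the program is data-race-free and behaves as sequentially consistent regardless of the underlying hardware model.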
Relaxed Consistency Models
The key idea is to allow reads and writes to complete out of order, but use synchronization operations to enforce ordering so that a synchronized program still behaves as sequentially consistent. Models are classified by which of the four orderings they relax.
The notation X → Y means that operation X must complete before operation Y begins; the four orderings are R → R, R → W, W → R, and W → W:
- Total Store Order (TSO) / Processor Consistency: relaxes only W → R. Write ordering is preserved, so many programs work without extra synchronization.
- Partial Store Order (PSO): relaxes W → R and W → W.
- Weak Ordering / Release Consistency: relax all four orderings. RISC-V, ARMv8, C, and C++ chose release consistency for its performance advantages.
| Model | Ordinary orderings preserved |
|---|---|
| Sequential consistency | R → R, R → W, W → R, W → W |
| Total Store Order (Processor Consistency) | R → R, R → W, W → W |
| Partial Store Order | R → R, R → W |
| Weak Ordering | — |
| Release Consistency | — |
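To make the W → W relaxation concrete, here is a hypothetical flag-publication idiom, sketched with C++ relaxed atomics to model the hardware's freedom. Under TSO the two stores stay ordered; under PSO or weak ordering (modeled here by memory_order_relaxed) they may not:

```cpp
#include <atomic>

std::atomic<int>  payload{0};
std::atomic<bool> ready{false};

void producer() {
    payload.store(42, std::memory_order_relaxed);   // W1: the data
    ready.store(true, std::memory_order_relaxed);   // W2: the flag
    // A model that relaxes W -> W may make W2 visible before W1.
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
    int v = payload.load(std::memory_order_relaxed);
    // v may still be 0: the flag arrived but the data did not.
    (void)v;
}
```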
Release consistency splits synchronization into an acquire (S_A) and a release (S_R), based on the observation that in synchronized programs an acquire always precedes use of shared data and a release always follows updates. This allows two relaxations:
- A read or write preceding an acquire need not complete before the acquire.
- A read or write following a release need not wait for the release.
Ordering is preserved only between S_A, S_R, and ordinary accesses (S_A → R, S_A → W, R → S_R, W → S_R, plus orderings among the synchronization operations themselves): the fewest constraints of any checkable model that still guarantees sequentially consistent execution for synchronized programs. Barriers act as both acquire and release, making their ordering equivalent to weak ordering. When ordering is needed without any identified sync operation, FENCE (RISC-V) guarantees all previous instructions in the thread have completed, including all writes and their invalidations.
Arrows indicate required ordering — sequential consistency orders everything; each weaker model removes arrows until release consistency enforces ordering only at acquire/release boundaries.
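The same flag-publication idiom works under release consistency once the flag operations are identified as synchronization, sketched here with the acquire/release atomics that C and C++ adopted; the fence at the end corresponds to the RISC-V FENCE described above:

```cpp
#include <atomic>

std::atomic<int>  payload{0};
std::atomic<bool> ready{false};

void producer() {
    payload.store(42, std::memory_order_relaxed);
    // S_R (release): all earlier reads and writes must complete before
    // this store, but accesses after it need not wait for it.
    ready.store(true, std::memory_order_release);
}

void consumer() {
    // S_A (acquire): reads and writes after this load may not begin
    // before it completes, but earlier accesses need not finish first.
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    int v = payload.load(std::memory_order_relaxed);  // guaranteed 42
    (void)v;
}

void full_barrier() {
    // Acts as both acquire and release, equivalent to weak ordering
    // (what a RISC-V FENCE provides at the instruction level):
    std::atomic_thread_fence(std::memory_order_seq_cst);
}
```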

Compiler Optimization and the Consistency Model
The compiler faces the same reordering problem as the hardware. Without explicit sync points, it cannot legally reorder reads and writes to different shared variables — doing so could change program semantics. This rules out common optimizations like keeping shared variables in registers across accesses. Shared data is also frequently accessed through pointers or array indexing, which further limits what compilers can safely infer and optimize. This is partly why RISC-V designers chose release consistency — to leave room for future compiler optimizations without breaking correctness.
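A hypothetical spin-wait illustrates the register constraint: if flag were an ordinary variable, the compiler could legally load it once, keep the value in a register, and turn the loop into an infinite one. Marking it atomic forces a fresh memory access on every iteration:

```cpp
#include <atomic>

std::atomic<bool> flag{false};   // were this a plain bool, the compiler
                                 // could hoist the load out of the loop

void waiter() {
    while (!flag.load(std::memory_order_acquire)) {
        // each iteration performs a real load; the loop exits when the
        // other thread's store becomes visible
    }
}

void signaler() {
    flag.store(true, std::memory_order_release);
}
```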
Using Speculation to Hide Latency
An OOO processor can use speculation to get most of the performance benefit of a relaxed model while presenting a sequentially consistent interface to the programmer. The processor reorders memory references dynamically, but monitors for coherence violations. If an invalidation arrives for a memory reference before it commits, the processor uses speculation recovery to roll back and restart from that point. The restart is rare — it only triggers on actual unsynchronized races.
This approach is preferable to relaxed models for three reasons:
- An aggressive speculative implementation of sequential or processor consistency captures most of the performance of relaxed models.
- It adds minimal complexity to a processor that already supports OOO execution.
- Programmers can reason using simpler, stricter consistency models.
The MIPS R10000 used this approach in the mid-1990s, leveraging its OOO capability to implement sequential consistency aggressively.