Maintaining Cache Coherence with Snooping

Snooping Coherence Fundamentals

Snooping maintains cache coherence for a single multicore chip or a small number of processors connected by a broadcast medium, such as a bus.

  • Invalidation Process:
    • To write to a shared block, a processor acquires access to the broadcast medium and broadcasts the target address.
    • All other processors continuously snoop the medium, compare the broadcast address against their local cache tags, and invalidate the block if a match is found.
  • Write Serialization:
    • Simultaneous writes to the same block arbitrate for access to the broadcast medium; the winner executes its broadcast and invalidates competing copies.
    • The losing processor must fetch a new, updated copy of the data before it can complete its write.
    • A write to shared data cannot complete until access to the broadcast medium is secured; this requirement is what serializes all writes.
  • Locating Data:
    • In systems with write-through caches, the most recent value is always retrievable from main memory.
    • In systems with write-back caches, the most recent value may reside in a private cache; the processor holding the modified (dirty) copy detects the read miss on the broadcast medium, provides the block to the requester, and forces the main memory access to abort.
    • Because write-back caches significantly reduce global memory bandwidth demands, they are standard across multicore architectures for L2 and L3 caches.
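
A minimal sketch of the read-miss location rule for write-back caches, using an invented cache/memory interface (dictionaries standing in for tags and data blocks), purely for illustration:

```python
# Sketch of the read-miss data-location rule for write-back caches:
# if some cache holds a dirty (Modified) copy, it supplies the block
# and the main memory access is aborted. The interfaces are invented.

def service_read_miss(block, caches, memory):
    """caches: list of dicts mapping block -> (data, dirty_flag)."""
    for cache in caches:                        # every cache snoops the address
        if block in cache and cache[block][1]:  # dirty copy found
            data, _ = cache[block]
            cache[block] = (data, False)        # owner downgrades its copy
            memory[block] = data                # write back to memory
            return data                         # memory read is aborted
    return memory[block]                        # no dirty copy: memory responds

memory = {0: "stale"}
caches = [{0: ("fresh", True)}, {}]             # cache 0 holds a dirty copy
print(service_read_miss(0, caches, memory))     # prints: fresh
```

After the miss is serviced, both the owner's copy and main memory hold the up-to-date value, matching the downgrade-to-shared behavior described above.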

To systematically implement these invalidation and location mechanisms, cache controllers utilize finite-state machines to track the exact sharing status of each cache block.

The Baseline MSI Protocol and State Tracking

Coherence is managed by a finite-state controller in each core that responds to both local processor operations and external broadcast requests. The baseline protocol utilizes three states (MSI):

  • Modified (M): The cache block has been updated in the private cache; it is the only valid copy, and the copy in main memory is stale.
  • Shared (S): The cache block is unmodified, is up-to-date in main memory, and is potentially cached by multiple processors.
  • Invalid (I): The cache block does not contain valid data.
  • State Transitions:
    • Any valid memory block exists either in the S state across one or more caches or in the M state in exactly one cache.
    • A local write operation requires transitioning the block to the M state, which mandates placing an invalidate or write miss on the broadcast medium to force all remote caches to transition their matching blocks to the I state.
    • If a remote read miss is snooped for a block currently in the M state, the owning cache writes back the data to memory and downgrades its local state to S.
  • Tag Management:
    • Every broadcast transaction requires checking cache-address tags, which can interfere with the processor’s own cache accesses.
    • Hardware often duplicates cache tags to allow background snoop accesses without halting processor execution.
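
The transitions above can be sketched as a toy per-block controller. This is a minimal model assuming an atomic bus and ignoring data movement; the method names are illustrative, not drawn from any real design:

```python
# Toy model of one MSI cache controller for a single block.

M, S, I = "Modified", "Shared", "Invalid"

class MSIController:
    def __init__(self):
        self.state = I

    def proc_read(self):
        """Local read: a miss in I broadcasts a read miss and lands in S."""
        if self.state == I:
            self.state = S          # fetch the block; now potentially shared
        # reads in M or S hit locally

    def proc_write(self):
        """Local write: must reach M, broadcasting an invalidate if needed."""
        if self.state in (I, S):
            self.state = M          # invalidate / write miss on the bus
        # writes in M hit locally

    def snoop_read_miss(self):
        """Remote read miss: an M copy writes back and downgrades to S."""
        if self.state == M:
            self.state = S          # supply data, update main memory

    def snoop_write_miss(self):
        """Remote write miss or invalidate: any local copy becomes invalid."""
        self.state = I

# Two caches racing on the same block:
a, b = MSIController(), MSIController()
a.proc_write()          # a holds the block in Modified
b.proc_read()           # b's read miss goes on the bus...
a.snoop_read_miss()     # ...and a snoops it, downgrading to Shared
```

Note the invariant from the bullets above: at every step, the block is either Shared in one or more caches or Modified in exactly one.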

While the MSI protocol provides the foundational logic for cache coherence, real-world implementations require structural extensions to handle specific access patterns and the inherent latency of hardware operations.

Protocol Extensions and Implementation Realities

To optimize network traffic and maintain physical correctness, the basic MSI protocol is augmented with additional states and serialization guarantees.

  • State Extensions:
    • MESI Protocol: Adds an Exclusive (E) state denoting a block that is resident in only a single cache but remains unmodified.
      • A block in the E state can be written (transitioning to M) without generating an invalidation broadcast, which optimizes the common case of a block that is read and then written by the same processor.
      • If another processor reads the block, the state downgrades to S.
      • Intel architectures utilize a variant called MESIF, which adds a Forward (F) state to designate exactly which sharing processor is responsible for responding to a miss.
    • MOESI Protocol: Adds an Owned (O) state to the MESI protocol.
      • Allows a block to transition from M to O without writing back to main memory, indicating that the cache is the owner and the main memory copy is stale.
      • Other caches reading the block receive it in the S state, and the owner (in the O state) is responsible for supplying the value on a miss and writing it back upon replacement.
  • Atomicity and Serializability Constraints:
    • Serializability: All coherence events must pass through a single common serialization point (such as a shared bus or shared cache) to ensure every processor observes events in the exact same order.
    • Atomicity: Coherence events span multiple clock cycles but must logically appear atomic to the system.
      • Achieved by introducing transient (hidden) states within the cache controller while it waits for a memory or bus response.
      • If a cache block in a transient state receives an external write miss or invalidate, it immediately reverts to the I state, forcing the local processor to retry the operation once the conflicting transaction completes.
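
Both ideas can be folded into one toy cache line: the E state makes the read-then-write upgrade silent, and an invented transient state ("IM_wait", for Invalid-waiting-to-become-Modified) models a broadcast whose response has not yet arrived. A sketch under those assumptions, not a real controller:

```python
# Toy MESI line with a transient state. State names and the
# "other_sharers" signal (the shared line on a bus) are assumptions.

class MESILine:
    def __init__(self):
        self.state = "I"
        self.broadcasts = 0        # bus messages this cache has issued

    def fill_on_read(self, other_sharers: bool):
        """A read miss fills in S if any other cache holds the block, else E."""
        self.broadcasts += 1       # the read miss itself goes on the bus
        self.state = "S" if other_sharers else "E"

    def write(self):
        """E -> M is silent; S/I -> M must broadcast and await the bus."""
        if self.state == "E":
            self.state = "M"       # silent upgrade: no bus traffic
        elif self.state in ("S", "I"):
            self.broadcasts += 1
            self.state = "IM_wait" # transient: response not yet seen

    def bus_response(self):
        """Our own broadcast completes: the transient state resolves to M."""
        if self.state == "IM_wait":
            self.state = "M"

    def snoop_write_miss(self):
        """A competing write wins the race: revert to I; processor retries."""
        self.state = "I"

# Private data: read with no sharers, then written -- one broadcast total.
a = MESILine()
a.fill_on_read(other_sharers=False)
a.write()                          # E -> M, silently

# Shared data: the upgrade broadcast can lose the race to another writer.
b = MESILine()
b.fill_on_read(other_sharers=True)
b.write()                          # S -> IM_wait, broadcast issued
b.snoop_write_miss()               # competitor's invalidate arrives first
```

The private path reaches M with a single bus message, while the shared path needs a second broadcast and may still be forced back to I, illustrating both the E-state optimization and the retry rule.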

Even with optimized state protocols and transient state handling, the necessity of broadcasting every coherence event fundamentally limits the scalability of snooping architectures.

Scalability Limits and Snoop Bandwidth

As multiprocessor systems grow, centralized broadcast resources and the snoop bandwidth at individual caches become severe physical bottlenecks.

  • Snoop Bandwidth Contention:
    • In a pure snooping system, every cache must process every miss in the system, threatening to consume all available cache bandwidth.
    • For a theoretical 16-processor system (4.0 GHz clock, CPI of 0.5, 40% load/store frequency, 15-cycle L2 snoop occupancy), keeping coherence traffic below 50% of total L2 bandwidth requires the coherence miss rate to stay below roughly 0.3%: each processor issues 4.0 GHz ÷ 0.5 CPI × 0.40 = 3.2 billion memory references per second, and every cache must snoop the misses of the other 15 processors.
  • Bandwidth Scaling Techniques:
    • Inclusive Shared LLC: A shared Last Level Cache (LLC) acts as a primary snoop filter. Snoops interrogate the LLC first; L2 caches are only snooped if the LLC confirms a hit, isolating private caches from irrelevant coherence traffic.
    • Point-to-Point Snooping: Architectures like AMD Opteron replace the shared bus with point-to-point links. They broadcast to connected chips and use explicit acknowledgment messages to determine when an invalidation has physically completed across the network.
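
Under the parameters of the 16-processor example above, the bound on the coherence miss rate can be checked with a short script. The accounting (one 15-cycle L2 tag access per snooped request) is our assumption about how the example is meant to be read:

```python
# Back-of-envelope check of the snoop-bandwidth bound for the
# 16-processor example. Parameters come from the text.

processors   = 16
clock_hz     = 4.0e9
cpi          = 0.5
mem_ref_frac = 0.40           # loads/stores per instruction
snoop_cycles = 15             # L2 occupancy per snooped request
snoop_share  = 0.50           # fraction of L2 bandwidth allowed for snoops

# Memory references issued per processor per second.
refs_per_sec = (clock_hz / cpi) * mem_ref_frac        # 3.2e9

# Requests the L2 tags can absorb per second, and the share for snoops.
l2_requests_per_sec = clock_hz / snoop_cycles          # ~2.67e8
snoop_budget        = snoop_share * l2_requests_per_sec

# Each cache snoops the coherence misses of the other 15 processors.
max_miss_rate = snoop_budget / ((processors - 1) * refs_per_sec)

print(f"coherence miss rate must stay below {max_miss_rate:.2%}")
# prints: coherence miss rate must stay below 0.28%
```

Even a fraction of a percent of coherence misses saturates half the L2 tag bandwidth, which is why the snoop-filtering techniques below are needed at scale.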

When evaluating the performance impact of these coherence mechanisms, it is critical to distinguish between misses caused by data capacity constraints and those directly induced by the sharing of data.

Multiprocessor Cache Performance and Sharing Misses

Overall cache performance in a multiprocessor combines the standard uniprocessor misses (compulsory, capacity, and conflict) with coherence misses.

  • Classification of Coherence Misses:
    • True Sharing Misses: Occur when a processor writes to a shared block (invalidating it globally), and another processor subsequently misses when attempting to read that newly updated word.
    • False Sharing Misses: Occur when a processor invalidates a block by writing to a word, causing a miss for another processor that is reading a completely different word residing within the identical cache block. This miss is purely an artifact of the block size and would not occur if the block size was a single word.
  • Workload Scaling Dynamics (OLTP Workload Example):
    • Impact of Cache Size: Increasing the L3 cache size systematically eliminates uniprocessor capacity and conflict misses. True and false sharing misses, however, shrink little as the cache grows, so coherence misses come to dominate overall memory stall time in large caches (e.g., 4–8 MiB).
    • Impact of Block Size:
      • Increasing block size (e.g., from 32 bytes to 256 bytes) effectively reduces compulsory and capacity/conflict misses due to increased spatial locality.
      • However, false sharing misses nearly double with larger block sizes, as a larger block encompasses more independent data structures, greatly increasing the probability of false invalidations.
      • While larger blocks lower the absolute miss rate, they significantly amplify total data traffic to memory, creating contention that can easily negate the performance gains achieved by the lowered miss rate.
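
The interaction between block size, false sharing, and traffic can be illustrated with a toy two-processor trace in which P0 repeatedly writes word 0 while P1 repeatedly reads the adjacent word 1. This is a word-granularity, write-invalidate model invented for illustration, not any real workload:

```python
# Toy trace showing that a larger block manufactures false-sharing
# misses, and that traffic scales as misses times block size.

def run(block_words, trace):
    """Return (misses, words_transferred) for a two-processor trace of
    (processor, 'r' or 'w', word_address) tuples."""
    held = [set(), set()]              # blocks each processor's cache holds
    misses = 0
    for proc, op, word in trace:
        block = word // block_words
        if block not in held[proc]:    # miss: fetch the whole block
            misses += 1
            held[proc].add(block)
        if op == "w":                  # write-invalidate: kill the other copy
            held[1 - proc].discard(block)
    return misses, misses * block_words

# P0 repeatedly writes word 0 while P1 repeatedly reads word 1.
trace = [(0, "w", 0), (1, "r", 1)] * 4

print(run(2, trace))   # (5, 10): 1 cold miss + 4 false-sharing misses
print(run(1, trace))   # (2, 2): cold misses only; words 0 and 1 no longer collide
```

With two-word blocks, every write by P0 invalidates the block P1 is reading even though they touch different words, and each extra miss moves the whole block, multiplying traffic; with one-word blocks the false sharing vanishes entirely.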