Multicore
Modern high-performance server scaling relies almost exclusively on multicore processor architectures. To address varying workloads and multi-chip scalability, leading architectures employ distinct organizational approaches for cores, caches, and memory interfaces:
- Intel Xeon Platinum:
- Utilizes a deeply integrated microarchitecture with up to 60 cores per chip, or up to 120 cores in a two-chip multichip module (MCM).
- Features a distributed Last Level Cache (LLC) of 1.375 MiB per core (22–77 MiB total) connected by multiple internal routing rings.
- Operates at slightly lower clock rates than desktop equivalents to remain within stringent thermal power limits.
- IBM Power10:
- Houses up to 15 cores per chip, with each core directly coupled to an 8 MiB L3 cache bank.
- Connects the distributed L3 banks and up to 16 independent memory channels via parallel on-chip routing rings.
- AMD EPYC Milan:
- Utilizes a highly modular chiplet design, incorporating 8 cores and a shared 32 MiB LLC per chiplet (see the sketch after this list).
- Standard sockets contain up to 8 chiplets, each chiplet maintaining an internal bidirectional ring interconnect.
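To make the chiplet arithmetic concrete, here is a minimal Python sketch that derives per-socket core and LLC totals from the per-chip figures above; the `Chip` class and the specific values are illustrative assumptions taken from the comparison table below, not vendor tooling or a catalog of shipping SKUs.

```python
# Illustrative sketch: per-socket totals from per-chip(let) parameters.
# The numbers are nominal figures from the comparison table.
from dataclasses import dataclass

@dataclass
class Chip:
    cores: int               # cores per chip or chiplet
    llc_mib_per_core: float  # shared LLC capacity per core, in MiB
    chips_per_socket: int    # chips/chiplets packaged per socket

    def socket_cores(self) -> int:
        return self.cores * self.chips_per_socket

    def socket_llc_mib(self) -> float:
        return self.socket_cores() * self.llc_mib_per_core

epyc_milan = Chip(cores=8,  llc_mib_per_core=4.0, chips_per_socket=8)  # 8 chiplets
power10    = Chip(cores=15, llc_mib_per_core=8.0, chips_per_socket=2)  # dual-chip module

print(epyc_milan.socket_cores(), epyc_milan.socket_llc_mib())  # 64 cores, 256.0 MiB
print(power10.socket_cores(), power10.socket_llc_mib())        # 30 cores, 240.0 MiB
```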
| Feature | IBM Power10 | Intel Xeon Platinum | AMD EPYC |
|---|---|---|---|
| Cores per chip; per socket (MCM) | 4–15 per chip; 8–30 per socket | 4–60 per chip; 8–120 per socket | 8 per chip; 64 per socket |
| Multithreading | SMT | SMT | SMT |
| Threads/core | 8 | 2 | 2 |
| Clock rate | 4.15 GHz | 2.2–3.6 GHz | 3.5 GHz |
| L1 I cache | 96 KiB per core | 32 KiB per core | 32 KiB per core |
| L1 D cache | 64 KiB per core | 32 KiB per core | 32 KiB per core |
| L2 cache | 2 MiB per core | 1 MiB per core | 512 KiB per core |
| L3 cache | 8 MiB per core; shared, with nonuniform access time | 22–77 MiB at 1.375 MiB per core; shared, with nonuniform access at larger core counts | 4 MiB per core, or 12 MiB per core with die-stacked cache; shared |
| Inclusion | L2 inclusive | L2 inclusive | L3 for shared blocks |
| Multicore coherence protocol | Extended MESI with behavioral and locality hints (13 states) | MESIF: extended MESI allowing direct transfers of clean blocks | MDOEFSI: extended MOESI with dirty and forward states |
| Multichip coherence | Hybrid: primarily snooping with limited directory | Hybrid: snooping and directory at L3 | Hybrid: snooping and directory (copy of L2 tags) at L3 |
| Multiprocessor interconnect | Up to 16 processor chips with 1 or 2 hops to any processor | Up to 8 processor chips via UPI | Infinity Fabric connects up to 8 chiplets per socket, up to 2 sockets |
| Processor chip range | 1–16 | 2–8 | 1–2 |
| Core count range | 4–240 | 8–480 | 8–128 |
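The coherence protocols in the table all extend the basic four-state MESI scheme. The sketch below models only those base transitions for a single cache line, from one core's point of view; it is a deliberate simplification (Power10's 13 states, Intel's Forward state, and AMD's dirty/forward states are omitted), and the function names are made up for illustration.

```python
# Minimal MESI sketch: next-state logic for one cache line in one core.
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

def on_local(state, op, others_have_copy=False):
    """Next state after this core reads or writes the line."""
    if op == "read":
        if state == I:
            return S if others_have_copy else E  # miss: fill from memory or a peer
        return state                             # M/E/S reads hit locally
    if op == "write":
        return M  # writer always ends Modified; from S/I, peers are invalidated (not modeled)

def on_remote(state, op):
    """Next state after snooping another core's read or write."""
    if op == "read":
        return S if state in (M, E) else state   # supply data, demote to Shared
    if op == "write":
        return I                                 # another writer invalidates our copy

# Example: core A reads a line nobody else holds, then core B writes it.
a = on_local(I, "read", others_have_copy=False)  # A: Exclusive
a = on_remote(a, "write")                        # B's write snoops A
print(a)  # Invalid
```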
Interconnection Strategies
To scale beyond a single processor, multicore architectures implement sophisticated inter-chip topologies that inherently create Non-Uniform Cache Access (NUCA) and Non-Uniform Memory Access (NUMA) environments:
- IBM Power10 Interconnect:
- Supports up to 16 discrete chips (240 cores total).
- Groups of 4 processor chips form a fully connected module via intragroup links.
- Intergroup links connect each chip to the three other modules, ensuring every processor is reachable within one or two network hops (checked in the sketch below).
- Intel Xeon UPI:
- Uses three Ultra Path Interconnect (UPI) links per processor to interface with neighboring chips.
- Supports up to 8 sockets (up to 480 cores), routed so that any socket is at most two hops away from any other.
- AMD Infinity Fabric:
- Provides full connectivity among the up to 8 chiplets residing within a single socket.
- Dual-socket configurations use remaining Infinity Fabric lanes to bridge the sockets, maintaining the one-to-two hop maximum distance across the entire system.
IBM Power10 chip: up to 15 cores, each with a 2 MiB L2 and an 8 MiB L3 bank; unified on-chip coherence and data interconnect; off-chip intragroup and intergroup SMP links; OpenCAPI accelerator interface; PCIe Gen 5 I/O; and 16×8 memory channels.

IBM Power10 multi-chip topology: up to 16 chips arranged in groups of 4. Within a group, chips are fully connected via 78.4 GB/s intragroup buses. Groups connect to one another via 25.6 GB/s intergroup cables, keeping any chip within two hops of any other.

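The two-hop bound for the 16-chip topology can be checked mechanically. The sketch below assumes one plausible wiring (full connectivity within each group of four, plus one link from each chip to its same-position peer in every other group); the exact link assignment is an assumption for illustration, since it is not specified above.

```python
# Verify the "at most two hops" claim for 4 groups x 4 chips.
from itertools import product

GROUPS, PER_GROUP = 4, 4
chips = list(product(range(GROUPS), range(PER_GROUP)))  # (group, position)

def connected(a, b):
    (ga, ia), (gb, ib) = a, b
    if a == b:
        return False
    if ga == gb:      # intragroup: fully connected
        return True
    return ia == ib   # intergroup: assumed link to same-position peer

def hops(src, dst):
    """Breadth-first search over the chip graph."""
    frontier, seen, d = {src}, {src}, 0
    while dst not in frontier:
        frontier = {c for f in frontier for c in chips
                    if connected(f, c) and c not in seen}
        seen |= frontier
        d += 1
    return d

print(max(hops(a, b) for a in chips for b in chips))  # -> 2
```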
Performance Scaling
Scaling behavior for independent, multiprogrammed tasks depends heavily on the underlying interconnect and memory bandwidth rather than on cache coherence overhead. Measured with the SPECintRate throughput benchmark at increasing core counts, scaling efficiency diverges significantly among the three architectures:
- IBM Power10: Exhibits near-linear scalability, retaining high efficiency at the largest measured core counts relative to its small baseline configuration.
- Intel Xeon Platinum: Experiences scaling degradation at high core counts, with efficiency falling well below linear relative to its baseline and continuing to decline as cores are added.
- AMD EPYC: Achieves superior raw performance at low core counts but runs into system-level or architectural limits as it scales, with efficiency dropping at the largest configurations relative to its baseline (the efficiency metric is sketched below).
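Scaling efficiency here means measured throughput relative to ideal linear scaling from the baseline configuration. The helper below states that metric explicitly; the numbers in the usage line are placeholders, not measured SPECintRate results.

```python
# Sketch of the scaling-efficiency metric used informally above.
def scaling_efficiency(base_cores, base_score, cores, score):
    """Fraction of ideal linear scaling achieved relative to the baseline."""
    ideal = base_score * (cores / base_cores)  # perfect throughput scaling
    return score / ideal

# Hypothetical: a 4-core baseline scoring 100, a 64-core system scoring 1200.
print(scaling_efficiency(4, 100.0, 64, 1200.0))  # 0.75 -> 75% efficiency
```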