Multicore

Modern high-performance server scaling relies almost exclusively on multicore processor architectures. To address varying workloads and multi-chip scalability, leading architectures employ distinct organizational approaches for cores, caches, and memory interfaces:

  • Intel Xeon Platinum:
    • Utilizes a deeply integrated microarchitecture with up to 60 cores per chip, or up to 120 cores in a two-chip module configuration.
    • Features a distributed Last Level Cache (LLC) of up to 77 MiB (1.375 MiB per core), connected by multiple internal routing rings.
    • Operates at slightly lower clock rates than desktop equivalents to remain within stringent thermal power limits.
  • IBM Power10:
    • Houses up to 15 cores per chip, with each core directly coupled to an 8 MiB L3 cache bank.
    • Connects the distributed L3 caches and up to 16 independent memory channels via parallel on-chip routing rings.
  • AMD EPYC Milan:
    • Utilizes a highly modular chiplet design, incorporating 8 cores and a shared 32 MiB LLC per chiplet.
    • Standard sockets contain multiple such chiplets (up to 64 cores), with a wide bidirectional ring interconnect within each chiplet.
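A quick way to sanity-check the cache figures is to total the last-level cache per socket. The configurations below are representative assumptions (a 30-core Power10 dual-chip module, a 56-core Xeon Platinum, a 64-core EPYC Milan socket), paired with the per-core L3 allocations given in this section.

```python
# Per-socket L3/LLC capacity implied by per-core allocations.
# Core counts here are assumed representative configurations, not the only
# options; L3-per-core values match the comparison table in this section.
designs = {
    "IBM Power10":         {"cores": 30, "l3_mib_per_core": 8.0},
    "Intel Xeon Platinum": {"cores": 56, "l3_mib_per_core": 1.375},
    "AMD EPYC Milan":      {"cores": 64, "l3_mib_per_core": 4.0},
}

for name, d in designs.items():
    total_mib = d["cores"] * d["l3_mib_per_core"]
    print(f"{name}: {total_mib:g} MiB of shared L3 per socket")
```

Note how differently the designs reach large LLCs: Power10 with big per-core banks, Xeon with a modest slice per core across many cores, and EPYC by replicating a sizable shared cache in every chiplet.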
Feature | IBM Power10 | Intel Xeon Platinum | AMD EPYC
Cores per chip; per socket (MCM) | 4–15 per chip; 8–30 per socket | 4–60 per chip; 8–120 per socket | 8 per chip; 64 per socket
Multithreading | SMT | SMT | SMT
Threads/core | 8 | 2 | 2
Clock rate | 4.15 GHz | 2.2–3.6 GHz | 3.5 GHz
L1 I cache | 96 KiB per core | 32 KiB per core | 32 KiB per core
L1 D cache | 64 KiB per core | 32 KiB per core | 32 KiB per core
L2 cache | 2 MiB per core | 1 MiB per core | 512 KiB per core
L3 cache | 8 MiB per core; shared, with nonuniform access time | 22–77 MiB at 1.375 MiB per core; shared, with larger core counts | 4–12 MiB per core (larger sizes use stacked memory); shared
Inclusion | L2 inclusive | L2 inclusive | L3 for shared blocks
Multicore coherence protocol | Extended MESI with behavioral and locality hints (13 states) | MESIF: extended MESI allowing direct transfers of clean blocks | MDOEFSI: extended MOESI with dirty and forward states
Multichip coherence | Hybrid: primarily snooping with limited directory | Hybrid: snooping and directory at L3 | Hybrid: snooping and directory (copy of L2 tags) at L3
Multiprocessor interconnect | Up to 16 processor chips, with 1 or 2 hops to any processor | Up to 8 processor chips via UPI | Infinity Fabric connects 4 chiplets per socket, up to 2 sockets
Processor chip range | 1–16 | 2–8 | 1–2
Core count range | 4–240 | 8–480 | 8–128
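The coherence-protocol row above can be made concrete with a toy model. The sketch below is a deliberate simplification (not Intel's actual implementation): it captures the key MESIF idea that one sharer holds a clean block in the Forward (F) state, so a read miss can be serviced by a cache-to-cache transfer rather than a trip to memory.

```python
# Toy MESIF read-miss handler. States per cache: M, E, S, I, F.
# Simplification for illustration only: writebacks and write-invalidations
# are elided; we model only who supplies the data on a read miss.

def read_miss(states, requester):
    """states: dict cache_id -> one of 'M', 'E', 'S', 'I', 'F'."""
    suppliers = [c for c, s in states.items() if s in ("M", "E", "F")]
    if suppliers:
        owner = suppliers[0]
        states[owner] = "S"       # previous holder drops to plain Shared
        states[requester] = "F"   # requester becomes the designated forwarder
        return f"cache-to-cache transfer from {owner}"
    # No forwarder available: memory supplies the block.
    states[requester] = "F" if "S" in states.values() else "E"
    return "supplied by memory"

caches = {"c0": "F", "c1": "S", "c2": "I"}
print(read_miss(caches, "c2"))   # c0 forwards the block; c2 takes over F
print(caches)
```

In plain MESI, a block shared by several caches is clean in all of them and a new reader must go to memory; the F state designates exactly one responder, which is the "direct transfer of clean blocks" noted in the table.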

Interconnection Strategies

To scale beyond a single processor, multicore architectures implement sophisticated inter-chip topologies that inherently create Non-Uniform Cache Access (NUCA) and Non-Uniform Memory Access (NUMA) environments:

  • IBM Power10 Interconnect:
    • Supports up to 16 discrete chips (240 cores total).
    • Groups 4 processor chips into a fully connected module via intragroup links.
    • Intergroup links connect each chip to the three other modules, ensuring every processor is reachable within one or two network hops.
  • Intel Xeon UPI:
    • Uses three Ultra Path Interconnect (UPI) links per processor to interface with neighboring chips.
    • Supports up to 8 sockets (up to 480 cores), routed so that any socket is at most two hops from any other.
  • AMD Infinity Fabric:
    • Provides full connectivity among the chiplets residing within a single socket.
    • Dual-socket configurations use the remaining Infinity Fabric lanes to bridge the sockets, maintaining a one-to-two-hop maximum distance across the entire system.

IBM Power10 chip: up to 15 cores, each with a 2 MiB L2 and an 8 MiB L3 bank; a unified on-chip coherence and data interconnect; off-chip intragroup and intergroup SMP links; an OpenCAPI accelerator interface; PCIe Gen 5 I/O; and 16 × 8 memory channels.

IBM Power8 multi-chip topology: up to 16 chips arranged in groups of 4. Within a group, chips are fully connected via a 78.4 GB/s intragroup bus. Groups connect to one another via 25.6 GB/s intergroup cables, keeping any chip within two hops of any other.
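The two-hop guarantee of this grouped topology can be checked mechanically: build a graph of 16 chips in 4 fully connected groups and verify its diameter with breadth-first search. The intergroup wiring pattern used here (each chip linked to its same-position counterpart in every other group) is an illustrative assumption consistent with the description above.

```python
from collections import deque

# Model of the 16-chip topology: 4 groups of 4 chips, fully connected within
# a group (intragroup bus). Intergroup wiring is assumed to link each chip
# to its same-position counterpart in every other group.
GROUPS, PER_GROUP = 4, 4
chips = [(g, i) for g in range(GROUPS) for i in range(PER_GROUP)]

def neighbors(chip):
    g, i = chip
    same_group = [(g, j) for j in range(PER_GROUP) if j != i]  # intragroup bus
    counterparts = [(h, i) for h in range(GROUPS) if h != g]   # intergroup cables
    return same_group + counterparts

def hops(src, dst):
    """Shortest hop count from src to dst via breadth-first search."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))

diameter = max(hops(a, b) for a in chips for b in chips)
print(f"network diameter = {diameter} hops")
```

Any two chips in the same group, or in the same position of different groups, are one hop apart; every other pair is bridged in two hops, so the diameter comes out to 2.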

Performance Scaling

Scaling behavior for independent, multiprogrammed tasks depends heavily on the underlying interconnect and memory bandwidth rather than on cache-coherence overhead. Measured with the SPECintRate benchmark across each system's range of core counts, scaling efficiency diverges significantly among the architectures:

  • IBM Power10: Exhibits near-linear scalability, maintaining high efficiency at its largest core counts relative to its baseline configuration.
  • Intel Xeon Platinum: Experiences scaling degradation at high core counts, with efficiency falling progressively relative to its small-core baseline.
  • AMD EPYC: Achieves superior raw performance at low core counts but encounters system-level or architectural limitations as it scales, with efficiency dropping relative to its baseline configuration.
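Scaling efficiency here means achieved speedup divided by the ideal linear speedup implied by the core-count ratio. The throughput numbers in the sketch below are hypothetical, chosen only to illustrate the arithmetic, not measured SPECintRate results.

```python
# Efficiency of multiprogrammed scaling: achieved speedup over a baseline,
# divided by the ideal (linear) speedup implied by the core-count ratio.
# Throughput values are made up purely to show the calculation.

def scaling_efficiency(base_cores, base_rate, cores, rate):
    speedup = rate / base_rate        # achieved throughput gain
    ideal = cores / base_cores        # linear-scaling expectation
    return speedup / ideal

# Hypothetical example: 4-core baseline at rate 100, 64 cores at rate 1440.
eff = scaling_efficiency(base_cores=4, base_rate=100.0, cores=64, rate=1440.0)
print(f"efficiency = {eff:.0%}")      # 14.4x speedup against a 16x ideal
```

An efficiency near 100% indicates the interconnect and memory system are keeping pace with the added cores; values well below it signal the bandwidth or system-level limits discussed above.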