Multicore

Modern high-performance server scaling relies almost exclusively on multicore processor architectures. To address varying workloads and multi-chip scalability, leading architectures employ distinct organizational approaches for cores, caches, and memory interfaces:

  • Intel Xeon Platinum:
    • Utilizes a deeply integrated microarchitecture with up to 60 cores per chip, or up to 120 cores in a two-chip module configuration.
    • Features a distributed Last Level Cache (LLC) of up to 77 MiB (1.375 MiB per core), connected by multiple internal routing rings.
    • Operates at slightly lower clock rates than desktop equivalents to remain within stringent thermal power limits.
  • IBM Power10:
    • Houses up to 15 cores per chip, with each core directly coupled to an 8 MiB L3 cache bank.
    • Connects the distributed L3 caches and up to 16 independent memory channels via parallel on-chip routing rings.
  • AMD EPYC Milan:
    • Utilizes a highly modular chiplet design, incorporating 8 cores and a shared 32 MiB LLC per chiplet.
    • Standard sockets contain multiple such chiplets (up to 64 cores), with a wide bidirectional ring interconnect within each chiplet.
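A quick way to sanity-check the cache figures is to total the last-level cache per socket. The configurations below are representative assumptions (a 30-core Power10 dual-chip module, a 56-core Xeon Platinum, a 64-core EPYC Milan socket), paired with the per-core L3 allocations given in this section.

```python
# Per-socket L3/LLC capacity implied by per-core allocations.
# Core counts here are assumed representative configurations, not the only
# options; L3-per-core values match the comparison table in this section.
designs = {
    "IBM Power10":         {"cores": 30, "l3_mib_per_core": 8.0},
    "Intel Xeon Platinum": {"cores": 56, "l3_mib_per_core": 1.375},
    "AMD EPYC Milan":      {"cores": 64, "l3_mib_per_core": 4.0},
}

for name, d in designs.items():
    total_mib = d["cores"] * d["l3_mib_per_core"]
    print(f"{name}: {total_mib:g} MiB of shared L3 per socket")
```

Note how differently the designs reach large LLCs: Power10 with big per-core banks, Xeon with a modest slice per core across many cores, and EPYC by replicating a sizable shared cache in every chiplet.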
Feature | IBM Power10 | Intel Xeon Platinum | AMD EPYC
Cores per chip; per socket (MCM) | 4–15 per chip; 8–30 per socket | 4–60 per chip; 8–120 per socket | 8 per chip; 64 per socket
Multithreading | SMT | SMT | SMT
Threads/core | 8 | 2 | 2
Clock rate | 4.15 GHz | 2.2–3.6 GHz | 3.5 GHz
L1 I cache | 96 KiB per core | 32 KiB per core | 32 KiB per core
L1 D cache | 64 KiB per core | 32 KiB per core | 32 KiB per core
L2 cache | 2 MiB per core | 1 MiB per core | 512 KiB per core
L3 cache | 8 MiB per core; shared, with nonuniform access time | 22–77 MiB at 1.375 MiB per core; shared, with larger core counts | 4–12 MiB per core (larger sizes use stacked memory); shared
Inclusion | L2 inclusive | L2 inclusive | L3 for shared blocks
Multicore coherence protocol | Extended MESI with behavioral and locality hints (13 states) | MESIF: extended MESI allowing direct transfers of clean blocks | MDOEFSI: extended MOESI with dirty and forward states
Multichip coherence | Hybrid: primarily snooping with limited directory | Hybrid: snooping and directory at L3 | Hybrid: snooping and directory (copy of L2 tags) at L3
Multiprocessor interconnect | Up to 16 processor chips, with 1 or 2 hops to any processor | Up to 8 processor chips via UPI | Infinity Fabric connects 4 chiplets per socket, up to 2 sockets
Processor chip range | 1–16 | 2–8 | 1–2
Core count range | 4–240 | 8–480 | 8–128
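The coherence-protocol row above can be made concrete with a toy model. The sketch below is a deliberate simplification (not Intel's actual implementation): it captures the key MESIF idea that one sharer holds a clean block in the Forward (F) state, so a read miss can be serviced by a cache-to-cache transfer rather than a trip to memory.

```python
# Toy MESIF read-miss handler. States per cache: M, E, S, I, F.
# Simplification for illustration only: writebacks and write-invalidations
# are elided; we model only who supplies the data on a read miss.

def read_miss(states, requester):
    """states: dict cache_id -> one of 'M', 'E', 'S', 'I', 'F'."""
    suppliers = [c for c, s in states.items() if s in ("M", "E", "F")]
    if suppliers:
        owner = suppliers[0]
        states[owner] = "S"       # previous holder drops to plain Shared
        states[requester] = "F"   # requester becomes the designated forwarder
        return f"cache-to-cache transfer from {owner}"
    # No forwarder available: memory supplies the block.
    states[requester] = "F" if "S" in states.values() else "E"
    return "supplied by memory"

caches = {"c0": "F", "c1": "S", "c2": "I"}
print(read_miss(caches, "c2"))   # c0 forwards the block; c2 takes over F
print(caches)
```

In plain MESI, a block shared by several caches is clean in all of them and a new reader must go to memory; the F state designates exactly one responder, which is the "direct transfer of clean blocks" noted in the table.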

Interconnection Strategies

To scale beyond a single processor, multicore architectures implement sophisticated inter-chip topologies that inherently create Non-Uniform Cache Access (NUCA) and Non-Uniform Memory Access (NUMA) environments:

  • IBM Power10 Interconnect:
    • Supports up to 16 discrete chips (240 cores total).
    • Groups 4 processor chips into a fully connected module via intragroup links.
    • Intergroup links connect each chip to the three other modules, ensuring every processor is reachable within one or two network hops.
  • Intel Xeon UPI:
    • Uses three Ultra Path Interconnect (UPI) links per processor to interface with neighboring chips.
    • Supports up to 8 sockets (up to 480 cores), routed so that any socket is at most two hops from any other.
  • AMD Infinity Fabric:
    • Provides full connectivity among the chiplets residing within a single socket.
    • Dual-socket configurations use the remaining Infinity Fabric lanes to bridge the sockets, maintaining a one-to-two-hop maximum distance across the entire system.

IBM Power10 chip: up to 15 cores, each with a 2 MiB L2 and an 8 MiB L3 bank; a unified on-chip coherence and data interconnect; off-chip intragroup and intergroup SMP links; an OpenCAPI accelerator interface; PCIe Gen 5 I/O; and 16 × 8 memory channels.

IBM Power8 multi-chip topology: up to 16 chips arranged in groups of 4. Within a group, chips are fully connected via a 78.4 GB/s intragroup bus. Groups connect to one another via 25.6 GB/s intergroup cables, keeping any chip within two hops of any other.
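The two-hop guarantee of this grouped topology can be checked mechanically: build a graph of 16 chips in 4 fully connected groups and verify its diameter with breadth-first search. The intergroup wiring pattern used here (each chip linked to its same-position counterpart in every other group) is an illustrative assumption consistent with the description above.

```python
from collections import deque

# Model of the 16-chip topology: 4 groups of 4 chips, fully connected within
# a group (intragroup bus). Intergroup wiring is assumed to link each chip
# to its same-position counterpart in every other group.
GROUPS, PER_GROUP = 4, 4
chips = [(g, i) for g in range(GROUPS) for i in range(PER_GROUP)]

def neighbors(chip):
    g, i = chip
    same_group = [(g, j) for j in range(PER_GROUP) if j != i]  # intragroup bus
    counterparts = [(h, i) for h in range(GROUPS) if h != g]   # intergroup cables
    return same_group + counterparts

def hops(src, dst):
    """Shortest hop count from src to dst via breadth-first search."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))

diameter = max(hops(a, b) for a in chips for b in chips)
print(f"network diameter = {diameter} hops")
```

Any two chips in the same group, or in the same position of different groups, are one hop apart; every other pair is bridged in two hops, so the diameter comes out to 2.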

Performance Scaling

Scaling behavior for independent, multiprogrammed tasks depends heavily on the underlying interconnect and memory bandwidth rather than on cache-coherence overhead. Measured with the SPECintRate benchmark across each system's range of core counts, scaling efficiency diverges significantly among the architectures:

  • IBM Power10: Exhibits near-linear scalability, maintaining high efficiency at its largest core counts relative to its baseline configuration.
  • Intel Xeon Platinum: Experiences scaling degradation at high core counts, with efficiency falling progressively relative to its small-core baseline.
  • AMD EPYC: Achieves superior raw performance at low core counts but encounters system-level or architectural limitations as it scales, with efficiency dropping relative to its baseline configuration.
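Scaling efficiency here means achieved speedup divided by the ideal linear speedup implied by the core-count ratio. The throughput numbers in the sketch below are hypothetical, chosen only to illustrate the arithmetic, not measured SPECintRate results.

```python
# Efficiency of multiprogrammed scaling: achieved speedup over a baseline,
# divided by the ideal (linear) speedup implied by the core-count ratio.
# Throughput values are made up purely to show the calculation.

def scaling_efficiency(base_cores, base_rate, cores, rate):
    speedup = rate / base_rate        # achieved throughput gain
    ideal = cores / base_cores        # linear-scaling expectation
    return speedup / ideal

# Hypothetical example: 4-core baseline at rate 100, 64 cores at rate 1440.
eff = scaling_efficiency(base_cores=4, base_rate=100.0, cores=64, rate=1440.0)
print(f"efficiency = {eff:.0%}")      # 14.4x speedup against a 16x ideal
```

An efficiency near 100% indicates the interconnect and memory system are keeping pace with the added cores; values well below it signal the bandwidth or system-level limits discussed above.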