Evaluation of AMD EPYC
Chris Hollowell <[email protected]>
HEPiX Fall 2018, PIC Spain
What is EPYC?
EPYC is a new line of x86_64 server CPUs from AMD based on their Zen microarchitecture
  Same microarchitecture used in their Ryzen desktop processors
  Released June 2017
First new high performance series of server CPUs offered by AMD since 2012
  Last were Piledriver-based Opterons
  Steamroller Opteron products cancelled
  AMD had focused on low power server CPUs instead
    x86_64 Jaguar APUs
    ARM-based Opteron A CPUs
Many vendors are now offering EPYC-based servers, including Dell, HP and Supermicro
How Does EPYC Differ From Skylake-SP?
Intel’s Skylake-SP Xeon x86_64 server CPU line also released in 2017
Both Skylake-SP and EPYC CPU dies manufactured using 14 nm process
Skylake-SP introduced AVX512 vector instruction support in Xeon
  AVX512 not available in EPYC
  HS06 official GCC compilation options exclude autovectorization
  Stock SL6/7 GCC doesn’t support AVX512
    Support added in GCC 4.9+
  Not heavily used (yet) in HEP/NP offline computing
Both have models supporting 2666 MHz DDR4 memory
  Skylake-SP
    6 memory channels per processor
    3 TB (2-socket system, extended memory models)
  EPYC
    8 memory channels per processor
    4 TB (2-socket system)
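The channel counts above translate directly into theoretical peak memory bandwidth. A rough sketch of that arithmetic, assuming standard 64-bit (8-byte) DDR4 channels and the nominal 2666 MT/s transfer rate; the function name is illustrative, and sustained bandwidth on real systems will be lower:

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth per socket, in GB/s (10^9 bytes/s)."""
    # channels * MT/s * bytes gives MB/s; divide by 1000 for GB/s
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


# Channel counts per processor as listed on the slide
skylake_sp = peak_bandwidth_gb_s(channels=6, mega_transfers_per_s=2666)
epyc = peak_bandwidth_gb_s(channels=8, mega_transfers_per_s=2666)

print(f"Skylake-SP: {skylake_sp:.1f} GB/s per socket")
print(f"EPYC:       {epyc:.1f} GB/s per socket")
```

Under these assumptions the two extra channels give EPYC roughly a third more peak bandwidth per socket (about 171 GB/s versus about 128 GB/s).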
How Does EPYC Differ From Skylake (Cont)?
Some Skylake-SP processors include built-in Omni-Path networking, or FPGA coprocessors
  Not available in EPYC
Both Skylake-SP and EPYC have SMT (HT) support
  2 logical cores per physical core (absent in some Xeon Bronze models)
Maximum core count (per socket)
  Skylake-SP – 28 physical / 56 logical (Xeon Platinum 8180M)
  EPYC – 32 physical / 64 logical (EPYC 7601)
Maximum socket count
  Skylake-SP – 8 (Xeon Platinum)
  EPYC – 2
Processor Interconnect
  Skylake-SP – UltraPath Interconnect (UPI)
  EPYC – Infinity Fabric (IF)
PCIe lanes (2-socket system)
  Skylake-SP – 96
  EPYC – 128 (some used by SoC functionality)
    Same number available in single socket configuration
EPYC: MCM/SoC Design
EPYC utilizes an SoC design
  Many functions normally found in motherboard chipset on the CPU
    SATA controllers
    USB controllers
    etc.
Each EPYC processor consists of four CPU dies, interconnected via Infinity Fabric
  Multi-Chip Module (MCM) architecture
  “CPU Complexes” (CCX)
  Each CCX attached to its own memory
    2 memory channels per CCX
  All Skylake-SP cores are on a single die
AMD claims MCM results in a cost reduction by improving yields
  Believed to scale better than monolithic die approach as core counts continue to increase
  Drawback: higher memory latency for non-NUMA-aware applications
EPYC: MCM/SoC Design (Cont.)
# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC 7351 16-Core Processor
Stepping:              2
CPU MHz:               2400.000
CPU max MHz:           2400.0000
CPU min MHz:           1200.0000
BogoMIPS:              4799.41
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             64K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-3,32-35
NUMA node1 CPU(s):     4-7,36-39
NUMA node2 CPU(s):     8-11,40-43
NUMA node3 CPU(s):     12-15,44-47
NUMA node4 CPU(s):     16-19,48-51
NUMA node5 CPU(s):     20-23,52-55
NUMA node6 CPU(s):     24-27,56-59
NUMA node7 CPU(s):     28-31,60-63

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Stepping:              4
CPU MHz:               2700.000
BogoMIPS:              5404.41
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71
EPYC vs Skylake-SP (SNC Disabled) NUMA Configuration
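To make the NUMA layout above easier to work with programmatically, the following is a minimal sketch that parses the `NUMA nodeN CPU(s)` lines of `lscpu` output into a node-to-CPU mapping. The function names are illustrative (not part of any tool), and the sample text is just two lines copied from the EPYC 7351 output:

```python
def expand_cpu_list(spec: str) -> list[int]:
    """Expand an lscpu-style CPU list such as '0-3,32-35' into CPU ids."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        elif part:
            cpus.append(int(part))
    return cpus


def numa_map(lscpu_text: str) -> dict[int, list[int]]:
    """Map NUMA node number -> logical CPU ids, from lscpu output text."""
    nodes = {}
    for line in lscpu_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        # Matches 'NUMA node0 CPU(s)' but not the 'NUMA node(s)' summary line
        if key.startswith("NUMA node") and key.endswith("CPU(s)"):
            node = int(key[len("NUMA node"):-len("CPU(s)")].strip())
            nodes[node] = expand_cpu_list(value)
    return nodes


sample = """NUMA node(s): 8
NUMA node0 CPU(s): 0-3,32-35
NUMA node1 CPU(s): 4-7,36-39"""

for node, cpus in numa_map(sample).items():
    print(f"node {node}: {cpus}")
```

A mapping like this could be used to pin a job to the logical CPUs of a single node (for example with `numactl` or `os.sched_setaffinity` on Linux), so that its memory accesses stay local to one die.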
Socket LGA 3647 & SP3
Both CPUs/sockets are quite large
Visible quadrants in the SP3 socket for the four CPU complexes in the EPYC processor
Skylake-SP – Socket LGA 3647
EPYC – Socket SP3
Skylake and EPYC Model Lineup Comparison
Model                 Base Frequency      Cores  SMT  TDP   Memory         Retail
Xeon Bronze 3104      1.7 GHz (no turbo)  6      No   85W   2133 MHz DDR4  $213
Xeon Silver 4110      2.1 GHz             8      Yes  85W   2400 MHz DDR4  $501
Xeon Gold 5115        2.4 GHz             10     Yes  85W   2666 MHz DDR4  $1,221
Xeon Gold 6130        2.1 GHz             16     Yes  125W  2666 MHz DDR4  $1,900
Xeon Gold 6136        3.0 GHz             12     Yes  150W  2666 MHz DDR4  $2,460
Xeon Gold 6148        2.4 GHz             20     Yes  150W  2666 MHz DDR4  $3,072
Xeon Gold 6150        2.7 GHz             18     Yes  165W  2666 MHz DDR4  $3,358
Xeon Platinum 8170    2.1 GHz             26     Yes  165W  2666 MHz DDR4  $7,405
Xeon Platinum 8180M   2.5 GHz             28     Yes  205W  2666 MHz DDR4  $13,011
EPYC 7251             2.1 GHz             8      Yes  120W  2400 MHz DDR4  $475
EPYC 7351             2.4 GHz             16     Yes  170W  2666 MHz DDR4  $1,110 (uniprocessor "P" model: $750)
EPYC 7401             2.0 GHz             24     Yes  170W  2666 MHz DDR4  $1,850 (uniprocessor "P" model: $1,075)
EPYC 7451             2.3 GHz             24     Yes  180W  2666 MHz DDR4  $2,400
EPYC 7551             2.0 GHz             32     Yes  180W  2666 MHz DDR4  $3,400
EPYC 7601             2.2 GHz             32     Yes  180W  2666 MHz DDR4  $4,200
EPYC vs Skylake-SP: HEP/NP Performance Benchmarks
HEPSPEC06
  “all_cpp” subset of SPEC-CPU2006 run in parallel
CERN Cloud Benchmark Suite
  Various benchmarks, run in parallel
    DB12
    Whetstone
    ATLAS KV
Unless noted, memory configured to utilize all 8 channels per CPU on EPYC, and 6 channels per CPU for Skylake-SP, with at least 2 GB RAM/logical core
  ~11% HS06 performance degradation seen for EPYC 7441 when only populating half of the memory channels
All 2666 MHz DDR4
Noted dual rank (DR) DIMMs downclocked to 2400 MHz for EPYC
All run under SL/CentOS/RHEL 7
SMT/Hyperthreading enabled, unless otherwise indicated
Systems are dual CPU, unless noted
EPYC HEPSPEC06: SMT Off vs On
[Bar chart: HS06 score per configuration, SMT off vs on]
Configuration                        Threads  HS06
EPYC 7401P (uniprocessor), SMT off   24       368
EPYC 7401P (uniprocessor), SMT on    48       489
EPYC 7351, SMT off                   32       541
EPYC 7351, SMT on                    64       780
EPYC 7451, SMT off (DDR4-2400)       48       883
EPYC 7451, SMT on (DDR4-2400)        96       1101
EPYC 7551, SMT off                   64       872
EPYC 7551, SMT on                    128      1148
EPYC 7601, SMT off (DDR4-2400)       64       1078
EPYC 7601, SMT on (DDR4-2400)        128      1296
25%+ HS06 performance improvement with SMT (“hyperthreading”) enabled
EPYC vs Skylake-SP: HEPSPEC06
[Bar chart: HS06 score per system]
System                      Threads  HS06
Xeon Gold 5115 +            40       394
Xeon Gold 6130              64       729
Xeon Gold 6136              48       790
Xeon Gold 6148              80       1068
Xeon Gold 6150              72       1035
Xeon Platinum 8170 *        104      1261
EPYC 7401P (uniprocessor)   48       489
EPYC 7351                   64       780
EPYC 7451 (DDR4-2400)       96       1101
EPYC 7551                   128      1148
EPYC 7601 (DDR4-2400)       128      1296
+ = System using only 3 memory channels per CPU
* = Value reported by CERN
EPYC vs Skylake: HEPSPEC06 (Cont.)
Larger values are better
Similar maximum HS06 (~1,275) performance for the models tested
  Data for highest level EPYC (7601), but not highest model Skylake-SP (8180M)
  Can assume Xeon Skylake 8180M would perform better than the 8170 value listed
    More cores/threads per socket than the 8170 (28/56 vs 26/52), and higher clock speed
      2.5 GHz vs 2.1 GHz
Mid-range model HS06 performance also similar
  ~700 HS06 - ~1100 HS06
TDP somewhat higher for EPYC CPUs vs Xeon Gold, in general
  165 W max Xeon Gold, vs 180 W max EPYC
  Can likely expect EPYC to use a bit more power as a result
EPYC vs Skylake-SP: CERN Cloud Benchmarks
[Bar charts: DB12 (aggregate, Dirac HS06 Est.), Whetstone (aggregate, BWIPS), ATLAS KV (aggregate, Events/Sec)]
System             Threads  DB12  Whetstone  ATLAS KV
Xeon Gold 5115 +   40       220   114        15
Xeon Gold 6150     72       998   262        65
EPYC 7351          64       733   210        67
EPYC 7551          128      1256  361        120
+ = System using only 3 memory channels per CPU
EPYC/Skylake: CERN Cloud Benchmarks (Cont.)
Results for a limited number of CPUs: only possible to run the full suite (including KV) on systems with CVMFS client installed/setup
By default the CERN cloud benchmarks run one instance of a benchmark per logical core in parallel
  However, only reports performance per logical core
  Interested in aggregate system performance, not performance/logical core
    For DB12 and Whetstone, simply multiplied result by number of logical cores
    For KV, average seconds/event per logical core is reported
      Took the inverse, and multiplied by the number of logical cores to obtain total events/sec
Larger graphed values are better
DB12 and Whetstone results fairly in line with HS06
Expected somewhat better KV performance for the Xeon Gold 6150
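The aggregation described above can be sketched in a few lines. This is an illustration of the arithmetic only, not the benchmark suite's own code; the function names and the input numbers in the example are made up:

```python
def aggregate_score(per_core_score: float, logical_cores: int) -> float:
    """DB12 / Whetstone: scale the per-logical-core result to the whole system."""
    return per_core_score * logical_cores


def aggregate_kv_events_per_sec(sec_per_event: float, logical_cores: int) -> float:
    """ATLAS KV: invert seconds/event, then scale by the number of logical cores."""
    if sec_per_event <= 0:
        raise ValueError("seconds/event must be positive")
    return logical_cores / sec_per_event


# Hypothetical inputs, for illustration only (not measured values)
db12_total = aggregate_score(per_core_score=10.0, logical_cores=64)
kv_total = aggregate_kv_events_per_sec(sec_per_event=2.0, logical_cores=64)

print(f"DB12 aggregate: {db12_total}")
print(f"KV aggregate:   {kv_total} events/sec")
```

Both helpers assume every logical core runs one benchmark instance at the reported average rate, which matches how the suite is run by default.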
EPYC vs Skylake-SP: CPU HS06/Dollar
[Bar chart: HS06 per dollar of retail CPU cost]
System                      Threads  HS06/$
Xeon Gold 5115 +            40       0.32
Xeon Gold 6130              64       0.19
Xeon Gold 6136              48       0.16
Xeon Gold 6148              80       0.17
Xeon Gold 6150              72       0.15
Xeon Platinum 8170          104      0.09
EPYC 7401P (uniprocessor)   48       0.45
EPYC 7351                   64       0.35
EPYC 7451 (DDR4-2400)       96       0.23
EPYC 7551                   128      0.17
EPYC 7601 (DDR4-2400)       128      0.15
Only retail CPU cost accounted for in calculations
  Does not represent reality given memory and base server pricing
+ = System using only 3 memory channels per CPU
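As a worked example of the HS06/$ metric, the sketch below divides a system's HS06 score by the retail cost of its CPUs only. The helper name is illustrative; the scores and prices for the two dual-CPU systems are the ones quoted in this talk:

```python
def hs06_per_dollar(hs06: float, cpu_price: float, cpus_per_system: int) -> float:
    """HS06 score divided by the retail price of the CPUs alone."""
    return hs06 / (cpu_price * cpus_per_system)


# (HS06 score, retail price per CPU in USD, CPUs per system)
systems = {
    "EPYC 7351": (780, 1110, 2),
    "Xeon Gold 6148": (1068, 3072, 2),
}

for name, (hs06, price, cpus) in systems.items():
    print(f"{name}: {hs06_per_dollar(hs06, price, cpus):.2f} HS06/$")
```

Because memory and the base server are left out, this ratio favours inexpensive CPUs more strongly than a full system cost estimate would.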
EPYC vs Skylake-SP: Estimated 25kHS06 Cost
[Bar chart: estimated total cost, in $1k, of 25kHS06 ±500HS06]
System                      Memory per server  Servers  Est. cost ($1k)
Xeon Gold 5115 +            6*16 GB DIMMs      64       332
Xeon Gold 6130              12*16 GB DIMMs     34       256
Xeon Gold 6136              12*8 GB DIMMs      32       258
Xeon Gold 6148              12*16 GB DIMMs     23       233
Xeon Gold 6150              12*16 GB DIMMs     24       257
Xeon Platinum 8170          12*32 GB DIMMs     20       417
EPYC 7401P (uniprocessor)   8*16 GB DIMMs      51       256
EPYC 7351                   16*8 GB DIMMs      32       189
EPYC 7451                   16*16 GB DIMMs     23       221
EPYC 7551                   16*16 GB DIMMs     22       256
EPYC 7601                   16*16 GB DIMMs     19       251
Estimated total cost of 25kHS06 ±500HS06
  Assuming $1,500 irreducible server cost, and retail CPU/memory pricing
+ = System using only 3 memory channels per CPU
EPYC vs Skylake-SP: Est. 25kHS06 Cost (Cont.)
Server counts to achieve 25kHS06 ±500HS06 (±2%)
Majority of compute node cost in CPU and memory
  Assuming no excessive local storage space or IOPS requirements
    Typically the case for HEP/NP
Retail CPU costs used in estimate
  Likely to receive volume or competitive discounts
Estimate assumes a server without CPUs and memory costs $1,500
  Includes power supply, disk, NIC, etc.
  Only accounts for cost of servers themselves. Associated costs such as racks, network switches, integration, shipping, etc. not included
  Server vendors typically increase base prices for servers which support higher performing CPU models with higher TDP (e.g. due to bigger PSUs)
    $1,500 server base cost may be lower than reality for systems with higher end CPUs
Memory costs: retail Samsung server DIMM pricing
  DDR4 2666 MHz, ECC, registered
    8 GB DIMM - $137
    16 GB DIMM - $208
    32 GB DIMM - $380
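The cost model described on this slide can be written out as a short calculation. This is a sketch under stated assumptions: the function names are illustrative, and rounding the server count to the nearest whole number is a guess at how the ±500 HS06 tolerance was applied. The prices, HS06 scores and DIMM configurations are those quoted in the talk:

```python
BASE_SERVER_COST = 1500                   # server without CPUs and memory (USD)
DIMM_PRICE = {8: 137, 16: 208, 32: 380}   # DIMM size in GB -> retail price (USD)
TARGET_HS06 = 25_000


def servers_needed(hs06_per_server: float, target: float = TARGET_HS06) -> int:
    """Whole number of servers closest to the HS06 target (assumed rounding)."""
    return round(target / hs06_per_server)


def cluster_cost(hs06_per_server: float, cpu_price: int, cpus: int,
                 dimm_gb: int, dimms: int) -> int:
    """Total cost of enough servers to reach roughly 25 kHS06."""
    per_server = BASE_SERVER_COST + cpus * cpu_price + dimms * DIMM_PRICE[dimm_gb]
    return servers_needed(hs06_per_server) * per_server


# Dual EPYC 7351, 16 x 8 GB DIMMs, 780 HS06 per server
epyc_7351 = cluster_cost(780, cpu_price=1110, cpus=2, dimm_gb=8, dimms=16)
# Dual Xeon Gold 6148, 12 x 16 GB DIMMs, 1068 HS06 per server
gold_6148 = cluster_cost(1068, cpu_price=3072, cpus=2, dimm_gb=16, dimms=12)

print(f"EPYC 7351:      {servers_needed(780)} servers, ${epyc_7351:,}")
print(f"Xeon Gold 6148: {servers_needed(1068)} servers, ${gold_6148:,}")
```

For these two systems the sketch gives 32 servers at about $189k and 23 servers at about $233k, in line with the graphed estimates.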
EPYC vs Skylake-SP: Est. 25kHS06 Cost (Cont.)
Enough memory populated per system to provide 2 GB/logical core
  Fairly standard for HEP/NP
Ensured all 6 (Skylake) or 8 (EPYC) memory channels per CPU utilized for maximum bandwidth, and NUMA performance
  Often ended up with more RAM than required to satisfy 2 GB/logical core
  6 channels makes this particularly difficult for Skylake: installed memory not a power of two
  Problem compounded by server manufacturers not offering smaller (i.e. 4 GB), less expensive DDR4 DIMMs
Estimated cost for 25kHS06 fairly similar between Skylake-SP Xeon Gold and EPYC servers: most close to $250k
Dual-CPU EPYC 7351 systems appear very cost effective, however: est. $189k
  ~25% less than the Xeon Gold 6148
  Required memory/logical core and DIMM channel parity
  CPU itself is inexpensive compared to other Skylake/EPYC counterparts
EPYC 7401P uniprocessor system HS06/$ for CPU cost initially looked promising
  Large number of servers required: irreducible per server cost added up
  Estimated in the $250k range like many of the other CPUs
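The memory sizing constraint above (fill every channel with identical DIMMs, and provide at least 2 GB per logical core) can be illustrated with a small helper. The function name is hypothetical, and the available DIMM sizes are assumed to be the three priced in the talk:

```python
DIMM_SIZES_GB = (8, 16, 32)   # sizes priced in the talk; no 4 GB option offered


def smallest_dimm_config(logical_cores: int, channels: int,
                         gb_per_core: int = 2) -> tuple[int, int, int]:
    """Smallest DIMM size that fills every channel and meets the RAM target.

    Returns (DIMM size in GB, installed GB, GB beyond the requirement).
    """
    required_gb = logical_cores * gb_per_core
    for size in DIMM_SIZES_GB:
        installed = size * channels
        if installed >= required_gb:
            return size, installed, installed - required_gb
    raise ValueError("no listed DIMM size is large enough")


# Dual EPYC 7351: 64 logical cores, 2 x 8 channels
print(smallest_dimm_config(logical_cores=64, channels=16))
# Dual Xeon Gold 6148: 80 logical cores, 2 x 6 channels
print(smallest_dimm_config(logical_cores=80, channels=12))
```

The EPYC 7351 case lands exactly on the 128 GB requirement with 8 GB DIMMs, while the Xeon Gold 6148 has to step up to 16 GB DIMMs and ends up with 32 GB more than it needs, which is the parity effect the slide refers to.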
Side-Channel Attacks
Jan 2018 - New class of side-channel information disclosure vulnerabilities in CPU hardware made public
  Meltdown, Spectre
  Exploit speculative execution and caching optimizations in CPUs
List of similar side-channel attack vectors continues to grow
  Speculative Store Bypass Vulnerability
  Foreshadow (L1TF)
Microcode updates for Skylake-SP released for all of the above
AMD claims EPYC is not vulnerable to Meltdown or Foreshadow
  Due to existing protections in their paging architecture
  Released microcode updates for Spectre
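On Linux systems with a sufficiently recent kernel, the mitigation status the kernel reports for these vulnerabilities can be read from sysfs. The sketch below is one way to collect it; the function name is illustrative, and it simply returns an empty result where the directory is not present:

```python
from pathlib import Path

# sysfs directory where the Linux kernel reports per-vulnerability status
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_vulnerability_status(directory: Path = VULN_DIR) -> dict[str, str]:
    """Return {vulnerability name: kernel-reported status}; empty if unavailable."""
    if not directory.is_dir():
        return {}
    status = {}
    for entry in sorted(directory.iterdir()):
        if entry.is_file():
            status[entry.name] = entry.read_text().strip()
    return status


if __name__ == "__main__":
    for name, state in read_vulnerability_status().items():
        print(f"{name:20s} {state}")
```

Entries such as `meltdown`, `spectre_v2`, `spec_store_bypass` and `l1tf` typically report either "Not affected" or the mitigation in use, which makes it easy to compare EPYC and Skylake-SP hosts side by side.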
The Future: EPYC 2 and Cascade Lake
EPYC 2 - Rome
  Expected in early 2019
  7 nm process
  Support for DDR4-3200 MHz DIMMs expected
    Still 8 channels
  Max core count per socket increased to 64 (128 threads)
  AVX512?
Cascade Lake Xeon
  Expected end of 2018
  14 nm process
    10 nm process expected in Ice Lake in 2020
  Max memory speed: DDR4-2600 DIMMs
    Still 6 channels
  Max core count per socket remains at 28 (56 threads)
  Expected to support VNNI instructions
    Utilizes AVX512 units
  Support for Optane 3D XPoint memory
  Announced that the Frontera supercomputer at TACC will be Cascade Lake based – estimated to provide 35-40 PFLOPS without GPUs
Both CPUs will have Spectre mitigations built into hardware
Conclusions
The EPYC MCM architecture is considerably different from Skylake-SP’s single die configuration
Similar HEP/NP benchmark performance from mid/upper range Skylake-SP Xeon Gold and mid/upper range AMD EPYC CPUs
Pricing also similar
  However, dual-CPU EPYC 7351-based systems appear to be a sweet spot for applications requiring 2 GB/logical core (somewhat typical for HEP/NP software)
Competition in the server CPU market will likely help reduce cost and spur innovation
EPYC2 (Rome) with its 7nm process and 64 physical cores appears poised to disrupt the existing balance of server CPU market share