Intel Korea Sep 2017
2
Notices and Disclaimers
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
No computer system can be absolutely secure.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Intel, the Intel logo, Intel Optane and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
* Other names and brands may be claimed as the property of others. © 2017 Intel Corporation.
4@IntelAI
The coming flood of data
By 2020…
• The average internet user will generate ~1.5 GB of traffic per day
• Smart hospitals will be generating over 3,000 GB per day
• Self driving cars will be generating over 4,000 GB per day… each
• A connected plane will be generating over 40,000 GB per day
• A connected factory will be generating over 1,000,000 GB per day
All numbers are approximated
http://www.cisco.com/c/en/us/solutions/service-provider/vni-network-traffic-forecast/infographic.html
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
https://datafloq.com/read/self-driving-cars-create-2-petabytes-data-annually/172
radar ~10-100 KB per second
sonar ~10-100 KB per second
gps ~50 KB per second
lidar ~10-70 MB per second
cameras ~20-40 MB per second
1 car 5 exaflops per hour
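As a rough cross-check on the per-car figure, the per-second sensor rates listed above can be summed and converted to GB per hour. This is only a sketch: the decimal unit conversion (1 GB = 1,000 MB = 1,000,000 KB) and the idea that every sensor streams continuously are assumptions, not statements from the slide.

```python
# Sensor rates quoted on the slide as (low, high), converted to MB/s.
# Assumption: decimal units, 1 MB = 1000 KB.
SENSOR_RATES_MB_S = {
    "radar": (0.010, 0.100),
    "sonar": (0.010, 0.100),
    "gps": (0.050, 0.050),
    "lidar": (10.0, 70.0),
    "cameras": (20.0, 40.0),
}


def total_rate_mb_s(rates):
    """Sum the low and high ends of every sensor range."""
    low = sum(lo for lo, _ in rates.values())
    high = sum(hi for _, hi in rates.values())
    return low, high


def gb_per_hour(rate_mb_s):
    """Convert a sustained MB/s rate into GB per hour (1 GB = 1000 MB)."""
    return rate_mb_s * 3600 / 1000


low, high = total_rate_mb_s(SENSOR_RATES_MB_S)
print(f"combined rate: {low:.2f} to {high:.2f} MB/s")
print(f"per hour:      {gb_per_hour(low):.1f} to {gb_per_hour(high):.1f} GB")
```

Under these assumptions the cameras and lidar dominate the total; radar, sonar and GPS together contribute well under 1 MB/s.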
5
The next big wave of computing
Source: Intel
AI compute cycles will grow 12X by 2020
Mainframes
Standards-based servers
Cloud computing
Artificial intelligence
Data deluge
Compute breakthrough
Innovation surge
6
AI is transforming industries
Examples:
• Consumer: Smart Assistants, Chatbots, Search, Personalization, Augmented Reality, Robots
• Health: Enhanced Diagnostics, Drug Discovery, Patient Care, Research, Sensory Aids
• Finance: Algorithmic Trading, Fraud Detection, Research, Personal Finance, Risk Mitigation
• Retail: Support, Experience, Marketing, Merchandising, Loyalty, Supply Chain, Security
• Government: Defense, Data Insights, Safety & Security, Resident Engagement, Smarter Cities
• Energy: Oil & Gas Exploration, Smart Grid, Operational Improvement, Conservation
• Transport: Automated Cars, Automated Trucking, Aerospace, Shipping, Search & Rescue
• Industrial: Factory Automation, Predictive Maintenance, Precision Agriculture, Field Automation
• Other: Advertising, Education, Gaming, Professional & IT Services, Telco/Media, Sports
Early adoption
Source: Intel forecast
7
A common language for AI Today
Memory based
Machine Learning
remember
Act, adapt, reason, sense
Reasoning systems
Classic ML
Artificial intelligence
Deep learning Logic based
8
Intel AI portfolio
Experiences
Tools
Intel® Deep Learning SDK
Intel® Computer Vision SDK
E2E Tool
Frameworks
Intel Dist
Mlib
BigDL
Intel® Nervana™ Graph*
Libraries
Intel® MKL
MKL-DNN
Intel® MLSL
Intel® DAAL
Movidius MvTensor Library
Associative Memory Base
Hardware
Compute
Memory & Storage
Networking
Lake Crest
Visual Intelligence
Movidius Neural Compute Stick
*Coming 2017
9
✝Codename for product that is coming soon
All performance positioning claims are relative to other processor technologies in Intel’s AI datacenter portfolio
*Knights Mill (KNM); select = single-precision highly-parallel workloads generally scale to >100 threads and benefit from more vectorization, and may also benefit from greater memory bandwidth e.g. energy (reverse time migration), deep learning training, etc.
All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
AI Datacenter
• All purpose: Intel® Xeon® Processor Family. Most agile AI platform. Scalable performance for widest variety of AI & other datacenter workloads, including deep learning training & inference
• Highly-parallel: Intel® Xeon Phi™ Processor (Knights Mill✝). Faster DL training. Scalable performance optimized for even faster deep learning training and select highly-parallel datacenter workloads*
• Flexible acceleration: Intel® FPGA. Enhanced DL inference. Scalable acceleration for deep learning inference in real-time with higher efficiency, and wide range of workloads & configurations
• Deep Learning: Crest Family✝. Deep learning by design. Scalable acceleration with best performance for intensive deep learning training & inference
Astrophysics | Manufacturing | Energy | Security | Financial | Life Sciences | Climate | Weather
11
12
Growing Challenges in HPC
System Bottlenecks ("The Walls")
Memory | I/O | Storage
Energy Efficient Performance
Space | Resiliency | Unoptimized Software
Divergent Workloads
Resources Split Among Modeling and Simulation | Big Data Analytics | Machine Learning | Visualization
Barriers to Extending Usage
Democratization at Every Scale | Cloud Access | Exploration of New Parallel Programming Models
[Figure labels: Big Data, HPC, Machine learning, Visualization, Optimizing For Cloud]
13
Intel® Scalable System Framework
Many Workloads – one Framework
Modeling & Simulation | Machine Learning | Visualization | HPC Data Analytics
A Flexible Framework for Today & Tomorrow
Enabling Breakthrough
System Performance
Intel® Xeon® Processor Roadmap
Intel® Xeon® Processor E5: Targeted at a wide variety of applications that value a balanced system with leadership performance/watt/$
18 cores
Intel® Xeon® Processor E7: Targeted at mission critical applications that value a scale-up system with leadership memory capacity and advanced RAS
Grantley-EP Platform
E5 v3 E5-2600 v4
Brickland Platform
E7 v3 E7 v4
Purley Platform
Skylake
E5 v3 E5-4600 v4 (4S)
Cascade Lake
2016 2017 2018
Intel Xeon GOLD
Intel® Xeon® PLATINUM
Intel Xeon SILVER
Intel Xeon BRONZE
Converged platform with innovative Skylake-SP microarchitecture
14
Intel® Xeon® Scalable Platform Feature Overview
[Figure: 2-socket platform block diagram. Two Skylake-SP CPUs linked by 2 or 3 Intel® UPI, each with 3x16 PCIe* Gen3, DDR4-2666 memory and an optional 1x 100Gb OPA Fabric; Lewisburg PCH attached via DMI with 4x10GbE NIC, Intel® QAT, ME, IE and High Speed IO (USB3, PCIe3, SATA3); also shown: GPIO, BMC, eSPI/LPC, SPI, Firmware, TPM, 10GbE, and CPU, OPA and Mem VRs.]
BMC: Baseboard Management Controller
PCH: Intel® Platform Controller Hub
IE: Innovation Engine
Intel® OPA: Intel® Omni-Path Architecture
Intel QAT: Intel® QuickAssist Technology
ME: Manageability Engine
NIC: Network Interface Controller
VMD: Volume Management Device
NTB: Non-Transparent Bridge
Feature | Details
Socket | Socket P
Scalability | 2S, 4S, 8S, and >8S (with node controller support)
CPU TDP | 70W – 205W
Chipset | Intel® C620 Series (code name Lewisburg)
Networking | Intel® Omni-Path Fabric (integrated or discrete); 4x10GbE (integrated w/ chipset); 100G/40G/25G discrete options
Compression and Crypto Acceleration | Intel® QuickAssist Technology to support 100Gb/s comp/decomp/crypto; 100K RSA2K public key
Storage | Integrated QuickData Technology, VMD, and NTB; Intel® Optane™ SSD, Intel® 3D-NAND NVMe & SATA SSD
Security | CPU instruction enhancements (MBE, PPK, MPX); Manageability Engine; Intel® Platform Trust Technology; Intel® Key Protection Technology
Manageability | Innovation Engine (IE); Intel® Node Manager; Intel® Datacenter Manager
15
Platform Topologies
[Figure: 2S configurations (2S-2UPI & 2S-3UPI shown), 4S configurations (4S-2UPI & 4S-3UPI shown) and 8S configuration. Skylake (SKL) sockets are linked by Intel® UPI; each socket provides 3x16 PCIe* and optionally 1x100G Intel® OP Fabric; Lewisburg (LBG) PCHs attach via DMI x4**.]
Intel® Xeon® Scalable Processor supports configurations ranging from 2S-2UPI to 8S
16
Intel® Xeon® Scalable Processors
Re-architected from the ground up
• Skylake core microarchitecture with data center specific enhancements
• Intel® AVX-512 with 32 DP flops per cycle per core
• Data center optimized cache hierarchy – 1MB L2 per core, non-inclusive L3
• New mesh interconnect architecture
• Enhanced memory subsystem
• Modular IO with integrated devices
• New Intel® Ultra Path Interconnect (Intel® UPI)
• Intel® Speed Shift Technology
• Security & Virtualization enhancements (MBE, PPK, MPX)
• Optional Integrated Intel® Omni-Path Fabric (Intel® OPA)
Features | Intel® Xeon® Processor E5-2600 v4 | Intel® Xeon® Scalable Processor
Cores Per Socket | Up to 22 | Up to 28
Threads Per Socket | Up to 44 threads | Up to 56 threads
Last-level Cache (LLC) | Up to 55 MB | Up to 38.5 MB (non-inclusive)
QPI/UPI Speed (GT/s) | 2x QPI channels @ 9.6 GT/s | Up to 3x UPI @ 10.4 GT/s
PCIe* Lanes / Controllers / Speed (GT/s) | 40 / 10 / PCIe* 3.0 (2.5, 5, 8 GT/s) | 48 / 12 / PCIe 3.0 (2.5, 5, 8 GT/s)
Memory Population | 4 channels of up to 3 RDIMMs, LRDIMMs, or 3DS LRDIMMs | 6 channels of up to 2 RDIMMs, LRDIMMs, or 3DS LRDIMMs
Max Memory Speed (MT/s) | Up to 2400 | Up to 2666
TDP (W) | 55W-145W | 70W-205W
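The memory rows of the comparison translate into a theoretical peak bandwidth per socket. The sketch below multiplies channels by transfer rate by bus width; the 8-byte (64-bit) data bus per DDR4 channel is an assumption added here, and sustained bandwidth on real systems will be lower than these peaks.

```python
def peak_memory_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak = channels x transfers per second x bytes per transfer."""
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Channel counts and speeds are taken from the comparison table.
e5_2600_v4 = peak_memory_bandwidth_gb_s(channels=4, mega_transfers_per_s=2400)
scalable = peak_memory_bandwidth_gb_s(channels=6, mega_transfers_per_s=2666)

print(f"E5-2600 v4: {e5_2600_v4:.1f} GB/s per socket")
print(f"Scalable:   {scalable:.1f} GB/s per socket")
print(f"Ratio:      {scalable / e5_2600_v4:.2f}x")
```

Most of the increase comes from moving from four to six channels (1.5x); the higher transfer rate supplies the rest.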
[Figure: Skylake-SP socket block diagram: cores around a shared L3, 2 or 3 UPI links, 6 channels DDR4, 48 lanes PCIe* 3.0, DMI3, and an optional Omni-Path HFI.]
17
Core Microarchitecture Enhancements
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
19
Feature | Broadwell uArch | Skylake uArch
Out-of-order Window | 192 | 224
In-flight Loads + Stores | 72 + 42 | 72 + 56
Scheduler Entries | 60 | 97
Registers – Integer + FP | 168 + 168 | 180 + 168
Allocation Queue | 56 | 64/thread
L1D BW (B/Cyc) – Load + Store | 64 + 32 | 128 + 64
L2 Unified TLB | 4K+2M: 1024 | 4K+2M: 1536
1G: 16
[Figure: Skylake core block diagram. Front end: 32KB L1 I$, pre-decode, instruction queue, decoders, branch prediction unit, μop cache and μop queue. In-order allocate/rename/retire feeds an out-of-order scheduler with load, store and reorder buffers. Execution ports 0, 1, 5 and 6 carry ALU, shift, LEA, MUL, DIV, FMA, shuffle and JMP units; ports 2 and 3 handle load/STA, port 4 store data, port 7 STA. Memory side: 32KB L1 D$, fill buffers, memory control and 1MB L2$.]
About 10% performance improvement per core on integer applications at same frequency
• Larger and improved branch predictor, higher throughput decoder, larger window to extract ILP
• Improved scheduler and execution engine, improved throughput and latency of divide/sqrt
• More load/store bandwidth, deeper load/store buffers, improved prefetcher
• Data center specific enhancements: Intel® AVX-512 with 2 FMAs per core, larger 1MB MLC
Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
• 512-bit wide vectors
• 32 operand registers
• 8 64b mask registers
• Embedded broadcast
• Embedded rounding
Microarchitecture | Instruction Set | SP FLOPs / cycle | DP FLOPs / cycle
Skylake | Intel® AVX-512 & FMA | 64 | 32
Haswell / Broadwell | Intel AVX2 & FMA | 32 | 16
Sandybridge | Intel AVX (256b) | 16 | 8
Nehalem | SSE (128b) | 8 | 4
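The per-cycle figures in this table follow from vector width and the number of floating-point operations each lane can issue per cycle. The sketch below reproduces them; the per-lane issue rates (two FMA units on Skylake and Haswell/Broadwell, one multiply plus one add unit on Sandybridge and Nehalem) are assumptions used for illustration and are not stated on the slide, apart from "2 FMAs per core" for Skylake.

```python
def flops_per_cycle(vector_bits, element_bits, flops_per_lane):
    """Peak FLOPs per cycle per core = SIMD lanes x FLOPs each lane issues per cycle."""
    lanes = vector_bits // element_bits
    return lanes * flops_per_lane


# (name, vector width in bits, assumed FLOPs per lane per cycle)
#   4 = two FMA units, each counting as a multiply and an add
#   2 = one multiply unit plus one add unit, no FMA
GENERATIONS = [
    ("Skylake", 512, 4),
    ("Haswell / Broadwell", 256, 4),
    ("Sandybridge", 256, 2),
    ("Nehalem", 128, 2),
]

peaks = {
    name: (flops_per_cycle(bits, 32, per_lane), flops_per_cycle(bits, 64, per_lane))
    for name, bits, per_lane in GENERATIONS
}

for name, (sp, dp) in peaks.items():
    print(f"{name:20s} SP {sp:3d}  DP {dp:3d}")
```

Multiplying a per-cycle figure by core count and sustained clock gives a peak FLOP rate, which is why clock frequency under wide-vector load matters as much as vector width.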
Intel AVX-512 Instruction Types
AVX-512-F | AVX-512 Foundation Instructions
AVX-512-VL | Vector Length Orthogonality: ability to operate on sub-512 vector sizes
AVX-512-BW | 512-bit Byte/Word support
AVX-512-DQ | Additional D/Q/SP/DP instructions (converts, transcendental support, etc.)
AVX-512-CD | Conflict Detect: used in vectorizing loops with potential address conflicts
Powerful instruction set for data-parallel computation
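To see why loops with potential address conflicts need special handling, consider a histogram update processed one vector of indices at a time. The sketch below is a plain-Python emulation, not AVX-512 code: the 16-lane width and all function names are hypothetical, and the conflict-aware version only mimics the idea of splitting a vector into conflict-free passes.

```python
VECTOR_LANES = 16  # assumed: sixteen 32-bit indices per 512-bit vector


def histogram_scalar(indices, bins):
    """Reference result: one update at a time."""
    hist = [0] * bins
    for i in indices:
        hist[i] += 1
    return hist


def histogram_naive_vector(indices, bins):
    """Gather, add, scatter a whole vector at once; duplicate indices lose updates."""
    hist = [0] * bins
    for start in range(0, len(indices), VECTOR_LANES):
        chunk = indices[start:start + VECTOR_LANES]
        gathered = [hist[i] for i in chunk]   # gather old counts
        updated = [v + 1 for v in gathered]   # vector add
        for i, v in zip(chunk, updated):      # scatter, last write wins
            hist[i] = v
    return hist


def histogram_conflict_aware(indices, bins):
    """Split each vector into passes whose indices are all distinct."""
    hist = [0] * bins
    for start in range(0, len(indices), VECTOR_LANES):
        pending = list(indices[start:start + VECTOR_LANES])
        while pending:
            seen, this_pass, deferred = set(), [], []
            for i in pending:
                (deferred if i in seen else this_pass).append(i)
                seen.add(i)
            gathered = [hist[i] for i in this_pass]
            for i, v in zip(this_pass, gathered):
                hist[i] = v + 1
            pending = deferred
    return hist


sample = [0, 1, 1, 2, 2, 2]
print(histogram_scalar(sample, 3))
print(histogram_naive_vector(sample, 3))
print(histogram_conflict_aware(sample, 3))
```

The naive version undercounts whenever an index repeats within one vector, while the conflict-aware version matches the scalar result at the cost of extra passes for vectors that contain duplicates.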
20
Performance and Efficiency with Intel® AVX-512
Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC.
LINPACK Performance
Instruction set | GFLOPs | System Power (W) | Core Frequency (GHz)
SSE4.2 | 669 | 760 | 3.1
AVX | 1178 | 768 | 2.8
AVX2 | 2034 | 791 | 2.5
AVX512 | 3259 | 767 | 2.1
GFLOPs / Watt, normalized to SSE4.2
SSE4.2 | AVX | AVX2 | AVX512
1.00 | 1.74 | 2.92 | 4.83
GFLOPs / GHz, normalized to SSE4.2
SSE4.2 | AVX | AVX2 | AVX512
1.00 | 1.95 | 3.77 | 7.19
Intel® AVX-512 delivers significant performance and efficiency gains
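The two normalized charts on this slide can be re-derived from the raw LINPACK chart values (GFLOPs, system power and core frequency per instruction set). A small sketch of that calculation, with the chart values typed in by hand:

```python
# Raw values read from the LINPACK Performance chart on this slide.
LINPACK = {
    "SSE4.2": {"gflops": 669, "watts": 760, "ghz": 3.1},
    "AVX": {"gflops": 1178, "watts": 768, "ghz": 2.8},
    "AVX2": {"gflops": 2034, "watts": 791, "ghz": 2.5},
    "AVX512": {"gflops": 3259, "watts": 767, "ghz": 2.1},
}


def normalized(metric, data, baseline="SSE4.2"):
    """Evaluate a metric for every entry and scale it so the baseline equals 1.00."""
    base = metric(data[baseline])
    return {name: round(metric(row) / base, 2) for name, row in data.items()}


per_watt = normalized(lambda r: r["gflops"] / r["watts"], LINPACK)
per_ghz = normalized(lambda r: r["gflops"] / r["ghz"], LINPACK)

print("GFLOPs/Watt:", per_watt)
print("GFLOPs/GHz: ", per_ghz)
```

The calculation makes the trade-off visible: core frequency falls from 3.1 GHz to 2.1 GHz as vectors widen, yet throughput per GHz grows roughly sevenfold and system power stays nearly flat.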
21
22
1.63x Average Gains on High Performance Compute Apps
[Chart: performance relative to Broadwell (E5-2697 v4) = 1.00. Workloads: WRF*, HOMME*, LSTC LS-DYNA Explicit*, INTES PERMAS* V16, MILC*, GROMACS*, VASP*, NAMD*, LAMMPS*, Amber GB*, Binomial option pricing, Black-Scholes, Monte Carlo European options. Bar values as extracted: 1.42, 1.87, 2.38, 1.56, 1.58, 1.67, 1.73, 1.75, 1.41, 1.44, 1.52, 1.41, 1.68.]
Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Geomean of Weather Research Forecasting - Conus 12Km, HOMME, LSTC LS-DYNA Explicit, INTES PERMAS V16, MILC, GROMACS water 1.5M_pme, VASP Si256, NAMD stmv, LAMMPS, Amber GB Nucleosome, Binomial option pricing, Black-Scholes, Monte Carlo European options. Any difference in system hardware or software design or configuration may affect actual performance.
For more information go to http://www.intel.com/performance/datacenter.
Configurations: see page 54,55
Segments: Earth Systems Models | Manufacturing | Life Sciences | FSI
Legend: Intel® Xeon® Gold 6148 processor (higher is better)
Benefits of Intel® Xeon® processor Scalable family
• Intel® AVX-512 and Higher IPC/Per Core performance
• Balanced IO and memory
• Predictable latency with mesh/memory architecture improvements
Additional options to scale performance
• Intel® Optane™ SSD DC P4800X Series for fast storage
• Intel® Omni-Path Fabric to scale cluster performance
23
LSTC LS-DYNA Explicit
Application / Workload Description:
• LS-DYNA is a popular crash simulation application. It is used by the automobile, aerospace, construction, military, manufacturing, and bioengineering industries worldwide.
Key Takeaway:
• All major automakers and aerospace customers can benefit from the increased performance
• Faster simulation turnover
• Influencing customers to migrate to Intel® AVX-512
Performance Factors:
• More cores and threads, 50% more memory bandwidth and an improved cache hierarchy.
• Additional performance improvement with Intel® AVX-512
Up to YY% faster
1. Testing conducted on ISV* software comparing Intel® Xeon® Gold 6148 processor to 2S Intel® Xeon® Processor E5-2697 v3 and to 2S Intel® Xeon® Processor E5-2697 v4. Testing done by Intel. For complete testing configuration details, see slide 29.
[Chart: Normalized Performance (www.lstc.com). Callouts: up to 2.4X faster, up to 23% faster, up to 25% faster.]
LS-DYNA explicit increased performance by up to 1.79X with the Intel® Xeon® Gold 8164 processor
Workload: 2M elements Car2car model with 120ms simulation time
24
Monte Carlo European Options
Application / Workload Description:
• Monte Carlo is a numerical method that uses statistical sampling techniques to approximate solutions to quantitative problems. In finance, Monte Carlo algorithms are used to evaluate complex instruments, portfolios, and investments. This is a compute-bound, double-precision workload.
Key Takeaway:
• Higher performance allows either doing the same work faster, leading to improved TCO, or simulating more paths, leading to higher confidence in results.
Performance Factors:
• Using Intel® AVX-512 SIMD vectorization improved performance by 1.85X over Intel® AVX2.
• Higher core counts of the Intel® Xeon® Gold 6148 processor contribute to higher performance.
• Better memory hierarchy adds to the performance.
• Code modernization strategy: parallelize the outer loop over options and vectorize the inner loop over paths.
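The loop structure behind that code modernization strategy can be sketched as follows. This is an illustrative pure-Python model, not the benchmarked code: the geometric Brownian motion pricing model, the parameter values and every function name are assumptions, and Python itself runs both loops sequentially. The comments mark which loop would be threaded and which would be vectorized in a compiled implementation.

```python
import math
import random


def mc_european_call(spot, strike, rate, vol, maturity, n_paths, rng):
    """Price one European call by sampling terminal prices (assumed GBM dynamics)."""
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    payoff_sum = 0.0
    # Inner loop over paths: iterations are independent, so this is the SIMD candidate.
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        terminal = spot * math.exp(drift + diffusion * z)
        payoff_sum += max(terminal - strike, 0.0)
    return math.exp(-rate * maturity) * payoff_sum / n_paths


def price_portfolio(options, n_paths=20000, seed=1):
    """Outer loop over options: independent work items, the thread-level candidate."""
    rng = random.Random(seed)
    return [mc_european_call(*opt, n_paths, rng) for opt in options]


def black_scholes_call(spot, strike, rate, vol, maturity):
    """Closed-form reference, used here only to sanity-check the simulation."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return spot * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)


# Hypothetical options: (spot, strike, rate, volatility, maturity in years)
options = [(100.0, 100.0, 0.05, 0.2, 1.0), (100.0, 110.0, 0.05, 0.2, 1.0)]
prices = price_portfolio(options)
references = [black_scholes_call(*opt) for opt in options]
for estimate, exact in zip(prices, references):
    print(f"Monte Carlo {estimate:.3f}   closed form {exact:.3f}")
```

Simulating more paths narrows the gap between the estimate and the closed-form value, which is the "higher confidence in results" trade-off named in the key takeaway.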
Up to YY% faster
1 - Testing conducted on Black-Scholes software comparing 2S Intel® Xeon® Processor E5-2697v3 to 2S Intel® Xeon® Processor E5-2697v4 and 2S Intel® Xeon® Gold 6148 processor. Testing done by Intel. For complete testing configuration details, see slide 31.
[Chart: Normalized Performance (metric: speed-up using options/sec) for 2S Intel® Xeon® processor E5-2697 v3, 2S Intel® Xeon® processor E5-2697 v4 and Intel® Xeon® Gold 6148 processor. Callouts: up to 1.72X faster, up to 1.3X faster, up to 3.1X faster.]
Increased Monte Carlo European Option performance with the 2S Intel® Xeon® Gold 6148 processor
25
LAMMPS
Application / Workload Description:
• LAMMPS is a classical molecular dynamics code, and an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator. It is used to simulate the movement of atoms to develop better therapeutics, improve alternative energy devices, develop new materials, and more.
Key Takeaway:
• The improved performance allows for longer time scales, larger simulations, and/or improved sampling and statistics.
• The continued advances in molecular dynamics performance on Intel® architecture allow computational scientists to solve new and more complex problems.
Performance Factors:
• Intel® AVX-512: up to 49% gain¹ versus Intel® AVX2
Up to YY% faster
Up to 1.72X faster
Testing conducted on LAMMPS* code comparing 2S Intel® Xeon® Gold 6148 processor to 2S Intel® Xeon® Processor E5-2697 v3 and to 2S Intel® Xeon® Processor E5-2697 v4. Reported Intel® AVX-512 gains are compared to running an Intel® AVX2 binary using all cores on the same platform. Reported increased number of cores gains are compared to running reduced number of cores on the same platform. Testing done by Intel. For complete testing configuration details, see slide 32.
[Chart: Normalized Performance for 2S Intel® Xeon® processor E5-2697 v3, 2S Intel® Xeon® processor E5-2697 v4 and Intel® Xeon® Gold 6148 processor (lammps.sandia.gov). Callouts: up to 39% faster, up to 2.4X faster.]
Workload: LAMMPS CG Water Simulation.
LAMMPS* increased performance¹ with the 2S Intel® Xeon® Gold 6148
26
Simulia Abaqus Standard
https://www.3ds.com/
Application / Workload Description:
• Abaqus Standard gives manufacturers an effective way to analyze static and low-speed dynamic events where precise stress solutions are vital. A single simulation can analyze a model in both the time and frequency domains. Examples include sealing pressure in a gasket joint, steady-state rolling of a tire, or crack propagation in a composite airplane fuselage.
Key Takeaway:
• Faster product design time
• Ability to solve more complex models on the same hardware footprint
Performance Factors:
• Increased core count, higher frequencies and greater memory bandwidth of the Intel® Xeon® Gold 6148 processor were key to the performance gain.
• Intel® AVX-512 provides a 25% gain compared to Intel® AVX
Up to YY% faster
1. Testing conducted on Simulia* software comparing Intel® Xeon® Gold 6148 processor to 2S Intel® Xeon® Processor E5-2698 v3 and to 2S Intel® Xeon® Processor E5-2697 v4. Testing done by Intel. For complete testing configuration details, see slide 33.
[Chart: Normalized Performance. Callouts: up to 2.4X faster, up to 23% faster, up to 1.80x faster.]
Increased Simulia Abaqus Standard performance with the Intel® Xeon® Gold 6148 processor
Workload: s2b flywheel with centrifugal load
27
Delivering Performance Beyond Benchmarks
Categories: Cloud | AI & Analytics | Network
• 1.5X cloud monitoring [4] (ACLOME)
• 1.72X video stitching [5] (CLOUD)
• 1.74X click-through-rate [1] (SEARCH)
• 1.62X enterprise cloud applications [2] (FUSHIONSPHERE)
• 1.63X OLTP database [3] (MYSQL CLOUD SERVICE)
• 1.59X database transactions [9] (HANA)
• 1.72X molecular dynamics [8]
• 1.47X in-memory analytics [6] (DB2)
• 1.68X enterprise risk management [7] (ANALYTICS RISK ENGINE)
• 2X business analytics [10]
• 1.5X video transcoding [13] (MEDIAFIRST)
• 2.21X business support system [11] (VERIS)
• 1.67X routing [15] (VIRTUAL BNG)
• 1.9X HEVC video encoding [12] (EBLIVE)
• 1.64X packet inspection [14] (VIRTUAL SERIES)
28
1.65x Average1 Generational Gains on 2-Socket Servers
with Intel® Xeon® Scalable Processor
[Chart: Relative 2S Performance, higher is better. 1-node 2x Intel® Xeon® Scalable processor relative to a 1-node 2x Intel® Xeon® processor E5-26xx v4 ("Broadwell-EP 2S") baseline. Bars listed in extraction order.]
Benchmark | Category | Relative performance
TPC*-E | Brokerage Firm OLTP | 1.33
SPECvirt_sc* 2013 | Infrastructure App virtualization | 1.40
Two-tier SAP SD* (Linux) | Enterprise Sales and Distribution (Linux) | 1.44
SPECint*_rate_base2006 | General Integer App Throughput | 1.53
SPECjbb2015* MultiJVM critical-jOPS | Java* Business Ops Critical jOPS | 1.58
SPECfp*_rate_base2006 | Technical Compute App Throughput | 1.65
STREAM* Triad | Memory Bandwidth | 1.65
HammerDB* | OLTP Database Performance | 1.73
LAMMPS | HPC – Molecular Dynamics | 1.73
DPDK L3 Packet Forwarding | Network L3 Packet Forwarding | 1.77
BlackScholes | FSI – Options Pricing | 1.87
Intel® Distribution for LINPACK | LINPACK Throughput | 2.27
Average 1.65
1 Geomean based on Normalized Generational Performance (estimates based on Intel internal testing and published results of TPC-E, SPECvirt_sc*2013, SAP SD 2-Tier, SPEC*int_rate_base2006, SPEC*fp_rate_base2006, SPECjbb2015* MultiJVM, STREAM* triad, HammerDB, LAMMPS, DPDK L3 Packet Forwarding, Black-Scholes, Intel Distribution for LINPACK.
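The headline average is a geometric mean of the per-benchmark ratios rather than an arithmetic mean. A short sketch that recomputes it from the twelve bar values on this slide (typed in by hand from the chart):

```python
import math

# Relative 2S performance values read from the chart, one per benchmark.
GAINS = [1.33, 1.40, 1.44, 1.53, 1.58, 1.65, 1.65, 1.73, 1.73, 1.77, 1.87, 2.27]


def geomean(values):
    """Geometric mean computed in log space: exp(mean(log(x)))."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


geometric = geomean(GAINS)
arithmetic = sum(GAINS) / len(GAINS)
print(f"geometric mean:  {geometric:.2f}")
print(f"arithmetic mean: {arithmetic:.2f}")
```

The geometric mean is the usual choice for averaging speed-up ratios because a single large outlier, such as the LINPACK bar, pulls it up less than it would pull an arithmetic mean.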
For more information go to http://www.intel.com/performance. Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase. Configurations: see slides 15, 16. *Other names and brands may be claimed as the property of others.
Intel® Xeon® Interconnect Journey
30
[Figure: ring-interconnect die diagrams for earlier Xeon generations. Labels on the first diagram: cores with Cbo and LLC slices on an up/down (UP/DN) ring at 3.6 GHz, 32B/dir = 115.2 GB/s/dir/hop; CSI agent (2x20, 9.6 GT/s, 19.2 GB/s/dir per link); home agent with DDR3 memory controller (4ch, 1600, 8B, 12.8 GB/s per channel); IIO with IOAPIC, DMA, PCI-E x16 (16 GB/s/dir), PCI-E x8 (8 GB/s/dir) and DMI x4; UBox and PCU. Labels on the later diagrams: cores with CBO, cache and SAD logic and 2.5MB LLC slices on UP/DN rings; QPI agent and QPI links, R3QPI, R2PCI, IIO (PCI-E x16, x16, x8, x4 (ESI), IOAPIC, CB DMA), UBox, PCU, and home agents with DDR memory controllers.]
2009: Intel® Xeon® processor X75xx, 45nm
2011: Intel® Xeon® processor E7, 32nm
2012: Intel® Xeon® processor E5, 32nm
2013: Intel® Xeon® processor E7 v2, 22nm
2014: Intel® Xeon® processor E7 v3, 22nm
2016: Intel® Xeon® processor E7 v4, 14nm (Broadwell EX 24-core die)
2017: Intel® Xeon® Scalable Processor, 14nm (Skylake-SP 28-core die)
[Figure: ring-interconnect die diagram with the same block labels as the preceding diagram: cores with CBO and 2.5MB LLC slices on UP/DN rings, QPI agent and QPI links, R3QPI, R2PCI, IIO (PCI-E x16, x16, x8, x4 (ESI), IOAPIC, CB DMA), UBox, PCU, and two home agents with DDR memory controllers.]
[Figure: Skylake-SP 28-core die block diagram (mesh interconnect). 28 SKX cores, each with a CHA/SF/LLC block; other blocks shown: 2x UPI x20, 1x UPI x20, three PCIe* x16, on-package PCIe x16, DMI x4, CBDMA, and two memory controllers (MC) with three DDR4 channels each.]
CHA – Caching and Home Agent; SF – Snoop Filter; LLC – Last Level Cache;
SKX Core – Skylake Server Core; UPI – Intel® UltraPath Interconnect
New Mesh Interconnect Architecture
Mesh Improves Scalability with Higher Bandwidth and Reduced Latencies
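The scalability claim can be illustrated with a toy hop-count model. The sketch below is not Intel's routing model: it assumes a single bidirectional ring with one stop per core and a hypothetical 4x7 grid for the 28 cores, and it ignores the I/O and memory-controller stops that the real dies also contain. It only shows why the average distance between stops grows more slowly on a 2D mesh than on a ring.

```python
from itertools import product


def ring_avg_hops(stops):
    """Average shortest-path hops between distinct stops on a bidirectional ring."""
    total = sum(min(d, stops - d) for d in range(1, stops))
    return total / (stops - 1)


def mesh_avg_hops(rows, cols):
    """Average Manhattan-distance hops between distinct nodes on a rows x cols mesh."""
    nodes = list(product(range(rows), range(cols)))
    total = 0
    pairs = 0
    for r1, c1 in nodes:
        for r2, c2 in nodes:
            if (r1, c1) != (r2, c2):
                total += abs(r1 - r2) + abs(c1 - c2)
                pairs += 1
    return total / pairs


# Hypothetical layouts for illustration only (assumptions, not die floorplans).
ring_28 = ring_avg_hops(28)
mesh_4x7 = mesh_avg_hops(4, 7)
print(f"ring, 28 stops: {ring_28:.2f} average hops")
print(f"mesh, 4x7 grid: {mesh_4x7:.2f} average hops")
```

Under these assumptions the mesh roughly halves the average hop count at 28 stops, and the gap widens as the stop count grows.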
Re-Architected L2 & L3 Cache Hierarchy
Previous Architectures: shared L3, 2.5MB/core (inclusive); L2, 256KB private per core
Skylake-SP Architecture: shared L3, 1.375MB/core (non-inclusive); L2, 1MB private per core
On-chip cache balance shifted from shared-distributed (prior architectures) to private-local (Skylake architecture):
  Shared-distributed: the shared, distributed L3 is the primary cache
  Private-local: the private L2 becomes the primary cache, with the shared L3 used as an overflow cache
Shared L3 changed from inclusive to non-inclusive:
  Inclusive (prior architectures): L3 has copies of all lines in L2
  Non-inclusive (Skylake architecture): lines in L2 may not exist in L3
Skylake-SP cache hierarchy architected specifically for the data center use case
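The inclusive versus non-inclusive distinction changes how much distinct data the L2 and L3 can hold per core. The sketch below works through that arithmetic with the per-core sizes quoted on this slide. The function name is hypothetical, and the non-inclusive figure is an upper bound, since a non-inclusive L3 may still hold some lines that are also in L2.

```python
def max_unique_cache_mb(l2_mb, l3_mb_per_core, inclusive):
    """Upper bound on distinct cached data per core across L2 and its L3 share.

    Inclusive L3: every L2 line is duplicated in L3, so L2 adds no capacity.
    Non-inclusive L3: L2 lines need not be in L3, so capacities can add up.
    """
    if inclusive:
        return l3_mb_per_core
    return l2_mb + l3_mb_per_core


# Per-core sizes taken from the slide.
prior = max_unique_cache_mb(l2_mb=0.25, l3_mb_per_core=2.5, inclusive=True)
skylake = max_unique_cache_mb(l2_mb=1.0, l3_mb_per_core=1.375, inclusive=False)

# Total shared L3 on the 28-core Skylake-SP die at 1.375MB per core.
skylake_l3_total_mb = 28 * 1.375

print(f"prior architectures: up to {prior} MB distinct data per core")
print(f"Skylake-SP:          up to {skylake} MB distinct data per core")
print(f"Skylake-SP total L3 (28 cores): {skylake_l3_total_mb} MB")
```

Per-core capacity stays roughly level, but four times as much of it sits in the private L2, which is the shift from shared-distributed to private-local that the slide describes.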
Cache Performance
Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC, and platform with E5-2699 v4, Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0. Copyright © 2017 Intel Corporation.
Skylake-SP cache hierarchy significantly reduces L2 misses without increasing L3 misses compared to Broadwell-EP
[Chart: Relative Change in L2 and L3 Misses Per Instruction for SPECint*_rate 2006 from Broadwell-EP to Skylake-SP. Series: Relative L2 MPI, Relative L3 MPI, shown against the BDW-EP baseline. Vertical axis from 0.0 to 1.2; lower is better.]
Memory Subsystem
* Memory bandwidth improvements are based on Intel internal measurements of local memory bandwidth for read-only traffic using Intel Memory Latency Checker tool.
[Figure: Skylake-SP die block diagram highlighting I/O and memory. 2x UPI x20 @ 10.4GT/s; three PCIe* blocks, each 1x16/2x8/4x4 @ 8GT/s; x4 DMI; CBDMA; CHA/SF/LLC and core tiles; two memory controllers (MC), each with 3x DDR4-2666 channels.]
2 memory controllers with 3 channels each, for a total of 6 memory channels
DDR4 up to 2666, 2 DIMMs per channel
Support for RDIMM, LRDIMM, and 3DS-LRDIMM
1.5TB Max Memory Capacity per Socket (2 DPC with 128GB DIMMs)
>60% increase* in memory bandwidth per socket compared to Intel® Xeon® processor E5 v4
Uniform and consistent access to local memory from any core
Several optimizations to lower latency and use bandwidth effectively
New memory failure detection and recovery with Adaptive Double Device Data Correction (ADDDC)
Significant memory bandwidth and capacity improvements over prior generation
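The capacity and bandwidth figures above can be cross-checked with simple arithmetic. The sketch below computes theoretical peak numbers only, not the measured Memory Latency Checker results the footnote refers to. It assumes the standard 64-bit (8-byte) DDR4 data bus per channel and a 4-channel DDR4-2400 configuration for the E5 v4 comparison; both assumptions and the function names are mine, not the slide's.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


def max_capacity_gb(controllers, channels_per_controller, dimms_per_channel, dimm_gb):
    """Maximum memory capacity per socket in GB."""
    return controllers * channels_per_controller * dimms_per_channel * dimm_gb


# Skylake-SP: 2 controllers x 3 channels, DDR4-2666 (from the slide).
skx_bw = peak_bandwidth_gbs(channels=6, mega_transfers_per_s=2666)
# Xeon E5 v4: 4 channels of DDR4-2400 (assumption for the comparison).
bdw_bw = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=2400)
bw_gain = skx_bw / bdw_bw - 1

# 2 DIMMs per channel with 128GB DIMMs (from the slide).
skx_capacity_gb = max_capacity_gb(2, 3, 2, 128)

print(f"Skylake-SP peak: {skx_bw:.1f} GB/s, E5 v4 peak: {bdw_bw:.1f} GB/s")
print(f"theoretical gain: {bw_gain:.0%}")
print(f"max capacity: {skx_capacity_gb} GB = {skx_capacity_gb / 1024} TB")
```

The theoretical gain from the extra channels and higher data rate works out to about two thirds, which is consistent with the ">60%" measured claim, and 12 DIMMs of 128GB give the 1.5TB per-socket maximum.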
Memory Performance
Bandwidth-Latency Profile
Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1/SNC2, 6x32GB DDR4-2400/2666 per CPU, 1 DPC, and platform with E5-2699 v4, Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0.
New Intel® Ultra Path Interconnect (Intel® UPI)
Intel® Ultra Path Interconnect (Intel® UPI), replacing Intel® QPI
Faster link with improved bandwidth for a balanced system design
Improved messaging efficiency per packet
3-UPI option for 2-socket systems – additional bandwidth for non-NUMA, high-bandwidth use cases
Intel® UPI enables system scalability with higher inter-socket bandwidth
[Figure: QPI vs. UPI comparison. Data rate: QPI 9.6 GT/s, UPI 10.4 GT/s. Data efficiency (per wire): 4% to 21%. Idle power relative to L0: 75% in L0p for QPI, 50% in L0p for UPI.]
Source as of June 2017: Intel internal estimates on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, 6x32GB DDR4-2666, 1 DPC, and platform with E5-2699 v4, Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0.
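The data-rate comparison on this slide translates into per-link bandwidth as follows. The sketch assumes each transfer on a full-width x20 link carries 2 bytes of payload per direction, which is the usual rule of thumb for QPI and UPI but is not stated in the deck. The function name and the 2-link and 3-link totals are illustrative.

```python
def link_bandwidth_gbs(giga_transfers_per_s, payload_bytes_per_transfer=2):
    """Per-direction bandwidth of one link in GB/s.

    payload_bytes_per_transfer=2 is an assumption (16 data bits per transfer).
    """
    return giga_transfers_per_s * payload_bytes_per_transfer


qpi_gbs = link_bandwidth_gbs(9.6)    # QPI data rate from the slide
upi_gbs = link_bandwidth_gbs(10.4)   # UPI data rate from the slide
raw_gain = 10.4 / 9.6 - 1            # raw data-rate increase only

# Aggregate per-direction bandwidth between two sockets with 2 or 3 UPI links.
two_link_gbs = 2 * upi_gbs
three_link_gbs = 3 * upi_gbs

print(f"QPI: {qpi_gbs:.1f} GB/s, UPI: {upi_gbs:.1f} GB/s per link per direction")
print(f"raw data-rate gain: {raw_gain:.1%}")
print(f"2 links: {two_link_gbs:.1f} GB/s, 3 links: {three_link_gbs:.1f} GB/s")
```

The raw data-rate gain is only about 8%, so most of the claimed improvement has to come from the better per-packet messaging efficiency and, in 2-socket systems, the optional third link.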
Intel® Omni-Path Architecture
Evolutionary Approach, Revolutionary Features, End-to-End Solution
Building on the industry’s best technologies:
  Highly leverages existing Aries and Intel® True Scale fabric
  Adds innovative new features and capabilities to improve performance, reliability, and QoS
  Re-use of existing OpenFabrics Alliance* software
Robust product offerings and ecosystem:
  End-to-end Intel product line
  >100 OEM designs¹
  Strong ecosystem with 80+ Fabric Builders members
Software: open source host software and Fabric Manager
1 Source: Intel internal information. Design win count based on OEM and HPC storage vendors who are planning to offer either Intel-branded or custom switch products, along with the total number of OEM platforms that are currently planned to support custom and/or standard Intel® OPA adapters. Design win count as of November 1, 2015 and subject to change without notice based on vendor product plans. *Other names and brands may be claimed as property of others.
HFI Adapters: single port, x8 and x16
  x8 Adapter (58 Gb/s)
  x16 Adapter (100 Gb/s)
Edge Switches: 1U form factor, 24 and 48 port
  24-port Edge Switch
  48-port Edge Switch
Director Switches: QSFP-based, 192 and 768 port
  192-port Director Switch (7U chassis)
  768-port Director Switch (20U chassis)
Cables: third-party vendors, passive copper and active optical
Silicon: OEM custom designs, HFI and switch ASICs
  Switch silicon: up to 48 ports (1200 GB/s total b/w)
  HFI silicon: up to 2 ports (50 GB/s total b/w)
Maximizes price-performance, freeing up cluster budgets for increased compute and storage capability
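The "total b/w" figures quoted for the switch and HFI silicon appear to follow from the 100 Gb/s port rate if both directions of every port are counted. That reading is an assumption, not something the slide states, and the sketch below only checks that the arithmetic is consistent with it.

```python
def total_bandwidth_gbytes(ports, port_gbits_per_s=100, directions=2):
    """Aggregate bandwidth in GB/s, counting `directions` per port.

    directions=2 (send plus receive) is an assumption about how the slide
    arrives at its totals.
    """
    return ports * port_gbits_per_s * directions / 8


switch_total = total_bandwidth_gbytes(ports=48)  # switch silicon, up to 48 ports
hfi_total = total_bandwidth_gbytes(ports=2)      # HFI silicon, up to 2 ports

print(f"48-port switch silicon: {switch_total:.0f} GB/s")
print(f"2-port HFI silicon:     {hfi_total:.0f} GB/s")
```

Both results match the slide (1200 GB/s and 50 GB/s), which supports the bidirectional reading.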
Integrated Intel® Omni-Path Architecture Platform Benefits - Maximized I/O Density per Node
Up to TWO additional PCIe x16 slots are available for maximizing I/O density¹
[Figure: integrated OPA HFI on SKX-F (connected through an IFT card and IFP cable) compared with a discrete OPA HFI in a PCIe x16 slot on SKX-F or SKX. Example compute node: two Intel Xeon Processor-F sockets with integrated HFIs and GPUs on the x16 slots. Example storage node or file system server: two Intel Xeon Processor-F sockets with integrated HFIs.]
1 For illustrative purposes only. Assumes each CPU socket is configured with all 48 PCIe lanes routed to three x16 slots, or 96 total lanes for a 2S Purley platform. PCIe slot count and PCIe device support will vary by OEM platform, so check with your OEM for more details.
Significantly more I/O capacity for compute or storage nodes¹
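The "up to two additional slots" claim follows from the lane budget in footnote 1. The sketch below assumes, as the footnote does, 48 PCIe lanes per socket routed to x16 slots, and further assumes that a discrete HFI would occupy one x16 slot per socket. Slot counts on real platforms vary by OEM, and the function name is hypothetical.

```python
def free_x16_slots(sockets, lanes_per_socket=48, discrete_hfi_per_socket=0):
    """x16 slots left for other devices after any discrete HFIs are installed."""
    slots_per_socket = lanes_per_socket // 16
    return sockets * (slots_per_socket - discrete_hfi_per_socket)


# 2-socket platform with one discrete x16 HFI per socket (assumption).
with_discrete_hfi = free_x16_slots(sockets=2, discrete_hfi_per_socket=1)
# 2-socket platform with integrated HFIs on -F SKUs: no slot consumed.
with_integrated_hfi = free_x16_slots(sockets=2, discrete_hfi_per_socket=0)
extra_slots = with_integrated_hfi - with_discrete_hfi

print(f"discrete HFI:   {with_discrete_hfi} free x16 slots")
print(f"integrated HFI: {with_integrated_hfi} free x16 slots")
print(f"additional slots: {extra_slots}")
```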
SKUS WITH INTEGRATED INTEL® OMNI-PATH ARCHITECTURE FABRIC
Class     SKU    Cores  Base Non-AVX Speed (GHz)  TDP (W)
Platinum  8176F  28     2.1                       173
Platinum  8160F  24     2.1                       160
Gold      6148F  20     2.4                       160
Gold      6142F  16     2.6                       160
Gold      6138F  20     2.0                       135
Gold      6130F  16     2.1                       135
Gold      6126F  12     2.6                       105
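For readers who want to pick an -F SKU programmatically, the table above can be held as plain records and filtered. This is a minimal sketch: the values are copied from the table, while the record layout and the `select_skus` helper are hypothetical.

```python
from typing import List, NamedTuple


class Sku(NamedTuple):
    sku_class: str
    name: str
    cores: int
    base_ghz: float  # base non-AVX speed
    tdp_w: int


# Values copied from the SKU table on this slide.
F_SKUS = [
    Sku("Platinum", "8176F", 28, 2.1, 173),
    Sku("Platinum", "8160F", 24, 2.1, 160),
    Sku("Gold", "6148F", 20, 2.4, 160),
    Sku("Gold", "6142F", 16, 2.6, 160),
    Sku("Gold", "6138F", 20, 2.0, 135),
    Sku("Gold", "6130F", 16, 2.1, 135),
    Sku("Gold", "6126F", 12, 2.6, 105),
]


def select_skus(min_cores: int, max_tdp_w: int) -> List[str]:
    """Names of SKUs with at least min_cores cores and TDP at or below max_tdp_w."""
    return [s.name for s in F_SKUS if s.cores >= min_cores and s.tdp_w <= max_tdp_w]


# Example: at least 20 cores within a 160 W power budget.
candidates = select_skus(min_cores=20, max_tdp_w=160)
print(candidates)
```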
Skylake-SP Architecture Summary
New Architectural Innovations for Data Center
Up to 60% increase in compute density with Intel® AVX-512
Improved performance and scalability with Mesh on-chip interconnect
L2 and L3 cache hierarchy optimized for data center workloads
Improved memory subsystem with up to 60% higher memory bandwidth
Faster and more efficient Intel® UPI interconnect for improved scalability
Improved integrated IO with up to 50% higher aggregate IO bandwidth
Increased protection against kernel tampering and user data corruption
Core, cache, memory and IO improvements for increased virtual machine performance
Enhanced power management and RAS capability for improved utilization of resources