Intel Korea Sep 2017

Notices and Disclaimers

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Intel, the Intel logo, Intel Optane and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

* Other names and brands may be claimed as the property of others. © 2017 Intel Corporation.


The coming flood of data

By 2020…

• The average internet user will generate ~1.5 GB of traffic per day
• Smart hospitals will be generating over 3,000 GB per day
• Self driving cars will be generating over 4,000 GB per day… each
• A connected plane will be generating over 40,000 GB per day
• A connected factory will be generating over 1,000,000 GB per day

Self driving car data sources:

• radar ~10-100 KB per second
• sonar ~10-100 KB per second
• gps ~50 KB per second
• lidar ~10-70 MB per second
• cameras ~20-40 MB per second

1 car: 5 exaflops per hour

All numbers are approximated. Sources:
http://www.cisco.com/c/en/us/solutions/service-provider/vni-network-traffic-forecast/infographic.html
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
https://datafloq.com/read/self-driving-cars-create-2-petabytes-data-annually/172
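The per-sensor rates on this slide can be added up to see how they relate to the 4,000 GB per day figure. The following is a minimal sketch, assuming decimal units (1 MB = 1000 KB, 1 GB = 1000 MB) and that all sensors stream continuously; the function and variable names are illustrative and not from the slide.

```python
# Sensor data rates quoted on the slide, as (low, high) in MB per second.
# Assumption: decimal units, so 1 MB = 1000 KB and 1 GB = 1000 MB.
SENSOR_RATES_MB_S = {
    "radar": (0.010, 0.100),    # ~10-100 KB per second
    "sonar": (0.010, 0.100),    # ~10-100 KB per second
    "gps": (0.050, 0.050),      # ~50 KB per second
    "lidar": (10.0, 70.0),      # ~10-70 MB per second
    "cameras": (20.0, 40.0),    # ~20-40 MB per second
}


def total_rate_mb_s(rates):
    """Return the (low, high) aggregate data rate in MB per second."""
    low = sum(lo for lo, _ in rates.values())
    high = sum(hi for _, hi in rates.values())
    return low, high


def hours_to_generate(gigabytes, rate_mb_s):
    """Hours of continuous sensing needed to produce `gigabytes` of data."""
    seconds = gigabytes * 1000.0 / rate_mb_s
    return seconds / 3600.0


if __name__ == "__main__":
    low, high = total_rate_mb_s(SENSOR_RATES_MB_S)
    print(f"aggregate rate: {low:.2f} to {high:.2f} MB/s")
    print(f"hours to reach 4,000 GB: {hours_to_generate(4000, high):.1f} "
          f"to {hours_to_generate(4000, low):.1f}")
```

Under these assumptions lidar and cameras dominate the total, and the quoted daily volume corresponds to many hours of continuous sensing at the listed rates.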

The next big wave of computing

Waves of computing: mainframes, standards-based servers, cloud computing, artificial intelligence

Drivers: data deluge, compute breakthrough, innovation surge

AI compute cycles will grow 12X by 2020

Source: Intel

AI is transforming industries

Examples by industry:

Consumer: Smart Assistants, Chatbots, Search, Personalization, Augmented Reality, Robots
Health: Enhanced Diagnostics, Drug Discovery, Patient Care, Research, Sensory Aids
Finance: Algorithmic Trading, Fraud Detection, Research, Personal Finance, Risk Mitigation
Retail: Support, Experience, Marketing, Merchandising, Loyalty, Supply Chain, Security
Government: Defense, Data Insights, Safety & Security, Resident Engagement, Smarter Cities
Energy: Oil & Gas Exploration, Smart Grid, Operational Improvement, Conservation
Transport: Automated Cars, Automated Trucking, Aerospace, Shipping, Search & Rescue
Industrial: Factory Automation, Predictive Maintenance, Precision Agriculture, Field Automation
Other: Advertising, Education, Gaming, Professional & IT Services, Telco/Media, Sports

Diagram label: Early adoption

Source: Intel forecast

A common language for AI Today

[Figure: AI taxonomy diagram. Labels: Artificial intelligence; Machine Learning; Deep learning; Classic ML; Reasoning systems; Memory based; Logic based. Capabilities: sense, reason, act, adapt, remember.]

Intel AI portfolio

[Figure: portfolio stack. Layer labels: experiences, tools, frameworks, libraries, hardware. Items shown: Intel® Deep Learning SDK, Intel® Computer Vision SDK, Movidius Neural Compute Stick, E2E Tool, Mlib, BigDL, Intel® Nervana™ Graph*, Intel Dist, Intel® MKL, MKL-DNN, Intel® MLSL, Intel® DAAL, Movidius MvTensor Library, Associative Memory Base, Lake Crest, Visual Intelligence. Hardware: Compute, Memory & Storage, Networking.]

*Coming 2017

AI Datacenter

All purpose: Intel® Xeon® Processor Family
Most agile AI platform. Scalable performance for widest variety of AI & other datacenter workloads, including deep learning training & inference.

Highly-parallel: Intel® Xeon Phi™ Processor (Knights Mill✝)
Faster DL training. Scalable performance optimized for even faster deep learning training and select highly-parallel datacenter workloads*.

Flexible acceleration: Intel® FPGA
Enhanced DL inference. Scalable acceleration for deep learning inference in real-time with higher efficiency, and wide range of workloads & configurations.

Deep Learning: Crest Family✝
Deep learning by design. Scalable acceleration with best performance for intensive deep learning training & inference.

✝Codename for product that is coming soon.
All performance positioning claims are relative to other processor technologies in Intel’s AI datacenter portfolio.
*Knights Mill (KNM); select = single-precision highly-parallel workloads generally scale to >100 threads and benefit from more vectorization, and may also benefit from greater memory bandwidth e.g. energy (reverse time migration), deep learning training, etc.
All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

[Figure: HPC application domains. Labels: Astrophysics, Manufacturing, Energy, Security, Financial, Life Sciences, Climate, Weather.]

Growing Challenges in HPC

System Bottlenecks: “The Walls”
Memory | I/O | Storage
Energy Efficient Performance
Space | Resiliency | Unoptimized Software

Divergent Workloads
Resources Split Among Modeling and Simulation | Big Data Analytics | Machine Learning | Visualization

Barriers to Extending Usage
Democratization at Every Scale | Cloud Access | Exploration of New Parallel Programming Models

[Figure labels: HPC, Big Data, Machine learning, Visualization, Optimizing for Cloud]

Intel® Scalable System Framework

Many Workloads, one Framework
Modeling & Simulation | HPC Data Analytics | Machine Learning | Visualization

A Flexible Framework for Today & Tomorrow
Enabling Breakthrough System Performance

Intel® Xeon® Processor Roadmap

Intel® Xeon® Processor E5: targeted at a wide variety of applications that value a balanced system with leadership performance/watt/$

Intel® Xeon® Processor E7: targeted at mission critical applications that value a scale-up system with leadership memory capacity and advanced RAS

Timeline (2016, 2017, 2018):
• Grantley-EP Platform: E5 v3, E5-2600 v4; E5 v3, E5-4600 v4 (4S)
• Brickland Platform: E7 v3, E7 v4
• Purley Platform: Skylake, Cascade Lake (Intel® Xeon® Platinum, Gold, Silver, Bronze)

Diagram label: 18 cores

Converged platform with innovative Skylake-SP microarchitecture

Intel® Xeon® Scalable Platform Feature Overview

[Block diagram: two Skylake-SP CPUs connected by 2 or 3 Intel® UPI links. Per CPU: 3x16 PCIe* Gen3, DDR4 2666, 1x 100Gb OPA Fabric. Lewisburg PCH (attached over DMI): 4x10GbE NIC, Intel® QAT, ME, IE, high speed IO (USB3, PCIe3, SATA3). Other labels: BMC, TPM, firmware, GPIO, eSPI/LPC, SPI, 10GbE, CPU VRs, OPA VRs, Mem VRs.]

Abbreviations:
BMC: Baseboard Management Controller
PCH: Intel® Platform Controller Hub
IE: Innovation Engine
Intel® OPA: Intel® Omni-Path Architecture
Intel QAT: Intel® QuickAssist Technology
ME: Manageability Engine
NIC: Network Interface Controller
VMD: Volume Management Device
NTB: Non-Transparent Bridge

Feature: Details
Socket: Socket P
Scalability: 2S, 4S, 8S, and >8S (with node controller support)
CPU TDP: 70W - 205W
Chipset: Intel® C620 Series (code name Lewisburg)
Networking: Intel® Omni-Path Fabric (integrated or discrete); 4x10GbE (integrated w/ chipset); 100G/40G/25G discrete options
Compression and Crypto Acceleration: Intel® QuickAssist Technology to support 100Gb/s comp/decomp/crypto; 100K RSA2K public key
Storage: Integrated QuickData Technology, VMD, and NTB; Intel® Optane™ SSD, Intel® 3D-NAND NVMe & SATA SSD
Security: CPU instruction enhancements (MBE, PPK, MPX); Manageability Engine; Intel® Platform Trust Technology; Intel® Key Protection Technology
Manageability: Innovation Engine (IE); Intel® Node Manager; Intel® Datacenter Manager

Platform Topologies

[Figure: socket topology diagrams built from SKL (Skylake CPU) and LBG (Lewisburg PCH) blocks.
• 8S Configuration
• 4S Configurations (4S-2UPI & 4S-3UPI shown)
• 2S Configurations (2S-2UPI & 2S-3UPI shown)
Link labels: Intel® UPI between sockets; DMI x4** between CPU and LBG; 3x16 PCIe* per socket; 1x100G Intel® OP Fabric.]

Intel® Xeon® Scalable Processor supports configurations ranging from 2S-2UPI to 8S

Intel® Xeon® Scalable Processors: Re-architected from the ground up

• Skylake core microarchitecture with data center specific enhancements
• Intel® AVX-512 with 32 DP flops per core
• Data center optimized cache hierarchy: 1MB L2 per core, non-inclusive L3
• New mesh interconnect architecture
• Enhanced memory subsystem
• Modular IO with integrated devices
• New Intel® Ultra Path Interconnect (Intel® UPI)
• Intel® Speed Shift Technology
• Security & Virtualization enhancements (MBE, PPK, MPX)
• Optional Integrated Intel® Omni-Path Fabric (Intel® OPA)

Features                                  Intel® Xeon® Processor E5-2600 v4           Intel® Xeon® Scalable Processor
Cores Per Socket                          Up to 22                                    Up to 28
Threads Per Socket                        Up to 44 threads                            Up to 56 threads
Last-level Cache (LLC)                    Up to 55 MB                                 Up to 38.5 MB (non-inclusive)
QPI/UPI Speed (GT/s)                      2x QPI channels @ 9.6 GT/s                  Up to 3x UPI @ 10.4 GT/s
PCIe* Lanes / Controllers / Speed (GT/s)  40 / 10 / PCIe* 3.0 (2.5, 5, 8 GT/s)        48 / 12 / PCIe 3.0 (2.5, 5, 8 GT/s)
Memory Population                         4 channels of up to 3 RDIMMs,               6 channels of up to 2 RDIMMs,
                                          LRDIMMs, or 3DS LRDIMMs                     LRDIMMs, or 3DS LRDIMMs
Max Memory Speed                          Up to 2400                                  Up to 2666
TDP (W)                                   55W-145W                                    70W-205W
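The memory rows of the comparison translate into a theoretical peak bandwidth per socket. The sketch below assumes the "Max Memory Speed" values are DDR4 transfer rates in MT/s and that each channel moves 8 bytes per transfer (a 64-bit data path); the function name is illustrative.

```python
# Assumption: one DDR4 channel has a 64-bit data path, i.e. 8 bytes per transfer.
BYTES_PER_TRANSFER = 8


def peak_memory_bandwidth_gb_s(channels, mega_transfers_per_s):
    """Theoretical peak bandwidth per socket in GB/s (decimal units)."""
    return channels * mega_transfers_per_s * BYTES_PER_TRANSFER / 1000.0


# Channel counts and speeds taken from the comparison table.
e5_2600_v4 = peak_memory_bandwidth_gb_s(channels=4, mega_transfers_per_s=2400)
scalable = peak_memory_bandwidth_gb_s(channels=6, mega_transfers_per_s=2666)

if __name__ == "__main__":
    print(f"E5-2600 v4: {e5_2600_v4:.1f} GB/s")
    print(f"Scalable:   {scalable:.1f} GB/s")
    print(f"ratio:      {scalable / e5_2600_v4:.2f}x")
```

These are upper bounds at the maximum supported speed; sustained bandwidth depends on DIMM population and access pattern.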

[Figure: processor block diagram. Cores with shared L3; 2 or 3 UPI links; 6 channels DDR4; 48 lanes PCIe* 3.0; DMI3; Omni-Path HFI (Omni-Path).]

Core Microarchitecture Enhancements

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

                                   Broadwell uArch     Skylake uArch
Out-of-order Window                192                 224
In-flight Loads + Stores           72 + 42             72 + 56
Scheduler Entries                  60                  97
Registers, Integer + FP            168 + 168           180 + 168
Allocation Queue                   56                  64/thread
L1D BW (B/Cyc), Load + Store       64 + 32             128 + 64
L2 Unified TLB                     4K+2M: 1024         4K+2M: 1536, 1G: 16

[Figure: Skylake core block diagram.
Front End: 32KB L1 I$, pre-decode, instruction queue, decoders, branch prediction unit, μop cache, μop queue.
In order: Allocate/Rename/Retire. OOO: scheduler, load buffer, store buffer, reorder buffer.
Execution (INT and VEC): ports 0, 1, 5 and 6 with ALU, Shift, LEA, MUL, DIV, Shuffle, JMP and FMA units.
Memory: port 2 (Load/STA), port 3 (Load/STA), port 4 (Store Data), port 7 (STA); memory control, fill buffers, 32KB L1 D$, 1MB L2$.]

About 10% performance improvement per core on integer applications at same frequency

• Larger and improved branch predictor, higher throughput decoder, larger window to extract ILP
• Improved scheduler and execution engine, improved throughput and latency of divide/sqrt
• More load/store bandwidth, deeper load/store buffers, improved prefetcher
• Data center specific enhancements: Intel® AVX-512 with 2 FMAs per core, larger 1MB MLC

Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

• 512-bit wide vectors

• 32 operand registers

• 8 64b mask registers

• Embedded broadcast

• Embedded rounding

Microarchitecture       Instruction Set           SP FLOPs / cycle    DP FLOPs / cycle
Skylake                 Intel® AVX-512 & FMA      64                  32
Haswell / Broadwell     Intel AVX2 & FMA          32                  16
Sandybridge             Intel AVX (256b)          16                  8
Nehalem                 SSE (128b)                8                   4
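The FLOPs-per-cycle figures in the table follow from vector width, element width, and the number of floating-point units. The sketch below is one way to reproduce them; the unit counts are assumptions (two FMA units for Skylake and Haswell/Broadwell, one add unit plus one multiply unit for Sandybridge and Nehalem), and the names are illustrative.

```python
def flops_per_cycle(vector_bits, element_bits, fp_units, fused_multiply_add):
    """Peak floating-point operations per core per cycle.

    An FMA unit counts as two operations (multiply and add) per lane;
    a plain add or multiply unit counts as one.
    """
    lanes = vector_bits // element_bits
    ops_per_unit = 2 if fused_multiply_add else 1
    return lanes * fp_units * ops_per_unit


# (vector width in bits, FP units per core, FMA supported).
# The unit counts are assumptions chosen to match the table.
MICROARCHITECTURES = {
    "Skylake": (512, 2, True),
    "Haswell / Broadwell": (256, 2, True),
    "Sandybridge": (256, 2, False),
    "Nehalem": (128, 2, False),
}

if __name__ == "__main__":
    for name, (bits, units, fma) in MICROARCHITECTURES.items():
        sp = flops_per_cycle(bits, 32, units, fma)
        dp = flops_per_cycle(bits, 64, units, fma)
        print(f"{name}: SP {sp}, DP {dp}")
```

Multiplying the per-cycle figure by core count and sustained frequency gives a theoretical peak; the frequency itself varies with the instruction set in use.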

Intel AVX-512 Instruction Types

AVX-512-F     AVX-512 Foundation Instructions
AVX-512-VL    Vector Length Orthogonality: ability to operate on sub-512 vector sizes
AVX-512-BW    512-bit Byte/Word support
AVX-512-DQ    Additional D/Q/SP/DP instructions (converts, transcendental support, etc.)
AVX-512-CD    Conflict Detect: used in vectorizing loops with potential address conflicts

Powerful instruction set for data-parallel computation

Performance and Efficiency with Intel® AVX-512

Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

LINPACK Performance

Instruction set    GFLOPs    Power (W)    Frequency (GHz)    GFLOPs / Watt (normalized to SSE4.2)    GFLOPs / GHz (normalized to SSE4.2)
SSE4.2             669       760          3.1                1.00                                    1.00
AVX                1178      768          2.8                1.74                                    1.95
AVX2               2034      791          2.5                2.92                                    3.77
AVX512             3259      767          2.1                4.83                                    7.19

Intel® AVX-512 delivers significant performance and efficiency gains
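The normalized efficiency figures can be recomputed from the raw LINPACK measurements on this slide. A minimal sketch, using the slide's GFLOPs, system power, and core frequency values; the helper name is illustrative.

```python
# Raw values from the LINPACK chart: (GFLOPs, system power in W, core frequency in GHz).
LINPACK = {
    "SSE4.2": (669, 760, 3.1),
    "AVX": (1178, 768, 2.8),
    "AVX2": (2034, 791, 2.5),
    "AVX512": (3259, 767, 2.1),
}


def normalized(metric_index, baseline="SSE4.2"):
    """GFLOPs divided by the chosen metric, relative to the baseline ISA.

    metric_index 1 selects power (GFLOPs/Watt), 2 selects frequency (GFLOPs/GHz).
    """
    base = LINPACK[baseline][0] / LINPACK[baseline][metric_index]
    return {
        isa: (values[0] / values[metric_index]) / base
        for isa, values in LINPACK.items()
    }


gflops_per_watt = normalized(1)
gflops_per_ghz = normalized(2)

if __name__ == "__main__":
    for isa in LINPACK:
        print(f"{isa}: {gflops_per_watt[isa]:.2f} GFLOPs/Watt, "
              f"{gflops_per_ghz[isa]:.2f} GFLOPs/GHz (normalized)")
```

The wider vectors run at a lower core frequency yet deliver more GFLOPs at similar system power, which is why both normalized ratios rise with each instruction set.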


1.63x Average Gains on High Performance Compute Apps

[Chart: performance relative to a Broadwell (E5-2697 v4) baseline of 1.00, y-axis from 0 to 3.00. Workloads: WRF*, HOMME*, LSTC LS-DYNA Explicit*, INTES PERMAS* V16, MILC*, GROMACS*, VASP*, NAMD*, LAMMPS*, Amber GB*, Binomial option pricing, Black-Scholes, Monte Carlo European options. Bar values in extraction order: 1.42, 1.87, 2.38, 1.56, 1.58, 1.67, 1.73, 1.75, 1.41, 1.44, 1.52, 1.41, 1.68.]

Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Geomean of Weather Research Forecasting - Conus 12Km, HOMME, LSTC LS-DYNA Explicit, INTES PERMAS V16, MILC, GROMACS water 1.5M_pme, VASP Si256, NAMD stmv, LAMMPS, Amber GB Nucleosome, Binomial option pricing, Black-Scholes, Monte Carlo European options. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter.

Configurations: see page 54,55
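The headline average on this slide is a geometric mean of the per-workload speedups. A minimal sketch of that calculation, using the thirteen bar values that appear in the chart; the order of the values does not affect the result, and the function name is illustrative.

```python
import math

# The thirteen relative-performance values shown in the chart
# (Broadwell E5-2697 v4 baseline = 1.00).
SPEEDUPS = [1.42, 1.87, 2.38, 1.56, 1.58, 1.67, 1.73, 1.75,
            1.41, 1.44, 1.52, 1.41, 1.68]


def geometric_mean(values):
    """Geometric mean computed as exp(mean(log(x)))."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    print(f"geomean of {len(SPEEDUPS)} workloads: {geometric_mean(SPEEDUPS):.2f}x")
```

A geometric mean is the usual choice for averaging ratios because a single large speedup cannot dominate it the way it would an arithmetic mean.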

[Chart groups: Earth Systems Models, Manufacturing, Life Sciences, FSI. Legend: Intel® Xeon® Gold 6148 processor. Higher is better.]

Benefits of Intel® Xeon® processor Scalable family
• Intel® AVX-512 and higher IPC/per-core performance
• Balanced IO and memory
• Predictable latency with mesh/memory architecture improvements

Additional options to scale performance
• Intel® Optane™ SSD DC P4800X Series for fast storage
• Intel® Omni-Path Fabric to scale cluster performance

LSTC LS-DYNA Explicit

Application / Workload Description:
• LS-DYNA is a popular crash simulation application. It is used by the automobile, aerospace, construction, military, manufacturing, and bioengineering industries worldwide.

Key Takeaway:
• All major automakers and aerospace customers can benefit from the increased performance
• Faster simulation turnover
• Influencing customers to migrate to Intel® AVX-512

Performance Factors:
• More cores and threads, 50% more memory bandwidth and an improved cache hierarchy.
• Additional performance improvement with Intel® AVX-512

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. *Other names and brands may be claimed as the property of others. 1. Testing conducted on ISV* software comparing Intel® Xeon® Gold 6148 processor to 2S Intel® Xeon® Processor E5-2697 v3 and to 2S Intel® Xeon® Processor E5-2697 v4. Testing done by Intel. For complete testing configuration details, see slide 29.

LS-DYNA explicit increased performance by up to 1.79X with the Intel® Xeon® Gold 8164 processor

[Chart: normalized performance. Workload: 2M elements Car2car model with 120ms simulation time. Callouts: up to 2.4X faster, up to 23% faster, up to 25% faster.]

www.lstc.com

Monte Carlo European Options

Application / Workload Description:
• Monte Carlo is a numerical method that uses statistical sampling techniques to approximate solutions to quantitative problems. In finance, Monte Carlo algorithms are used to evaluate complex instruments, portfolios, and investments. This is a compute-bound, double-precision workload.

Key Takeaway:
• Higher performance allows either doing the same work faster, leading to improved TCO, or simulating more paths, leading to higher confidence in results.

Performance Factors:
• Using Intel® AVX-512 SIMD vectorization improved performance by 1.85X over Intel® AVX2.
• Higher core counts of the Intel® Xeon® Gold 6148 processor contribute to higher performance.
• Better memory hierarchy adds to the performance.
• Code modernization strategy: parallelize the outer loop over options and vectorize the inner loop over paths.
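The loop structure named in the code modernization bullet can be illustrated with a small pricer. This is a minimal sketch in plain Python, not the benchmark code: it assumes European call options under geometric Brownian motion, and all function names and parameter values are hypothetical. Pure Python neither threads nor vectorizes these loops; the comments only mark where an optimized implementation would apply each technique. The Black-Scholes closed form is included purely as a sanity check on the estimate.

```python
import math
import random


def black_scholes_call(spot, strike, rate, vol, maturity):
    """Closed-form European call price, used here only as a reference value."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * maturity) / (
        vol * math.sqrt(maturity))
    d2 = d1 - vol * math.sqrt(maturity)

    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return spot * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)


def monte_carlo_call(spot, strike, rate, vol, maturity, n_paths, rng):
    """Estimate a European call price by sampling terminal prices."""
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    payoff_sum = 0.0
    # Inner loop over paths: independent iterations, the vectorization target.
    for _ in range(n_paths):
        terminal = spot * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        payoff_sum += max(terminal - strike, 0.0)
    return math.exp(-rate * maturity) * payoff_sum / n_paths


def price_options(options, n_paths, seed=0):
    """Price each option independently with its own random stream."""
    prices = []
    # Outer loop over options: independent work items, the parallelization target.
    for index, option in enumerate(options):
        rng = random.Random(seed + index)
        prices.append(monte_carlo_call(*option, n_paths, rng))
    return prices


if __name__ == "__main__":
    # Hypothetical options: (spot, strike, rate, volatility, maturity in years).
    options = [(100.0, 100.0, 0.05, 0.2, 1.0), (100.0, 110.0, 0.05, 0.2, 1.0)]
    for option, estimate in zip(options, price_options(options, 50_000)):
        print(f"MC {estimate:.3f} vs closed form {black_scholes_call(*option):.3f}")
```

Giving each option its own seeded generator keeps the outer iterations independent, which is what makes them safe to distribute across cores.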

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. *Other names and brands may be claimed as the property of others. 1 - Testing conducted on Black-Scholes software comparing 2S Intel® Xeon® Processor E5-2697v3 to 2S Intel® Xeon® Processor E5-2697v4 and 2S Intel® Xeon® Gold 6148 processor. Testing done by Intel. For complete testing configuration details, see slide 31.

Increased Monte Carlo European Option performance with the 2S Intel® Xeon® Gold 6148 processor

[Chart: normalized performance, baseline 2S Intel® Xeon® processor E5-2697 v3. Performance metric: speed-up using options/sec. Callouts: up to 1.72X faster, up to 1.3X faster, up to 3.1X faster.]

LAMMPS

Application / Workload Description:
• LAMMPS is a classical molecular dynamics code, and an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator. It is used to simulate the movement of atoms to develop better therapeutics, improve alternative energy devices, develop new materials, and more.

Key Takeaway:
• The improved performance allows for longer time scales, larger simulations, and/or improved sampling and statistics.
• The continued advances in molecular dynamics performance on Intel® architecture allow computational scientists to solve new and more complex problems.

Performance Factors:
• Intel® AVX-512: up to 49% gain¹ versus Intel® AVX2

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. *Other names and brands may be claimed as the property of others.

Up to 1.72X faster

Testing conducted on LAMMPS* code comparing 2S Intel® Xeon® Gold 6148 processor to 2S Intel® Xeon® Processor E5-2697 v3 and to 2S Intel® Xeon® Processor E5-2697 v4. Reported Intel® AVX-512 gains are compared to running an Intel® AVX2 binary using all cores on the same platform. Reported increased number of cores gains are compared to running reduced number of cores on the same platform. Testing done by Intel. For complete testing configuration details, see slide 32.

LAMMPS* increased performance¹ with the 2S Intel® Xeon® Gold 6148

[Chart: normalized performance. Workload: LAMMPS CG Water Simulation. Callouts: up to 39% faster, up to 2.4X faster.]

lammps.sandia.gov

Simulia Abaqus Standard

https://www.3ds.com/

Application / Workload Description:
• Abaqus Standard gives manufacturers an effective way to analyze static and low-speed dynamic events where precise stress solutions are vital. A single simulation can analyze a model in both the time and frequency domains. Examples include sealing pressure in a gasket joint, steady-state rolling of a tire, or crack propagation in a composite airplane fuselage.

Key Takeaway:
• Faster product design time
• Ability to solve more complex models on the same hardware footprint

Performance Factors:
• Increased core count, higher frequencies and greater memory bandwidth of the Intel® Xeon® Gold 6148 processor were key to the performance gain.
• Intel® AVX-512 provides a 25% gain compared to Intel® AVX

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. *Other names and brands may be claimed as the property of others. 1. Testing conducted on Simulia* software comparing Intel® Xeon® Gold 6148 processor to 2S Intel® Xeon® Processor E5-2698 v3 and to 2S Intel® Xeon® Processor E5-2697 v4 Testing done by Intel. For complete testing configuration details, see slide 33.

Increased Simulia Abaqus Standard performance with the Intel® Xeon® Gold 6148 processor

[Chart: normalized performance. Workload: s2b flywheel with centrifugal load. Callouts: up to 2.4X faster, up to 23% faster, up to 1.80x faster.]

Delivering Performance Beyond Benchmarks

Cloud
• 1.5X cloud monitoring⁴ (ACLOME)
• 1.72X video stitching⁵ (CLOUD)
• 1.74X click-through-rate¹ (SEARCH)
• 1.62X enterprise cloud applications² (FUSIONSPHERE)
• 1.63X OLTP database³ (MYSQL CLOUD SERVICE)

AI & Analytics
• 1.59X database transactions⁹ (HANA)
• 1.72X molecular dynamics⁸
• 1.47X in-memory analytics⁶ (DB2)
• 1.68X enterprise risk management⁷ (ANALYTICS RISK ENGINE)
• 2X business analytics¹⁰

Network
• 1.5X video transcoding¹³ (MEDIAFIRST)
• 2.21X business support system¹¹ (VERIS)
• 1.67X routing¹⁵ (VIRTUAL BNG)
• 1.9X HEVC video encoding¹² (EBLIVE)
• 1.64X packet inspection¹⁴ (VIRTUAL SERIES)

1.65x Average¹ Generational Gains on 2-Socket Servers with Intel® Xeon® Scalable Processor

Relative 2S performance, higher is better. Baseline (1.00): 1-node 2x Intel® Xeon® processor E5-26xx v4 ("Broadwell-EP 2S"). Compared system: 1-node 2x Intel® Xeon® Scalable processor.

Benchmark                               Workload type                                   Relative performance
TPC*-E                                  Brokerage Firm OLTP                             1.33
SPECvirt_sc* 2013                       Infrastructure App virtualization               1.40
Two-tier SAP SD* (Linux)                Enterprise Sales and Distribution (Linux)       1.44
SPECint*_rate_base2006                  General Integer App Throughput                  1.53
SPECjbb2015* MultiJVM critical-jOPS     Java* Business Ops Critical jOPS                1.58
SPECfp*_rate_base2006                   Technical Compute App Throughput                1.65
STREAM* Triad                           Memory Bandwidth                                1.65
HammerDB*                               OLTP Database Performance                       1.73
LAMMPS                                  HPC: Molecular Dynamics                         1.73
DPDK L3 Packet Forwarding               Network L3 Packet Forwarding                    1.77
Black-Scholes                           FSI: Options Pricing                            1.87
Intel® Distribution for LINPACK         LINPACK Throughput                              2.27

Average: 1.65

1 Geomean based on Normalized Generational Performance (estimates based on Intel internal testing and published results of TPC-E, SPECvirt_sc*2013, SAP SD 2-Tier, SPEC*int_rate_base2006, SPEC*fp_rate_base2006, SPECjbb2015* MultiJVM, STREAM* triad, HammerDB, LAMMPS, DPDK L3 Packet Forwarding, Black-Scholes, Intel Distribution for LINPACK).

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase. Configurations: see slides 15, 16. *Other names and brands may be claimed as the property of others.


Intel® Xeon® Interconnect Journey

[Figure: ring-interconnect die diagrams for earlier Intel® Xeon® generations. Each ring stop pairs a core with a CBo (cache box, with SAD) and an LLC slice (2.5MB per slice in the later dies), with UP and DN ring directions. Other agents on the ring: CSI/QPI agent and links (R3QPI), home agents with DDR memory controllers, IIO (R2PCI, PCI-E x16, x16, x8, x4 DMI/ESI, IOAPIC, CB DMA), UBox, PCU.
Annotated bandwidths: ring at 3.6 GHz, 32B/dir = 115.2 GB/s/dir/hop; CSI 2x20 at 9.6 GT/s, 19.2 GB/s/dir per link; DDR3 memory controller with 4 channels at 1600, 8B wide, 12.8 GB/s per channel; PCI-E x16 at 16 GB/s/dir; PCI-E x8 at 8 GB/s/dir.]
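The bandwidth annotations in the ring diagram are all products of a transfer rate and a data width. The sketch below reproduces three of them; it assumes decimal gigabytes and, for the CSI link, 2 bytes of payload per transfer per direction, a value chosen because it reproduces the 19.2 GB/s/dir figure on the slide. The function name is illustrative.

```python
def link_bandwidth_gb_s(giga_transfers_per_s, bytes_per_transfer):
    """Peak bandwidth in GB/s for one direction of a link."""
    return giga_transfers_per_s * bytes_per_transfer


# Ring: 3.6 GHz clock, 32 bytes per direction per cycle.
ring = link_bandwidth_gb_s(3.6, 32)

# DDR3-1600: 1.6 GT/s, 8 bytes per transfer, 4 channels on the controller.
ddr3_channel = link_bandwidth_gb_s(1.6, 8)
ddr3_total = 4 * ddr3_channel

# CSI link: 9.6 GT/s; 2 bytes per transfer per direction is an assumption
# that matches the 19.2 GB/s/dir annotation.
csi = link_bandwidth_gb_s(9.6, 2)

if __name__ == "__main__":
    print(f"ring:          {ring:.1f} GB/s/dir/hop")
    print(f"DDR3 channel:  {ddr3_channel:.1f} GB/s ({ddr3_total:.1f} GB/s for 4 channels)")
    print(f"CSI link:      {csi:.1f} GB/s/dir")
```

Comparing the results shows how much headroom the on-die ring has over any single memory channel or socket-to-socket link.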

2009: Intel® Xeon® processor X75xx, 45nm
2011: Intel® Xeon® processor E7, 32nm
2012: Intel® Xeon® processor E5, 32nm
2013: Intel® Xeon® processor E7 v2, 22nm
2014: Intel® Xeon® processor E7 v3, 22nm
2016: Intel® Xeon® processor E7 v4, 14nm (Broadwell EX 24-core die)
2017: Intel® Xeon® Scalable Processor, 14nm (Skylake-SP 28-core die)



[Figure: Skylake-SP 28-core die block diagram. A mesh of 28 tiles, each with an SKX core and a CHA/SF/LLC block; two memory controllers (MC), each with three DDR4 channels; I/O along the top edge: 2x UPI x20, 1x UPI x20, three PCIe* x16 blocks, DMI x4 with CBDMA, and an on-package PCIe x16.]

CHA – Caching and Home Agent; SF – Snoop Filter; LLC – Last Level Cache; SKX Core – Skylake Server Core; UPI – Intel® UltraPath Interconnect

New Mesh Interconnect Architecture

Mesh Improves Scalability with Higher Bandwidth and Reduced Latencies
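The scalability claim can be illustrated with a rough hop-count comparison. The sketch below is a simplification and not Intel's model: it treats the prior design as a single bidirectional ring and the mesh as a plain 5x6 grid with shortest-path routing, and it ignores per-hop latency, ring bridging, and traffic patterns. The stop count of 30 and the grid shape are hypothetical values chosen only to make the two topologies the same size.

```python
from itertools import product


def ring_avg_hops(stops: int) -> float:
    """Average shortest-path hops between distinct stops on a bidirectional ring."""
    total = sum(min(d, stops - d) for d in range(1, stops))
    return total / (stops - 1)


def mesh_avg_hops(rows: int, cols: int) -> float:
    """Average Manhattan-distance hops between distinct tiles on a 2D mesh."""
    tiles = list(product(range(rows), range(cols)))
    total = sum(
        abs(r1 - r2) + abs(c1 - c2)
        for (r1, c1) in tiles
        for (r2, c2) in tiles
    )
    # Identical pairs contribute 0 hops, so divide by the number of distinct ordered pairs.
    return total / (len(tiles) * (len(tiles) - 1))


# Hypothetical sizes: 30 stops on a ring versus a 5x6 mesh (assumption, not from the slides).
ring_hops = ring_avg_hops(30)
mesh_hops = mesh_avg_hops(5, 6)

print(f"ring: {ring_hops:.2f} average hops")
print(f"mesh: {mesh_hops:.2f} average hops")
```

Under these assumptions the mesh roughly halves the average hop count, and the gap widens as the number of stops grows, which is the intuition behind the scalability argument.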

Re-Architected L2 & L3 Cache Hierarchy

Previous Architectures: shared L3, 2.5MB/core (inclusive); each core has a 256KB private L2

Skylake-SP Architecture: shared L3, 1.375MB/core (non-inclusive); each core has a 1MB private L2

On-chip cache balance shifted from shared-distributed (prior architectures) to private-local (Skylake architecture):

Shared-distributed: the shared-distributed L3 is the primary cache

Private-local: the private L2 becomes the primary cache, with the shared L3 used as an overflow cache

Shared L3 changed from inclusive to non-inclusive:

Inclusive (prior architectures): L3 has copies of all lines in L2

Non-inclusive (Skylake architecture): lines in L2 may not exist in L3

Skylake-SP cache hierarchy architected specifically for data center use cases
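One way to read the inclusive versus non-inclusive change is in terms of unique data held per core. The sketch below uses only the per-core sizes quoted on this slide; the function name is hypothetical. For the non-inclusive case the result is an upper bound, since a line may still be present in both levels.

```python
def unique_capacity_mb(l2_mb: float, l3_mb_per_core: float, inclusive: bool) -> float:
    """Upper bound on unique cached data per core across L2 and its L3 share.

    Inclusive L3: every L2 line is duplicated in L3, so L2 adds no unique capacity.
    Non-inclusive L3: L2 lines need not be in L3, so the capacities can add up.
    """
    if inclusive:
        return l3_mb_per_core
    return l2_mb + l3_mb_per_core


# Sizes taken from the slide: 256KB L2 + 2.5MB L3 (inclusive) vs 1MB L2 + 1.375MB L3 (non-inclusive).
prior_mb = unique_capacity_mb(0.25, 2.5, inclusive=True)
skylake_mb = unique_capacity_mb(1.0, 1.375, inclusive=False)

print(f"prior architectures: up to {prior_mb} MB unique per core")
print(f"Skylake-SP:          up to {skylake_mb} MB unique per core")
```

Although the L3 share per core shrinks, the bound on unique data per core stays close to the prior figure, while four times as much of it sits in the faster private L2.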

Cache Performance

Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC, and platform with E5-2699 v4, Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance. Copyright © 2017 Intel Corporation.

Skylake-SP cache hierarchy significantly reduces L2 misses without increasing L3 misses compared to Broadwell-EP

[Chart: Relative Change in L2 and L3 Misses Per Instruction for SPECint*_rate 2006 from Broadwell-EP to Skylake-SP. Series: Relative L2 MPI and Relative L3 MPI, plotted against the BDW-EP baseline; vertical axis from 0.0 to 1.2; lower is better.]


Memory Subsystem

* Memory bandwidth improvements are based on Intel internal measurements of local memory bandwidth for read-only traffic using Intel Memory Latency Checker tool.


[Figure: Skylake-SP die block diagram highlighting the memory subsystem. Tiles with a core and a CHA/SF/LLC block; two memory controllers (MC), each with 3x DDR4-2666 channels; I/O along the top edge: 2x UPI x20 @ 10.4GT/s, three PCIe* x16 blocks (1x16/2x8/4x4) @ 8GT/s, x4 DMI, and CBDMA.]

2 memory controllers with 3 channels each, for a total of 6 memory channels

DDR4 at up to 2666 MT/s, 2 DIMMs per channel

Support for RDIMM, LRDIMM, and 3DS-LRDIMM

1.5TB Max Memory Capacity per Socket (2 DPC with 128GB DIMMs)

>60% increase* in memory bandwidth per socket compared to Intel® Xeon® processor E5 v4

Uniform and consistent access to local memory from any core

Several optimizations to lower latency and use bandwidth effectively

New memory failure detection and recovery with Adaptive Double Device Data Correction (ADDDC)

Significant memory bandwidth and capacity improvements over the prior generation
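The capacity and bandwidth figures above can be sanity-checked with simple arithmetic. This sketch computes theoretical peak numbers only; the measured gain quoted on the slide comes from Intel's own tests. It assumes a 64-bit (8-byte) data bus per DDR4 channel and takes the prior-generation configuration (4 channels of DDR4-2400) from the test-configuration notes in this deck. Function names are hypothetical.

```python
BYTES_PER_TRANSFER = 8  # assumption: 64-bit data bus per DDR4 channel


def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int) -> float:
    """Theoretical peak memory bandwidth per socket in GB/s."""
    return channels * mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9


def max_capacity_gb(channels: int, dimms_per_channel: int, dimm_gb: int) -> int:
    """Maximum memory capacity per socket in GB."""
    return channels * dimms_per_channel * dimm_gb


skx_bw = peak_bandwidth_gbs(channels=6, mega_transfers_per_s=2666)
bdw_bw = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=2400)
increase_pct = (skx_bw / bdw_bw - 1) * 100
capacity_gb = max_capacity_gb(channels=6, dimms_per_channel=2, dimm_gb=128)

print(f"Skylake-SP peak: {skx_bw:.1f} GB/s, E5 v4 peak: {bdw_bw:.1f} GB/s")
print(f"theoretical increase: {increase_pct:.0f}%")
print(f"max capacity: {capacity_gb} GB = {capacity_gb / 1024} TB")
```

The theoretical ratio comes out at roughly two thirds more bandwidth, which is consistent with the ">60%" measured claim, and 6 channels x 2 DPC x 128GB gives the 1.5TB per-socket maximum.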

Memory Performance

Bandwidth-Latency Profile

Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1/SNC2, 6x32GB DDR4-2400/2666 per CPU, 1 DPC, and platform with E5-2699 v4, Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance



New Intel® Ultra Path Interconnect (Intel® UPI)

Intel® Ultra Path Interconnect (Intel® UPI), replacing Intel® QPI

Faster link with improved bandwidth for a balanced system design

Improved messaging efficiency per packet

3-UPI option for 2-socket systems – additional bandwidth for non-NUMA high-bandwidth use cases

Intel® UPI enables system scalability with higher inter-socket bandwidth


[Charts comparing QPI and UPI]

Data Rate: 9.6 GT/s (QPI), 10.4 GT/s (UPI)

Data Efficiency: 4% to 21% (per wire)

Idle Power: L0, L0p QPI (75%), L0p UPI (50%)
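The data-rate step from QPI to UPI can be turned into a rough per-link bandwidth figure. The sketch below assumes 2 bytes of data per transfer in each direction, a commonly quoted figure for these links that is not stated on the slide. It covers raw signaling rate only and does not include the separate 4% to 21% data-efficiency gain.

```python
DATA_BYTES_PER_TRANSFER = 2  # assumption: 16 data bits per transfer, per direction


def link_bandwidth_gbs(giga_transfers_per_s: float) -> float:
    """Raw data bandwidth of one link in one direction, in GB/s."""
    return giga_transfers_per_s * DATA_BYTES_PER_TRANSFER


qpi_gbs = link_bandwidth_gbs(9.6)
upi_gbs = link_bandwidth_gbs(10.4)
raw_gain_pct = (10.4 / 9.6 - 1) * 100

print(f"QPI: {qpi_gbs:.1f} GB/s per direction")
print(f"UPI: {upi_gbs:.1f} GB/s per direction")
print(f"raw data-rate gain: {raw_gain_pct:.1f}%")
```

Under that assumption the faster signaling alone accounts for a gain of about 8% per link; the rest of the improvement claimed on the slide comes from packet efficiency and, in 2-socket systems, the optional third link.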

Source as of June 2017: Intel internal estimates on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, 6x32GB DDR4-2666, 1 DPC, and platform with E5-2699 v4, Turbo enabled, 4x32GB DDR4-2400, RHEL 7.0. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.


Intel® Omni-Path Architecture

Evolutionary Approach, Revolutionary Features, End-to-End Solution

Building on the industry’s best technologies:

Highly leverages existing Aries and Intel® True Scale fabric

Adds innovative new features and capabilities to improve performance, reliability, and QoS

Re-use of existing OpenFabrics Alliance* software

Robust product offerings and ecosystem:

End-to-end Intel product line

>100 OEM designs¹

Strong ecosystem with 80+ Fabric Builders members

Software: Open Source Host Software and Fabric Manager

1 Source: Intel internal information. Design win count based on OEM and HPC storage vendors who are planning to offer either Intel-branded or custom switch products, along with the total number of OEM platforms that are currently planned to support custom and/or standard Intel® OPA adapters. Design win count as of November 1, 2015 and subject to change without notice based on vendor product plans. *Other names and brands may be claimed as property of others.


HFI Adapters: single port, x8 and x16. x8 Adapter (58 Gb/s); x16 Adapter (100 Gb/s)

Edge Switches: 1U form factor, 24 and 48 port. 24-port Edge Switch; 48-port Edge Switch

Director Switches: QSFP-based, 192 and 768 port. 192-port Director Switch (7U chassis); 768-port Director Switch (20U chassis)

Cables: third-party vendors. Passive copper; active optical

Silicon: OEM custom designs, HFI and switch ASICs. Switch silicon, up to 48 ports (1200 GB/s total b/w); HFI silicon, up to 2 ports (50 GB/s total b/w)
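The silicon bandwidth totals follow from the port counts if each port runs at the 100 Gb/s rate quoted for the x16 adapter and the total counts both directions. That reading is an inference that happens to reproduce the slide's numbers, not something the slide states; the function name is hypothetical.

```python
PORT_RATE_GBITS = 100  # per-port rate in Gb/s, from the x16 adapter figure above


def total_bandwidth_gbytes(ports: int, bidirectional: bool = True) -> float:
    """Aggregate bandwidth in GB/s across all ports.

    Assumption: 'total b/w' sums both directions of every port.
    """
    directions = 2 if bidirectional else 1
    return ports * PORT_RATE_GBITS * directions / 8  # 8 bits per byte


switch_gbytes = total_bandwidth_gbytes(ports=48)
hfi_gbytes = total_bandwidth_gbytes(ports=2)

print(f"48-port switch silicon: {switch_gbytes:.0f} GB/s")
print(f"2-port HFI silicon:     {hfi_gbytes:.0f} GB/s")
```

With those assumptions the results match the 1200 GB/s and 50 GB/s totals listed for the switch and HFI silicon.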

Maximizes price-performance, freeing up cluster budgets for increased compute and storage capability


Integrated Intel® Omni-Path Architecture Platform Benefits - Maximized I/O Density per Node

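Up to TWO additional PCIe x16 slots are available for maximizing I/O density¹

[Figure: Two ways to attach an Intel® OPA HFI. Left: Intel Xeon Processor-F (SKX-F) with the HFI on package, connected through an IFT card and IFP cable. Right: SKX-F or SKX with an HFI adapter in a PCIe x16 slot. Example nodes: a compute node with HFIs and GPUs; a storage node or file system server with HFIs.]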

1 For illustrative purposes only. Assumes each CPU socket is configured with all 48 PCIe lanes routed to three x16 slots, or 96 total lanes for a 2S Purley platform. PCIe slot count and PCIe device support will vary by OEM platform, so check with your OEM for more details.

Significantly more I/O capacity for compute or storage nodes¹

SKUS WITH INTEGRATED INTEL® OMNI-PATH ARCHITECTURE FABRIC

Class     SKU    Cores  Base Non-AVX Speed (GHz)  TDP (W)
Platinum  8176F  28     2.1                       173
Platinum  8160F  24     2.1                       160
Gold      6148F  20     2.4                       160
Gold      6142F  16     2.6                       160
Gold      6138F  20     2.0                       135
Gold      6130F  16     2.1                       135
Gold      6126F  12     2.6                       105

Skylake-SP Architecture Summary

New Architectural Innovations for Data Center

Up to 60% increase in compute density with Intel® AVX-512

Improved performance and scalability with Mesh on-chip interconnect

L2 and L3 cache hierarchy optimized for data center workloads

Improved memory subsystem with up to 60% higher memory bandwidth

Faster and more efficient Intel® UPI interconnect for improved scalability

Improved integrated IO with up to 50% higher aggregate IO bandwidth

Increased protection against kernel tampering and user data corruption

Core, cache, memory and IO improvements for increased virtual machine performance

Enhanced power management and RAS capability for improved utilization of resources
