©2018 Micron Technology, Inc. All rights reserved. Information, products, and/or specifications are subject
to change without notice. All information is provided on an “AS IS” basis without warranties of any kind.
Statements regarding products, including regarding their features, availability, functionality, or
compatibility, are provided for informational purposes only and do not modify the warranty, if any,
applicable to any product. Drawings may not be to scale. Micron, the Micron logo, and all other Micron
trademarks are the property of Micron Technology, Inc. All other trademarks are the property of their
respective owners.
Balint Fleischer, Senior Director
Advanced Computing Solutions
September 25th, 2019

Accelerating Time to Insight: Performance-optimized emerging system architectures
Estimated data growth: Dark Data is growing rapidly.
1 Zettabyte = 10^21 (one billion trillion) bytes.
Rapidly growing data sets from many sources, including Dark Data
+ Increasingly complex algorithms
+ Need for faster Time to Insight
+ Affordability challenges

Perfect storm: Moore's Law is slowing, Dennard scaling is ending, and the von Neumann architecture has become a bottleneck.

B. Meisner, The Bump in the Road to ExaFlops and Rethinking LINPACK, HPC User Forum, June 2014
A large, industry-wide effort, but without a master plan, to ensure continued scaling:
New, more scalable architectures
Reducing data access cost
Making IO a first-class "citizen"
Bridging the latency gap with new memory technologies
New, non-von Neumann architectures scale well (for a while) with process technology improvements.
General-purpose CPU performance CAGR is declining.
Machine Learning + Deep Learning are emerging as the key application for data analytics.
A rich target for domain-specific computing.
On a given technology node, Domain-Specific Architectures deliver greater performance on targeted applications.
Pattern for domain-specific architectures (illustrated in the sketch below):
Simpler, lower-performance, more efficient processing elements
High degree of parallelism (1000s of processing elements)
Highly optimized on-die data movement
New, application-specific ISA
Custom compiler
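To make this pattern concrete, here is a minimal, hypothetical Python sketch (not from the presentation) that emulates an array of simple multiply-accumulate processing elements, each handling one output tile of a targeted operation; the tile size and the NumPy-based "PEs" are illustrative assumptions only.

```python
# Illustrative only: emulate a domain-specific accelerator built from many
# simple processing elements (PEs), each doing just multiply-accumulate work.
import numpy as np

TILE = 16  # each PE produces one TILE x TILE output tile

def pe_mac(a_rows, b_cols):
    """One 'processing element': bare multiply-accumulate over its tile."""
    return a_rows @ b_cols

def dsa_matmul(A, B, tile=TILE):
    """Map a matrix multiply onto many simple PEs.
    On real hardware every (i, j) tile would execute concurrently on its own PE,
    fed by a highly optimized on-die network; here each pe_mac call stands in
    for one PE's work."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            C[i:i + tile, j:j + tile] = pe_mac(A[i:i + tile, :], B[:, j:j + tile])
    return C

A = np.random.rand(64, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)
assert np.allclose(dsa_matmul(A, B), A @ B, atol=1e-3)
```

The point is the structure, not the math: many small, identical PEs with explicit data movement, rather than one large general-purpose core.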
An example of a Domain-Specific Architecture delivering greater performance: a ~400% performance jump vs. ~25% for CPU (NVIDIA Volta architecture white paper). Optimized AI processors deliver even better gains.
There are over 300 AI processor designs worldwide, innovating from the low end to the high end.
EETimes, 08.29.19, "Details of Hailo AI Edge Accelerator Emerge"
Hot Chips 2019, Cerebras presentation
Claimed performance: 2.8 TOPS @ 1.6 W; ~2x faster vs. CPU; 1/15th the power vs. CPU.
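Taking the claimed figures at face value, the implied energy efficiency is roughly

\[
\frac{2.8\ \text{TOPS}}{1.6\ \text{W}} \approx 1.75\ \text{TOPS/W}.
\]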
"Cost" of IO is critical to performance scaling and energy consumption.
The goal is improving bandwidth, reducing latency, and reducing data movement.
Improving data proximity is a vehicle to address IO cost.
[Diagram: Processing Element. * For simplicity, single-socket CPU hosts are shown.]
Various approaches.
NAND die capacity continues to grow, enabling denser SSDs and storage solutions.
Difficult to beat the NAND cost structure.
SSD IO bandwidth is lagging SSD capacity growth.
Queries of very large data sets will take an increasingly long time.
NAND stacks and SSDs are designed for capacity scaling.
Device-level aggregate bandwidth is ~100x the SSD external IO bandwidth; die-level bandwidth is ~1000x.
Moving processing into the SSD can benefit from this high embedded bandwidth.
Time to Insight on very large, shardable data will also benefit from massive parallelism.
652 GB/s of aggregate IO bandwidth at the SSD connectors; 9.6 TB/s of aggregate untapped raw internal NAND bandwidth, of which usable bandwidth can be as much as ~25%.
Parallel processing coupled with Near-Data Computing enables faster time to insight by opening up this internal bandwidth to applications.
[Diagram: two JBOFs of 24 NVMe SSDs each, with identical PCIe Gen5 interfaces, each attached to a host (CPU + DRAM) over a 54 GB/s link. Against the 652 GB/s aggregate connector bandwidth (24 x 13.6 GB/s), the host link is ~8% of the drive-bay bandwidth; against the 9.6 TB/s potential internal NAND bandwidth (24 x 400 GB/s), it is well under 1%. * For simplicity, single-socket CPU hosts are shown.]
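A minimal, hypothetical sketch of the near-data idea (not Micron code): each drive filters its own shard locally and returns only matching rows, so the 54 GB/s host link carries results rather than raw data. The bandwidth constants come from the figures above; the drive "API" and shard layout are illustrative assumptions.

```python
# Illustrative only: why pushing a filter into the drives helps.
HOST_LINK_GBPS = 54     # host <-> JBOF link (from the slide)
CONNECTOR_GBPS = 652    # aggregate SSD-connector bandwidth, 24 drives
NAND_GBPS = 9600        # aggregate raw internal NAND bandwidth (24 x 400 GB/s)

print(f"Host link is {HOST_LINK_GBPS / CONNECTOR_GBPS:.0%} of drive-bay bandwidth")
print(f"Host link is {HOST_LINK_GBPS / NAND_GBPS:.2%} of raw NAND bandwidth")

def drive_scan(shard, predicate):
    """Stand-in for a filter executed inside one SSD, close to the NAND."""
    return [row for row in shard if predicate(row)]

def near_data_query(shards, predicate):
    """Fan the predicate out to every drive; only results cross the host link."""
    results = []
    for shard in shards:  # on real hardware these scans would run concurrently
        results.extend(drive_scan(shard, predicate))
    return results

shards = [[(drive, i, i % 7) for i in range(1000)] for drive in range(24)]
matches = near_data_query(shards, lambda row: row[2] == 0)
print(f"{len(matches)} matching rows returned instead of {24 * 1000} raw rows")
```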
Near-Memory Computing can deliver 5-10x better bandwidth to applications in a 2.5D packaging technology.
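One way to read the 5-10x claim (an editorial gloss, not from the slide): for a bandwidth-bound kernel, runtime is roughly the bytes moved divided by delivered bandwidth, so the speedup tracks the bandwidth ratio:

\[
t \approx \frac{\text{bytes moved}}{\text{delivered bandwidth}}
\quad\Rightarrow\quad
\frac{t_{\text{baseline}}}{t_{\text{near-memory}}}
\approx \frac{BW_{\text{near-memory}}}{BW_{\text{baseline}}} \approx 5\ \text{to}\ 10.
\]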
Tighter integration of processing and memory can lead to even greater performance at lower power [Kim et al., NeuroCube, ISCA 2016].
Near-Memory Computing example: on-die integration of compute and memory can take advantage of ~160x on-die bandwidth on smaller data sets.
Large investment into improving platform IO:
Increasing integrated IO lane count to grow connectivity
Faster PCIe will enable higher-speed devices
Silicon photonics to extend reach
* For simplicity, single-socket CPU hosts are shown.
Emerging media technologies to improve capacity, latency, and access methods.
Emerging Memory research is intensifying.
Difficult to beat DRAM performance & energy, but DRAM density growth is slowing.
[Chart: memory technologies positioned along an axis from lower latency to higher density.]
Emerging Memories can fill data access latency gaps and enable new storage models.
Attaching Emerging Memory (EM) requires server architecture changes.
CXL is the emerging standard for EM attach.
[Diagram: Today vs. ~2022. Today: a host (CPU + DRAM, memory semantics) uses PCIe Gen4 to attach IO devices, SSDs, and accelerators with IO semantics. ~2022: a host (CPU + DRAM, memory semantics) uses CXL over PCIe Gen5 (2x IO bandwidth) to attach IO devices, SSDs, Emerging Memory, and coherent accelerators with memory + IO semantics. * For simplicity's sake, single-CPU hosts are shown.]
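To illustrate the IO-semantics vs. memory-semantics distinction, here is a small Python sketch under stated assumptions (it is not from the talk, and the device paths are hypothetical placeholders): block devices are reached through explicit read/write calls, while CXL-attached memory would appear as load/store-addressable memory the host can simply map and dereference.

```python
# Illustrative contrast between IO semantics and memory semantics.
import mmap
import os

def read_block_io(path, offset, length):
    """IO semantics: data moves only when software issues an explicit request,
    e.g. a read against an NVMe block device."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)

def load_store_view(path, length):
    """Memory semantics: map a byte-addressable region and access it directly.
    With CXL-attached memory, ordinary loads and stores are serviced by the
    hardware; mmap of a (hypothetical) device file is used here only as an analogy."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return mmap.mmap(fd, length, prot=mmap.PROT_READ)
    finally:
        os.close(fd)

# Hypothetical usage -- the device names are placeholders, not real paths:
# block_data = read_block_io("/dev/nvme0n1", offset=0, length=4096)
# mem_view   = load_store_view("/dev/dax0.0", length=4096)
# first_byte = mem_view[0]   # an ordinary load, no explicit IO call
```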
CXL enables innovations ranging from different memory types to heterogeneous computing.
Some possible memory examples: CXL block-mode EMM, CXL load/store (LD/ST) EMM, CXL LD/ST DRAM, a CXL AI engine with NAND, and CXL near-memory compute with DRAM.
Scaling Emerging Memory capacity is critical to address use-case requirements.
[Diagram: a host (CPU + DRAM, memory semantics) with directly attached CXL memory (~TB) vs. a host whose CXL expander (memory semantics) fans out to emerging-memory-based expansion modules (many TB); bandwidth callouts of 54 GB/s and 652 GB/s (24 x 13.6 GB/s). * For simplicity, single-CPU hosts are shown.]
The collection of these new technologies and architectures will impact how we build future data centers and systems.
Co-locating compute and data requires changes in data placement, provisioning, and load-balancing strategies (see the sketch below):
Strict provisioning and data placement rules
Emergence of non-uniform resource pools
Complex data partitioning models
* For simplicity, single-socket CPU hosts are shown.
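As a toy illustration of one such strategy change (an editorial sketch with made-up node and shard names): instead of provisioning compute and data independently, a locality-aware scheduler tries to run each task on the pool that already holds its shard, and moves data only as a fallback.

```python
# Illustrative only: place work where the data already lives.
# Node names, shard names, and capacities are made-up examples.
shard_location = {"shard-A": "node-1", "shard-B": "node-2", "shard-C": "node-2"}
free_slots = {"node-1": 2, "node-2": 1, "node-3": 2}

def place_task(shard):
    """Prefer the node holding the shard; otherwise fall back to any free node,
    which then has to pull the shard over the network."""
    home = shard_location[shard]
    if free_slots.get(home, 0) > 0:
        free_slots[home] -= 1
        return home, "local"
    for node, slots in free_slots.items():
        if slots > 0:
            free_slots[node] -= 1
            return node, "remote (data must be moved)"
    raise RuntimeError("no capacity in any resource pool")

for shard in ["shard-A", "shard-B", "shard-C"]:
    node, kind = place_task(shard)
    print(f"{shard} -> {node} ({kind})")
```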
[Diagram: heterogeneous resource pools example. * For simplicity, single-socket CPU hosts are shown.]
Summary
Exciting times. Data growth and the need for faster insight are driving a transition from decades-old architectures to a new, emerging model utilizing breakthrough technologies.
Buckle your seatbelt!