©2018 Micron Technology, Inc. All rights reserved. Information, products, and/or specifications are subject
to change without notice. All information is provided on an “AS IS” basis without warranties of any kind.
Statements regarding products, including regarding their features, availability, functionality, or
compatibility, are provided for informational purposes only and do not modify the warranty, if any,
applicable to any product. Drawings may not be to scale. Micron, the Micron logo, and all other Micron
trademarks are the property of Micron Technology, Inc. All other trademarks are the property of their
respective owners.
Balint Fleischer, Senior Director
Advanced Computing Solutions
September 25th, 2019

Accelerating Time to Insight: Performance-optimized emerging system architectures
Estimated data growth: Dark Data is growing rapidly.
1 Zettabyte = 10^21 (one billion trillion) bytes.
Rapidly growing data sets from many sources, including Dark Data
+ Increasingly complex algorithms
+ Need for faster Time to Insight
+ Affordability challenges

Perfect storm: Moore's Law is slowing, Dennard scaling is ending, and the von Neumann architecture has become a bottleneck.

B. Meisner, The Bump in the Road to ExaFlops and Rethinking LINPACK, HPC User Forum, June 2014
A large, industry-wide effort, but without a master plan, to ensure continued scaling:
New, more scalable architectures
Reducing data access cost
Making IO a first-class "citizen"
Bridging the latency gap with new memory technologies
New, non-von Neumann architectures scale well (for a while) with process technology improvements.
General-purpose CPU performance CAGR is declining.
Machine Learning + Deep Learning are emerging as the key application for data analytics.
A rich target for domain-specific computing.
On a given technology node, Domain-Specific Architectures deliver greater performance on targeted applications.
Pattern for domain-specific architectures (illustrated in the sketch below):
Simpler, lower-performance, more efficient processing elements
High degree of parallelism (1000s of processing elements)
Highly optimized on-die data movement
New, application-specific ISA
Custom compiler
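To make this pattern concrete, here is a minimal, hypothetical Python sketch (not from the presentation) that emulates an array of simple multiply-accumulate processing elements, each handling one output tile of a targeted operation; the tile size and the NumPy-based "PEs" are illustrative assumptions only.

```python
# Illustrative only: emulate a domain-specific accelerator built from many
# simple processing elements (PEs), each doing just multiply-accumulate work.
import numpy as np

TILE = 16  # each PE produces one TILE x TILE output tile

def pe_mac(a_rows, b_cols):
    """One 'processing element': bare multiply-accumulate over its tile."""
    return a_rows @ b_cols

def dsa_matmul(A, B, tile=TILE):
    """Map a matrix multiply onto many simple PEs.
    On real hardware every (i, j) tile would execute concurrently on its own PE,
    fed by a highly optimized on-die network; here each pe_mac call stands in
    for one PE's work."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            C[i:i + tile, j:j + tile] = pe_mac(A[i:i + tile, :], B[:, j:j + tile])
    return C

A = np.random.rand(64, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)
assert np.allclose(dsa_matmul(A, B), A @ B, atol=1e-3)
```

The point is the structure, not the math: many small, identical PEs with explicit data movement, rather than one large general-purpose core.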
An example of a Domain-Specific Architecture delivering greater performance: a ~400% performance jump vs. ~25% for CPU (NVIDIA Volta architecture white paper). Optimized AI processors deliver even better gains.
There are over 300 AI processor designs worldwide, innovating from the low end to the high end.
EETimes, 08.29.19, "Details of Hailo AI Edge Accelerator Emerge"
Hot Chips 2019, Cerebras presentation
Claimed performance: 2.8 TOPS @ 1.6 W; ~2x faster vs. CPU; 1/15th the power vs. CPU.
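Taking the claimed figures at face value, the implied energy efficiency is roughly

\[
\frac{2.8\ \text{TOPS}}{1.6\ \text{W}} \approx 1.75\ \text{TOPS/W}.
\]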
"Cost" of IO is critical to performance scaling and energy consumption.
The goal is improving bandwidth, reducing latency, and reducing data movement.
Improving data proximity is a vehicle to address IO cost.
[Diagram: Processing Element. * For simplicity, single-socket CPU hosts are shown.]
Various approaches.
NAND die capacity continues to grow, enabling denser SSDs and storage solutions.
Difficult to beat the NAND cost structure.
SSD IO bandwidth is lagging SSD capacity growth.
Queries of very large data sets will take an increasingly long time.
NAND stacks and SSDs are designed for capacity scaling.
Device-level aggregate bandwidth is ~100x the SSD external IO bandwidth; die-level bandwidth is ~1000x.
Moving processing into the SSD can benefit from this high embedded bandwidth.
Time to Insight on very large, shardable data will also benefit from massive parallelism.
652 GB/s of aggregate IO bandwidth at the SSD connectors; 9.6 TB/s of aggregate untapped raw internal NAND bandwidth, of which usable bandwidth can be as much as ~25%.
Parallel processing coupled with Near-Data Computing enables faster time to insight by opening up this internal bandwidth to applications.
[Diagram: two JBOFs of 24 NVMe SSDs each, with identical PCIe Gen5 interfaces, each attached to a host (CPU + DRAM) over a 54 GB/s link. Against the 652 GB/s aggregate connector bandwidth (24 x 13.6 GB/s), the host link is ~8% of the drive-bay bandwidth; against the 9.6 TB/s potential internal NAND bandwidth (24 x 400 GB/s), it is well under 1%. * For simplicity, single-socket CPU hosts are shown.]
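A minimal, hypothetical sketch of the near-data idea (not Micron code): each drive filters its own shard locally and returns only matching rows, so the 54 GB/s host link carries results rather than raw data. The bandwidth constants come from the figures above; the drive "API" and shard layout are illustrative assumptions.

```python
# Illustrative only: why pushing a filter into the drives helps.
HOST_LINK_GBPS = 54     # host <-> JBOF link (from the slide)
CONNECTOR_GBPS = 652    # aggregate SSD-connector bandwidth, 24 drives
NAND_GBPS = 9600        # aggregate raw internal NAND bandwidth (24 x 400 GB/s)

print(f"Host link is {HOST_LINK_GBPS / CONNECTOR_GBPS:.0%} of drive-bay bandwidth")
print(f"Host link is {HOST_LINK_GBPS / NAND_GBPS:.2%} of raw NAND bandwidth")

def drive_scan(shard, predicate):
    """Stand-in for a filter executed inside one SSD, close to the NAND."""
    return [row for row in shard if predicate(row)]

def near_data_query(shards, predicate):
    """Fan the predicate out to every drive; only results cross the host link."""
    results = []
    for shard in shards:  # on real hardware these scans would run concurrently
        results.extend(drive_scan(shard, predicate))
    return results

shards = [[(drive, i, i % 7) for i in range(1000)] for drive in range(24)]
matches = near_data_query(shards, lambda row: row[2] == 0)
print(f"{len(matches)} matching rows returned instead of {24 * 1000} raw rows")
```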
Near-Memory Computing can deliver 5-10x better bandwidth to applications in a 2.5D packaging technology.
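One way to read the 5-10x claim (an editorial gloss, not from the slide): for a bandwidth-bound kernel, runtime is roughly the bytes moved divided by delivered bandwidth, so the speedup tracks the bandwidth ratio:

\[
t \approx \frac{\text{bytes moved}}{\text{delivered bandwidth}}
\quad\Rightarrow\quad
\frac{t_{\text{baseline}}}{t_{\text{near-memory}}}
\approx \frac{BW_{\text{near-memory}}}{BW_{\text{baseline}}} \approx 5\ \text{to}\ 10.
\]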
Tighter integration of processing and memory can lead to even greater performance at lower power [Kim et al., NeuroCube, ISCA 2016].
Near-Memory Computing example: on-die integration of compute and memory can take advantage of ~160x on-die bandwidth on smaller data sets.
Large investment into improving platform IO:
Increasing integrated IO lane count to grow connectivity
Faster PCIe will enable higher-speed devices
Silicon photonics to extend reach
* For simplicity, single-socket CPU hosts are shown.
Emerging media technologies to improve capacity, latency, and access methods.
Emerging Memory research is intensifying.
Difficult to beat DRAM performance & energy, but DRAM density growth is slowing.
[Chart: memory technologies positioned along an axis from lower latency to higher density.]
Emerging Memories can fill data access latency gaps and enable new storage models.
Attaching Emerging Memory (EM) requires server architecture changes.
CXL is the emerging standard for EM attach.
[Diagram: Today vs. ~2022. Today: a host (CPU + DRAM, memory semantics) uses PCIe Gen4 to attach IO devices, SSDs, and accelerators with IO semantics. ~2022: a host (CPU + DRAM, memory semantics) uses CXL over PCIe Gen5 (2x IO bandwidth) to attach IO devices, SSDs, Emerging Memory, and coherent accelerators with memory + IO semantics. * For simplicity's sake, single-CPU hosts are shown.]
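To illustrate the IO-semantics vs. memory-semantics distinction, here is a small Python sketch under stated assumptions (it is not from the talk, and the device paths are hypothetical placeholders): block devices are reached through explicit read/write calls, while CXL-attached memory would appear as load/store-addressable memory the host can simply map and dereference.

```python
# Illustrative contrast between IO semantics and memory semantics.
import mmap
import os

def read_block_io(path, offset, length):
    """IO semantics: data moves only when software issues an explicit request,
    e.g. a read against an NVMe block device."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)

def load_store_view(path, length):
    """Memory semantics: map a byte-addressable region and access it directly.
    With CXL-attached memory, ordinary loads and stores are serviced by the
    hardware; mmap of a (hypothetical) device file is used here only as an analogy."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return mmap.mmap(fd, length, prot=mmap.PROT_READ)
    finally:
        os.close(fd)

# Hypothetical usage -- the device names are placeholders, not real paths:
# block_data = read_block_io("/dev/nvme0n1", offset=0, length=4096)
# mem_view   = load_store_view("/dev/dax0.0", length=4096)
# first_byte = mem_view[0]   # an ordinary load, no explicit IO call
```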
CXL enables innovations ranging from different memory types to heterogeneous computing.
Some possible memory examples: CXL block-mode EMM, CXL load/store (LD/ST) EMM, CXL LD/ST DRAM, a CXL AI engine with NAND, and CXL near-memory compute with DRAM.
Scaling Emerging Memory capacity is critical to address use-case requirements.
[Diagram: a host (CPU + DRAM, memory semantics) with directly attached CXL memory (~TB) vs. a host whose CXL expander (memory semantics) fans out to emerging-memory-based expansion modules (many TB); bandwidth callouts of 54 GB/s and 652 GB/s (24 x 13.6 GB/s). * For simplicity, single-CPU hosts are shown.]
The collection of these new technologies and architectures will impact how we build future data centers and systems.
Co-locating compute and data requires changes in data placement, provisioning, and load-balancing strategies (see the sketch below):
Strict provisioning and data placement rules
Emergence of non-uniform resource pools
Complex data partitioning models
* For simplicity, single-socket CPU hosts are shown.
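As a toy illustration of one such strategy change (an editorial sketch with made-up node and shard names): instead of provisioning compute and data independently, a locality-aware scheduler tries to run each task on the pool that already holds its shard, and moves data only as a fallback.

```python
# Illustrative only: place work where the data already lives.
# Node names, shard names, and capacities are made-up examples.
shard_location = {"shard-A": "node-1", "shard-B": "node-2", "shard-C": "node-2"}
free_slots = {"node-1": 2, "node-2": 1, "node-3": 2}

def place_task(shard):
    """Prefer the node holding the shard; otherwise fall back to any free node,
    which then has to pull the shard over the network."""
    home = shard_location[shard]
    if free_slots.get(home, 0) > 0:
        free_slots[home] -= 1
        return home, "local"
    for node, slots in free_slots.items():
        if slots > 0:
            free_slots[node] -= 1
            return node, "remote (data must be moved)"
    raise RuntimeError("no capacity in any resource pool")

for shard in ["shard-A", "shard-B", "shard-C"]:
    node, kind = place_task(shard)
    print(f"{shard} -> {node} ({kind})")
```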
[Diagram: heterogeneous resource pools example. * For simplicity, single-socket CPU hosts are shown.]
Summary
Exciting times. Data growth and the need for faster insight are driving a transition from decades-old architectures to a new, emerging model utilizing breakthrough technologies.
Buckle your seatbelt!