
  • Big Data/AI and HPC Convergence Towards Post-K

    Satoshi Matsuoka
    Director, Riken Center for Computational Science / Professor, Tokyo Institute of Technology

    ACM HPDC 2018 Keynote Talk, June 14, 2018

  • Tokyo Tech. TSUBAME Supercomputing History

    World's leading supercomputers: x100,000 speedup in 17 years, developing world-leading use of massively parallel, many-core technology (general-purpose CPUs & many-core processors (GPUs), advanced optical networks, non-volatile memory, efficient power control and cooling).

    2000  Custom supercomputer, 128 Gigaflops, 32 cores (Matsuoka's GSIC appointment)
    2002  "TSUBAME0", 1.3 Teraflops: first "TeraScale" Japanese university supercomputer, 800 cores
    2006  TSUBAME1.0, 80 Teraflops: No.1 Asia, No.7 World, 10,000 cores
    2008  TSUBAME1.2, 170 Teraflops: world's first GPU supercomputer
    2010  TSUBAME2.0, 2.4 Petaflops: World No.1 Production Green, ACM Gordon Bell Prize
    2013  TSUBAME2.5, 4118 GPUs upgraded: 5.7 Petaflops, No.2 Japan (AI Flops 17.1 Petaflops)
    2013  TSUBAME-KFC: TSUBAME3 prototype, oil immersion cooling, Green World No.1
    2015  AI prototype upgrade (KFC/DL)
    2017  TSUBAME3.0, > 10 million cores, 12.1 Petaflops (AI Flops 47.2 Petaflops), Green World No.1 — HPC and Big Data / AI convergence

  • JST-CREST "Extreme Big Data" Project (2013-2018)

    Supercomputers: compute & batch oriented, but more fragile. Cloud IDCs: highly available and resilient, but very low bandwidth & efficiency.

    Convergent Architecture (Phases 1~4): large-capacity NVM, high-bisection network.
    [Architecture diagram: a high-powered main CPU and low-power CPUs on a TSV interposer/PCB, each with DRAM and NVM/Flash stacks; 2 Tbps HBM, 4~6 HBM channels, 1.5 TB/s DRAM & NVM BW; 30 PB/s I/O BW possible, 1 Yottabyte / year]

    EBD System Software, incl. the EBD Object System (Graph Store, EBD Bag, EBD KVS, Cartesian Plane), co-designed with future non-silo extreme big data scientific apps:
    - Large-scale metagenomics
    - Massive sensors and data assimilation in weather prediction
    - Ultra-large-scale graphs and social infrastructures

    Exascale Big Data HPC co-design.

    Given a top-class supercomputer, how fast can we accelerate next-generation big data, c.f. clouds? Bring HPC rigor in architectural, algorithmic, and system software performance and modeling into big data.

  • Neglected Tropical Diseases (NTDs)

    • Diseases prevalent in tropical areas due to a lack of drugs, whose development is impaired owing to poor economic conditions
    • The World Health Organization (WHO) defines 17 NTDs
    • More than 1 billion infections, ½ million deaths
    • Some are making their way to developed countries due to global warming!

    [Figures: Leishmaniasis disease (source: DNDi); Trypanosoma forms in a blood smear; insect vector; world dengue fever distribution (source: WHO)]

    Japanese government promises African aid during the 2013 TICAD V meeting in Yokohama

    Speaker notes: Neglected Tropical Diseases (NTDs) is the collective term for infectious diseases prevalent mainly in tropical regions, for which drug development is insufficient because of patients' economic circumstances. The World Health Organization (WHO) currently defines 17 NTDs, including dengue fever and Chagas disease; more than 1 billion people worldwide are infected and more than 500,000 die every year.

    There are almost no NTD patients in Japan. However, in the Yokohama Declaration of the 4th Tokyo International Conference on African Development held in Yokohama in 2008, Japan pledged support for NTD countermeasures in Africa, and at the 5th TICAD in 2013 Prime Minister Abe declared that this support would continue.

    http://en.wikipedia.org/wiki/Blood_film

  • EBD vs. EBD: Large-Scale Homology Search for Metagenomics [Akiyama et al., Tokyo Tech]

    Metagenomics with next-generation sequencers:
    - Revealing uncultured microbiomes and finding novel genes in various environments (human body, sea, soil)
    - Applied to human health in recent years
    - Outputs: taxonomic composition, metabolic pathways

    The core computation is an EBD-vs-EBD correlation / similarity search: O(n) measurement data against an O(m) reference database, an O(m n) calculation, with both sides increasing rapidly.

    Example: metagenomic analysis of periodontitis patients (with Prof. Kazuyuki Ishihara, Tokyo Dental College) — comparative metagenomic analysis between healthy persons and patients; high-risk microorganisms are detected.
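    As an illustration of how the O(m n) all-vs-all comparison is tamed in practice, the sketch below shows plain k-mer seed filtering: candidate references are selected by shared k-mers before any expensive alignment. This is a minimal, hypothetical example, not the GHOSTZ algorithm (which additionally clusters subsequences and performs gapped extension); the sequences and thresholds are made up.

```python
# Minimal sketch of k-mer seed filtering for homology search.
# Illustrative only: real tools (BLAST, GHOSTZ) add seed clustering,
# gapped extension, and statistical scoring.
from collections import defaultdict

def build_kmer_index(reference_seqs, k=4):
    """Map every k-mer to the reference sequences containing it."""
    index = defaultdict(set)
    for ref_id, seq in reference_seqs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(ref_id)
    return index

def candidate_hits(query, index, k=4, min_shared_kmers=2):
    """Return reference ids sharing enough k-mer seeds with the query,
    so that expensive alignment is run only on these candidates."""
    counts = defaultdict(int)
    for i in range(len(query) - k + 1):
        for ref_id in index.get(query[i:i + k], ()):
            counts[ref_id] += 1
    return [r for r, c in counts.items() if c >= min_shared_kmers]

refs = {"refA": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "refB": "MSERKVLALQARKKRTK"}
idx = build_kmer_index(refs)
print(candidate_hits("AYIAKQRQISFV", idx))   # expected: ['refA']
```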

  • Development of Ultra-fast Homology Search Tools
    x100,000 ~ x1,000,000 c.f. a high-end BLAST workstation (both FLOPS and BYTES)

    GHOSTZ: subsequence clustering
    [Chart: computational time for 10,000 sequences (sec.), 3.9 GB DB, 1 CPU core; BLAST vs. GHOSTZ on a log scale from 1 to 1,000,000]
    Suzuki, et al., Bioinformatics, 2015.

    GHOSTZ-GPU: multithreading on GPUs; ×70 faster than 1 core using 12 cores + 3 GPUs
    [Chart: speed-up ratio over 1 core for the 1C, 1C+1G, 12C+1G, 12C+3G configurations]
    Suzuki, et al., PLOS ONE, 2016.

    GHOST-MP: MPI + OpenMP hybrid parallelization; ×240 faster than the conventional algorithm, retaining strong scaling up to 100,000 cores; ×80~×100 faster than mpi-BLAST on TSUBAME 2.5 (thin node GPU)
    Kakuta, et al. (submitted)

  • Tokyo Tech IT-Drug Discovery MIDL: Simulation & Big Data & AI at Top HPC Scale
    (Tonomachi, Kawasaki city: planned 2017, PI Yutaka Akiyama)

    Tokyo Tech's research seeds:
    ① Drug target selection system
    ② Glide-based virtual screening: TSUBAME's GPU environment allows world's top-tier virtual screening
       • Yoshino et al., PLOS ONE (2015)  • Chiba et al., Sci Rep (2015)
    ③ Novel algorithms for fast virtual screening against huge databases: fragment-based efficient algorithm designed for 100-million-compound data
       • Yanagisawa et al., GIW (2016)

    Application projects: new drug discovery platform especially for specialty peptides and nucleic acids — plasma binding (ML-based), membrane penetration (molecular dynamics simulation).

    Minister of Health, Labour and Welfare Award of the 11th annual Merit Awards for Industry-Academia-Government Collaboration.

    Drug discovery platform powered by supercomputing and machine learning. Investments from the Japanese government, Tokyo Tech (TSUBAME SC), municipal government (Kawasaki), and Japanese & US pharma.

    Multi-petaflops compute and peta~exabytes of data processed continuously: cutting-edge, large-scale HPC & BD/AI infrastructure is absolutely necessary.

    Speaker notes: Prof. Akiyama's research group in the School of Computing has been working on computational drug discovery. They developed an original system named "iNTRODB", an intelligent system for supporting drug target selection; it helps users find good drug target proteins. They applied this system in a drug discovery project for tropical diseases, and using iNTRODB they selected four promising target proteins (lower-left figure). They then used our supercomputer TSUBAME for virtual screening, and finally found several promising drug candidate compounds (lower-right figure).

  • Pioneering “Big Data Assimilation” Era

    Mutual feedback

    High-precision Simulations

    High-precision observations

    Future-generation technologies available 10 years in advance

  • EBD App 2: Takemasa Miyoshi Group (Weather Forecast Application)

    Big Data Assimilation for severe weather forecasting

    Revolutionary super-rapid 30-sec. cycle: 120 times more rapid than hourly update cycles

    Goal: pinpoint (100-m resolution) forecast of severe local weather by updating a 30-min forecast every 30 sec!

    Only in 10 minutes!
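    The assimilation step that such a 30-second cycle repeats can be illustrated with a generic stochastic ensemble Kalman filter update; the sketch below is a toy NumPy version on a made-up 3-variable state, not the group's LETKF implementation.

```python
# Toy stochastic Ensemble Kalman Filter analysis step (illustrative of a
# data-assimilation update; not the SCALE/LETKF code used by the group).
import numpy as np

def enkf_update(ensemble, obs, H, obs_err_std, rng):
    """ensemble: (state_dim, members); obs: (obs_dim,); H: linear obs operator."""
    state_dim, m = ensemble.shape
    mean = ensemble.mean(axis=1, keepdims=True)
    A = ensemble - mean                         # forecast perturbations
    HA = H @ A
    P_HT = A @ HA.T / (m - 1)                   # P H^T estimated from the ensemble
    S = HA @ HA.T / (m - 1) + obs_err_std**2 * np.eye(len(obs))
    K = P_HT @ np.linalg.inv(S)                 # Kalman gain
    # Perturbed observations (stochastic EnKF variant)
    obs_pert = obs[:, None] + obs_err_std * rng.standard_normal((len(obs), m))
    return ensemble + K @ (obs_pert - H @ ensemble)

rng = np.random.default_rng(0)
truth = np.array([1.0, -2.0, 0.5])
H = np.eye(3)[:2]                               # observe the first two state variables
ens = truth[:, None] + rng.standard_normal((3, 40))   # 40-member prior ensemble
obs = H @ truth + 0.1 * rng.standard_normal(2)
analysis = enkf_update(ens, obs, H, 0.1, rng)
print("prior mean:   ", ens.mean(axis=1).round(2))
print("analysis mean:", analysis.mean(axis=1).round(2), " truth:", truth)
```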

  • EBD System Software (Matsuoka-G)

    • Big Data algorithms for accelerators (GPUs and FPGAs, low-level kernels for DNN & graphs) — a minimal SpMV sketch follows this list
      • Fast and memory-saving SpGEMM on GPUs
      • Accelerating SpMV on GPUs by reducing memory access
      • OpenCL-based high-performance 3D stencil computation on FPGAs
      • Evaluating strategies to accelerate applications using FPGAs
      • Accelerating spiking neural networks on FPGAs
      • Directive-based temporal-blocking application

    • Large-scale graph algorithms and sorting
      • No.1 on the Graph500 benchmark, 5 consecutive times (collab. w/ Kyushu-U, Riken, etc.)
      • Distributed large-scale dynamic graph data store & large-scale graph colouring (vertex coloring)
      • Dynamic graph data structure using local NVRAM
      • Incremental graph community detection
      • ScaleGraph: large-scale graph processing framework w/ user-friendly interface
      • GPU-HykSort: large-scale sorting on massive GPUs
      • XtrSort: GPU out-of-core sorting
      • Efficient parallel sorting algorithm for variable-length keys

    • Big-data performance modeling and analysis
      • Co-locating HPC and big data analytics
      • Visualizing traffic of large-scale networks
      • I/O vs. MPI traffic interference on fat-tree networks
      • ibprof: low-level profiler of MPI network traffic
      • Evaluation of HPC-big data applications in clouds
      • Analysis on configurations of burst buffers

    • High-performance big-data programming middleware
      • mrCUDA: remote-to-local GPU migration middleware
      • Transpiler between Python and Fortran
      • Hamar (Highly Accelerated Map Reduce)
      • Out-of-core GPU MapReduce for large-scale graph processing
      • DRAGON: extending UVM to NVMe
      • Hierarchical, UseR-level and ON-demand File system (HuronFS)

    • Optimizing traffic simulation app (ex-Suzumura Group)
      • Incremental graph community detection
      • DeepGraph
      • Exact-differential traffic simulation
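    The SpMV sketch referenced above: a minimal CSR sparse matrix-vector product in NumPy/SciPy, included only to make explicit the irregular gather (x[indices[...]]) that the GPU SpMV/SpGEMM work reduces. It is an illustration, not the group's GPU kernel.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def spmv_csr(indptr, indices, data, x):
    """y = A @ x for A in CSR form. The gather x[indices[...]] is the
    irregular memory access that GPU SpMV kernels try to minimize."""
    y = np.zeros(len(indptr) - 1, dtype=data.dtype)
    for row in range(len(y)):
        start, end = indptr[row], indptr[row + 1]
        y[row] = np.dot(data[start:end], x[indices[start:end]])
    return y

A = sparse_random(1000, 1000, density=0.01, format="csr", dtype=np.float64)
x = np.random.rand(1000)
y = spmv_csr(A.indptr, A.indices, A.data, x)
assert np.allclose(y, A @ x)   # check against SciPy's own SpMV
```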

  • The size of graphs

    [Scatter plot: number of edges (log2(m)) vs. number of vertices (log2(n)) for USA road networks (NY, LKS, USA), the Human Brain Project, Twitter (tweets/day), symbolic networks, and the Graph500 problem classes Toy, Mini, Small, Medium, Large, Huge; reference lines at 1 billion / 1 trillion nodes and edges]

    K computer: 65,536 nodes, Graph500: 17,977 GTEPS
    Mobile phone (SONY SO-01F, Snapdragon S4 1.7 GHz 4-core, 2 GB RAM): 1.03 GTEPS, 235.06 MTEPS/W

  • Sparse BYTES: The Graph500 – 2015~2016 – World #1 x 4
    K computer #1: Tokyo Tech [Matsuoka EBD CREST], Univ. Kyushu [Fujisawa Graph CREST], Riken AICS, Fujitsu

    List                            | Rank | GTEPS    | Implementation
    November 2013                   | 4    | 5524.12  | Top-down only
    June 2014                       | 1    | 17977.05 | Efficient hybrid
    November 2014                   | 2    | 19585.2  | Efficient hybrid
    June, Nov 2015; June, Nov 2016  | 1    | 38621.4  | Hybrid + node compression

    A BYTES-rich machine + a superior BYTES algorithm:
    - K computer: 88,000 nodes, 660,000 CPU cores, 1.3 Petabytes memory, 20 GB/s Tofu network — #1, 38,621.4 GTEPS (#7 on Top500, 10.51 PF)
    - Sunway TaihuLight: 10 million CPUs, 1.3 Petabytes memory — #2, 23,755.7 GTEPS (#1 on Top500, 93.01 PF)
    - LLNL-IBM Sequoia: 1.6 million CPUs, 1.6 Petabytes memory — #3, 23,751 GTEPS (#4 on Top500, 17.17 PF)

    Effectively x13 performance relative to the Linpack ranking: BYTES, not FLOPS!

    [Chart: elapsed time (ms) per BFS at 64 nodes (Scale 30) vs. 65,536 nodes (Scale 40); 73% of total execution time is spent waiting in communication]

  • K computer No.1 on Graph500: 5 Consecutive Times

    • What is the Graph500 benchmark?
      • A supercomputer benchmark for data-intensive applications.
      • Ranks supercomputers by the performance of Breadth-First Search on very large graph data.

    [Chart: Graph500 performance (GTEPS) from June 2012 to June 2016 for the K computer (Japan), Sequoia (U.S.A.), and Sunway TaihuLight (China); K computer No.1]

    This is achieved by a combination of high machine performance and our software optimization:
    • Efficient sparse matrix representation with bitmaps
    • Vertex reordering for bitmap optimization
    • Optimizing inter-node communications
    • Load balancing
    etc. (a sketch of the hybrid top-down/bottom-up BFS idea follows below)

    • Koji Ueno, Toyotaro Suzumura, Naoya Maruyama, Katsuki Fujisawa, and Satoshi Matsuoka, "Efficient Breadth-First Search on Massively Parallel and Distributed Memory Machines", in proceedings of 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington D.C., Dec. 5-8, 2016 (to appear)
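    The "efficient hybrid" entries above refer to direction-optimizing BFS, which switches between top-down and bottom-up sweeps. The sketch below shows the single-node idea with a simplified switching heuristic; the actual K computer implementation adds bitmap representations, vertex reordering, and distributed-memory communication across tens of thousands of nodes.

```python
# Sketch of direction-optimizing BFS (top-down/bottom-up switching by
# frontier size). Single-node, adjacency-list graph; the Graph500 code
# distributes this with bitmaps and MPI.

def hybrid_bfs(adj, source, alpha=4):
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = {source}
    while frontier:
        frontier_edges = sum(len(adj[v]) for v in frontier)
        unvisited_edges = sum(len(adj[v]) for v in range(n) if parent[v] == -1)
        next_frontier = set()
        if frontier_edges * alpha > unvisited_edges:
            # Bottom-up: every unvisited vertex scans its neighbours for a parent.
            for v in range(n):
                if parent[v] == -1:
                    for u in adj[v]:
                        if u in frontier:
                            parent[v] = u
                            next_frontier.add(v)
                            break
        else:
            # Top-down: expand outward from the frontier.
            for u in frontier:
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        next_frontier.add(v)
        frontier = next_frontier
    return parent

adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
print(hybrid_bfs(adj, 0))   # e.g. [0, 0, 0, 1, 3]
```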

  • Distributed Large-Scale Dynamic Graph Data Store
    Keita Iwabuchi (1,2), Scott Sallinen (3), Roger Pearce (2), Brian Van Essen (2), Maya Gokhale (2), Satoshi Matsuoka (1)
    1. Tokyo Institute of Technology (Tokyo Tech)  2. Lawrence Livermore National Laboratory (LLNL)  3. University of British Columbia

    Dynamic graphs (temporal graphs)
    • The structure of the graph changes dynamically over time
    • Many real-world graphs are classified as dynamic graphs

    Sparse, large, scale-free graphs: social networks, genome analysis, WWW, etc. — e.g., Facebook managed 1.39 billion active users as of 2014, with more than 400 billion edges.

    • Most studies of large graphs have focused not on dynamic graph data structures but rather on static ones, such as Graph500
    • Even with the large memory capacities of HPC systems, many graph applications require additional out-of-core memory (this part is still at an early stage)

    Source: Jakob Enemark and Kim Sneppen, "Gene duplication models for directed networks with limits on growth", Journal of Statistical Mechanics: Theory and Experiment 2007

  • Distributed Large-Scale Dynamic Graph Data Store (cont.)

    1. How to store dynamic graphs in local memory? (high-speed graph update and lookup)
    2. How to extend to distributed-memory platforms? (efficient communication for real-time processing)

    DegAwareRHH:
    1. Leverages a linear-probing hash table to increase sequential locality while keeping dynamic graph update performance (see the hashing sketch after this slide)
    2. Adopts an asynchronous communication framework to localize communication

    Performance comparison against state-of-the-art works (STINGER, PowerGraph):
    [Charts: dynamic graph construction against STINGER (single node, in-core) and PowerGraph (static processing only), speed-up over 6/12/24 threads/processes; and dynamic graph colouring (in-core) [SC'16], edge inserts/s (millions) on 1 to 64 compute nodes; speedups of 2.1X and 200X shown]

    TOKYO INSTITUTE OF TECHNOLOGY. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-PRES-726942).
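    Assuming RHH stands for Robin Hood hashing, the sketch below illustrates the linear-probing scheme referred to in point 1: on a collision, the entry that has probed furthest from its home slot keeps the slot, so lookups stay within short, cache-friendly scans. It is an illustration of the probing idea only, not the DegAwareRHH data structure.

```python
# Minimal open-addressing hash table with Robin Hood insertion.
class RobinHoodTable:
    def __init__(self, capacity=16):
        self.slots = [None] * capacity          # each slot: (key, value, distance)

    def insert(self, key, value):
        cap = len(self.slots)
        idx = hash(key) % cap
        entry = (key, value, 0)                 # distance from its home slot
        while True:
            cur = self.slots[idx]
            if cur is None:
                self.slots[idx] = entry
                return
            if cur[0] == entry[0]:              # update an existing key
                self.slots[idx] = (cur[0], entry[1], cur[2])
                return
            if cur[2] < entry[2]:               # steal from the "richer" entry
                self.slots[idx], entry = entry, cur
            idx = (idx + 1) % cap
            entry = (entry[0], entry[1], entry[2] + 1)

    def get(self, key):
        cap = len(self.slots)
        idx, dist = hash(key) % cap, 0
        while True:
            cur = self.slots[idx]
            if cur is None or cur[2] < dist:    # would have been placed earlier
                return None
            if cur[0] == key:
                return cur[1]
            idx, dist = (idx + 1) % cap, dist + 1

t = RobinHoodTable()
t.insert("a", 1); t.insert("b", 2)
print(t.get("a"), t.get("b"), t.get("missing"))   # 1 2 None
```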

  • Large-scale Graph Processing Framework w/ User-Friendly Interface

    • ScaleGraph
      • X10-based open-source, highly scalable large-scale graph analytics library, beyond the scale of billions of vertices and edges, on distributed systems
      • XPregel: Pregel-based bulk-synchronous parallel graph processing framework
      • Built-in graph algorithms (centrality, connected components, clustering, etc.)
    • Python interface
      • Allows users to use ScaleGraph together with Spark* via an easy Python interface (user Python script → Spark (RDD) / HDFS / ScaleGraph on the cluster)

    Software stack: User Program → Graph Algorithm → XPregel (graph processing system) → ScaleGraph Base Library → X10 Standard Lib, X10 Sparse Matrix, BLAS, File IO, third-party libraries (ARPACK, METIS) → MPI (X10 & C++ team)

    *Apache Spark: http://spark.apache.org/

  • Modern AI is Enabled by Supercomputing

    • 25 years of AI winter after the failure of symbolic-logic-based methods (e.g., Prolog, ICOT) -> resurrection by DNNs: the basic algorithms existed in the 1980s but were too expensive -> HPC made machines 10 million times faster in 30 years -> expensive training is now possible

    • Recent trends require even more supercomputing power
      – Deeper, more complex networks (Capsule Networks)
      – Complex, multidimensional data (e.g., 3-D hi-res images)
      – Increasing training sets (incl. GANs)
      – Coupling with high-fidelity simulations
      – Etc.

    Fig. 2: Andrew Ng (Baidu), "What Data Scientists Should Know about Deep Learning"

  • 4 Layers of Parallelism in DNN Training, Well Supported in Post-K

    • Hyperparameter search
      • Searching optimal network configurations & parameters
      • Parallel search; massive parallelism required
    • Data parallelism
      • Copy the network to compute nodes, feed different batch data, average => bound by the network reduction
      • TOFU: extremely strong reduction, x6 EDR InfiniBand
    • Model parallelism (domain decomposition)
      • Split and parallelize the layer calculations in propagation
      • Low latency required (bad for GPUs) -> strongly latency-tolerant cores + low-latency TOFU network
    • Intra-chip ILP, vector and other low-level parallelism
      • Parallelize the convolution operations, etc.
      • SVE FP16+INT8 vectorization support + extremely high memory bandwidth w/ HBM2

    • Post-K could become the world's biggest & fastest platform for DNN training!

  • Deep Learning is "All about Scale": Massive Parallelization is the Key

    • Data-parallel training with (asynchronous) Stochastic Gradient Descent
      – Replicate the network to all the nodes, feed different data, average the gradients periodically
      – The network all-reduce of megabytes~gigabytes of gradients becomes the bottleneck at scale
      – NVIDIA: NVLink hardware + NCCL library (up to 8 GPUs on DGX-1, 16 on DGX-2 w/ NVSwitch)

    June 24, 2018, Jens Domke

    Fig. 2: Andrew Ng (Baidu), "What Data Scientists Should Know about Deep Learning"
    Fig. 3: Simplified DL workflow with ASGD, per iteration: 1. compute gradients; 2. exchange gradients via all-reduce; 3. update network parameters
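    A minimal sketch of the synchronous flavour of this data-parallel scheme, using mpi4py's Allreduce to average gradients (the reduction that NVLink/NCCL accelerate in hardware). The model and data are toys; this is illustrative, not a production training loop.

```python
# Data-parallel SGD sketch: every rank holds a full replica of the
# parameters, computes a gradient on its own shard of the batch, and the
# gradients are averaged with MPI_Allreduce before the update.
# Run with e.g.: mpirun -np 4 python this_script.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(seed=rank)          # each rank sees different data
w = np.zeros(10)                                 # replicated model parameters
lr = 0.1

for step in range(100):
    # Toy local gradient: least squares on this rank's random mini-batch.
    X = rng.standard_normal((32, 10))
    y = X @ np.arange(10.0) + 0.1 * rng.standard_normal(32)
    grad_local = X.T @ (X @ w - y) / len(y)

    # All-reduce: sum gradients across ranks, then divide by the number of
    # replicas. This reduction is the scaling bottleneck at large node counts.
    grad_global = np.empty_like(grad_local)
    comm.Allreduce(grad_local, grad_global, op=MPI.SUM)
    w -= lr * grad_global / size

if rank == 0:
    print("learned weights ~ 0..9:", np.round(w, 2))
```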

  • Example AI Research: Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers

    Background
    • In large-scale Asynchronous Stochastic Gradient Descent (ASGD), mini-batch size and gradient staleness tend to be large and unpredictable, which increases the error of the trained DNN.

    [Diagram: in the DNN parameter space, two asynchronous updates W(t) → W(t+1) → W(t+2) are applied while a worker is still computing its gradient -ηΣi∇Ei against the objective function E, so its update to W(t+3) has staleness = 2, whereas an immediately applied update has staleness = 0. Mini-batch size is governed by NSubbatch, the number of samples per GPU iteration.]

    Proposal
    • We propose an empirical performance model for the ASGD deep learning system SPRINT which considers the probability distributions of mini-batch size and staleness.
    [Plots: measured vs. predicted mini-batch size and staleness for 4, 8, and 16 nodes]

    • Yosuke Oyama, Akihiro Nomura, Ikuro Sato, Hiroki Nishimura, Yukimasa Tamatsu, and Satoshi Matsuoka, "Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers", in proceedings of 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington D.C., Dec. 5-8, 2016
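    The quantities that such a model predicts can be illustrated with a toy event-driven simulation of asynchronous updates; the sketch below, with arbitrary timing constants, only shows how a gradient-staleness distribution arises when workers push at different times. It is not the published SPRINT performance model.

```python
# Toy simulation of gradient staleness in asynchronous SGD: each worker
# repeatedly (1) pulls the current parameter version, (2) computes for a
# random time, (3) pushes its gradient, which is applied against a possibly
# newer version. Staleness = updates applied in between. Timings are arbitrary.
import heapq, random
from collections import Counter

def simulate_staleness(num_workers=16, num_updates=2000, seed=0):
    random.seed(seed)
    version = 0
    staleness = Counter()
    # Event queue of (finish_time, worker_id, version_seen_at_pull).
    events = [(random.uniform(0.9, 1.5), w, 0) for w in range(num_workers)]
    heapq.heapify(events)
    for _ in range(num_updates):
        t, w, seen = heapq.heappop(events)
        staleness[version - seen] += 1      # updates applied since this pull
        version += 1                         # apply this worker's gradient
        heapq.heappush(events, (t + random.uniform(0.9, 1.5), w, version))
    return staleness

hist = simulate_staleness()
for s in sorted(hist):
    print(f"staleness {s}: {hist[s]} updates")
```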

  • Interconnect Performance Is as Important as GPU Performance to Accelerate DL

    • ASGD DL system SPRINT (by DENSO IT Lab) and DL speedup prediction with the performance model
      – Data measured on TSUBAME2 and TSUBAME-KFC (both FDR InfiniBand) fitted to the formulas
      – Allreduce time (∈ TGPU) depends on #nodes and #DL_parameters
    • Other approaches show similar improvements:
      – CUDA-aware CNTK optimizes the communication pipeline: 15%–23% speedup (Banerjee et al., "Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters")
      – Reduced precision (FP[16|8|1]) to minimize message size with no or minor accuracy loss

    Fig. 4: Oyama et al., "Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers"

  • Fast and Cost-Effective Deep Learning Algorithm Platform for Video Processing in Social Infrastructure

    Principal Investigator: Koichi Shinoda
    Collaborators: Satoshi Matsuoka, Tsuyoshi Murata, Rio Yokota
    Tokyo Institute of Technology (members of RWBC-OIL 1-1 and 2-1)

    JST-CREST "Development and Integration of Artificial Intelligence Technologies for Innovation Acceleration"

  • Our Deep Learning Scaling Achievements in 2017

    Component          | Speed        | Memory
    Compute node       | x7.4 (x50)   | 1/15 (1/10)
    Parallelization    | x11.6* (x10) | 2* (1/10)
    Learning algorithm | x11.6* (x10) | 2* (1/10)
    Downsizing         |              | 1/90 (1/100)
    Total              | > x1000      | < 1/1000

    *: Achievement obtained by the joint work of the two groups

  • Deep Learning for Science: Fusion Energy Science [Prof. William Tang, Princeton Univ.]

    The most critical problem for fusion energy: avoid/mitigate large-scale major disruptions.
    • Approach: use big-data-driven machine-learning (ML) predictions of the occurrence of disruptions in the EUROfusion Joint European Torus (JET), DIII-D (US) & other tokamaks worldwide.
    • Recent status: first-principles simulation is not possible. ~8 years of R&D (led by JET) using Support Vector Machine (SVM) ML on zero-D time-trace data on CPU clusters yielded reported success rates in the mid-80% range for JET, 30 ms before disruptions, BUT > 95% with a false-alarm rate < 5% is actually needed for ITER (P. de Vries, et al., 2015).
    • Princeton team machine-learning goals: (i) improve physics fidelity via development of new ML multi-dimensional, time-dependent software including better classifiers; (ii) develop "portable" (cross-machine) predictive software beyond JET to other devices and eventually ITER; and (iii) enhance the accuracy & speed of disruption analysis for very large datasets via HPC development & deployment of advanced ML software via Deep Learning / AI neural networks (both convolutional & recurrent) in Princeton's Fusion Recurrent Neural Net (FRNN) code.

  • FRNN DL/AI Software Reliably Scales to 1K P100 GPUs on TSUBAME 3.0, with Associated Production Runs Contributing Strongly to Hyperparameter-Tuning-Enabled Physics Advances!

    Recent results: TSUBAME 3.0 supercomputer (Tokyo Tech)
    TSUBAME 3.0 "Grand Challenge" runs (A. Svyatkovskii, Princeton U.)
    – On the order of a thousand Tesla P100 SXM2 GPUs, 4 GPUs per node, NVLink
    – TensorFlow + MPI, CUDA 8, cuDNN 6, OpenMPI 2.1.1, GPUDirect

    Cross-machine prediction (train on DIII-D, test on JET):
    RNN 0D & RNN 1D   ~0.80
    XGBoost (shallow)  0.62
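    FRNN itself is Princeton's TensorFlow-based code; the sketch below is only a generic PyTorch illustration of the underlying idea — classifying multichannel 0-D time traces with a recurrent network — using synthetic data in place of tokamak diagnostics.

```python
# Minimal recurrent classifier over multichannel 0-D time traces
# (synthetic data; illustrative of the FRNN idea, not the FRNN code).
import torch
import torch.nn as nn

class TraceClassifier(nn.Module):
    def __init__(self, channels=4, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, time, channels)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # disruption score from the last step

# Synthetic "shots": disruptive ones get a growing ramp on channel 0.
torch.manual_seed(0)
n, T, C = 256, 100, 4
x = torch.randn(n, T, C)
labels = (torch.rand(n) < 0.5).float()
x[labels.bool(), :, 0] += torch.linspace(0, 3, T)   # precursor-like signal

model = TraceClassifier(C)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(model(x).squeeze(1), labels)
    loss.backward()
    opt.step()
print("final training loss:", float(loss))
```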

  • Overview of TSUBAME3.0
    BYTES-centric architecture, scalability to all 2160 GPUs, all nodes, the entire memory hierarchy

    Full-bisection-bandwidth Intel Omni-Path interconnect, 4 ports/node: full bisection / 432 Terabits/s bidirectional, ~x2 the bandwidth of the entire Internet backbone traffic

    DDN storage (Lustre FS 15.9 PB + Home 45 TB)

    540 compute nodes: SGI ICE XA + new blade, Intel Xeon CPU x2 + NVIDIA Pascal GPU x4 (NVLink), 256 GB memory, 2 TB Intel NVMe SSD
    47.2 AI-Petaflops, 12.1 Petaflops

    Full operations August 2017

  • TSUBAME3.0 Co-Designed SGI ICE-XA Blade (new)- No exterior cable mess (power, NW, water)- Plan to become a future HPE product

  • TSUBAME3: A Massively BYTES-Centric Architecture for Converged BD/AI and HPC

    Per node:
    - HBM2: 64 GB, 2.5 TB/s; intra-node GPU via NVLink, 20~40 GB/s
    - DDR4: 256 GB, 150 GB/s
    - Intel Optane: 1.5 TB, 12 GB/s (planned)
    - NVMe Flash: 2 TB, 3 GB/s
    - 16 GB/s PCIe, fully switched; inter-node GPU via Omni-Path, 12.5 GB/s, fully switched

    ~4 Terabytes/node of hierarchical memory for Big Data / AI (c.f. K computer: 16 GB/node). Over 2 Petabytes in TSUBAME3, which can be moved at 54 Terabytes/s, or 1.7 Zettabytes / year.

    Terabit-class network per node: 800 Gbps (400+400), full bisection. Any "big" data in the system can be moved anywhere at RDMA speeds, minimum 12.5 GBytes/s, also with stream processing. Scalable to all 2160 GPUs, not just 8.


  • [Photos: power meters; Tokyo Tech / HPE benchmarking team; award ceremony at ISC 2017 in Frankfurt]

    TSUBAME3.0 became the first large production petaflops-scale supercomputer in the world to be #1 on the "Green500" power-efficiency world ranking of supercomputers.

    14.1 Gigaflops/W is more than x10 more efficient than PCs and smartphones!

  • Precision of Floating-Point Arithmetic

    [Chart: site comparison of AI-FP performance (PFLOPS, 0-70) across Riken (K), U-Tokyo (Oakforest-PACS (JCAHPC), Reedbush (U&H)), and Tokyo Tech (TSUBAME3.0, TSUBAME2.5, TSUBAME-KFC), broken down into DFP 64-bit (simulation), SFP 32-bit (computer graphics, gaming, big data), and HFP 16-bit (machine learning / AI)]

    Tokyo Tech GSIC leads Japan in aggregate AI-capable FLOPS: 65.8 Petaflops with TSUBAME3 + 2.5 + KFC (~6700 GPUs + ~4000 CPUs), across all supercomputers and clouds.

    [Chart: NVIDIA Pascal P100 DGEMM performance, GFLOPS (0-16,000) vs. matrix dimension m=n=k (0-4500), for P100-fp16, P100, and K40]

  • Tremendous Recent Rise in Interest by the Japanese Government in Big Data, DL, AI, and IoT

    • Three national centers on Big Data and AI launched by three competing ministries for FY 2016 (Apr 2015-)
      • MEXT – AIP (Artificial Intelligence Platform): Riken and other institutions (~$50 million), April 2016
        • A separate Post-K-related AI funding as well
        • Focused on basic research, machine learning algorithms
      • METI – AIRC (Artificial Intelligence Research Center): AIST (AIST internal budget + > $200 million FY 2017), April 2015
        • Broad AI/BD/IoT, industry focus
      • MOST – Universal Communication Lab: NICT ($50~55 million)
        • Brain-related AI
    • $1 billion commitment to inter-ministry AI research over 10 years
    • All 3 centers now invest in or work with top-tier supercomputing

    [Photo: Vice Minister Tsuchiya @ MEXT announcing the AIP establishment]

  • The Current Status of AI & Big Data in Japan
    We need the triad of advanced algorithms / infrastructure / data, but we lack the cutting-edge infrastructure dedicated to AI & Big Data (c.f. HPC).

    • R&D of ML algorithms & SW: AI venture startups, big companies' AI/BD R&D (also science), and AI/BD centers & labs in national labs & universities (Riken-AIP, AIST-AIRC, NICT-UCRI, the joint RWBC Open Innovation Lab (OIL, Director: Matsuoka)); over $1B government AI investment over 10 years; all seeking innovative applications of AI & data.
    • AI & data infrastructures: massive rise in computing requirements (1 AI-PF/person?). In HPC, the cloud continues to be insufficient for cutting-edge research => dedicated supercomputers dominate & race to exascale.
    • "Big" data: IoT communication, location & other data; petabytes of drive-recording video; FA & robots; web access and merchandise data — massive-scale data now largely wasted, yet massive "big" data is needed in training.

  • METI AIST-AIRC ABCI: the World's First Large-Scale OPEN AI Infrastructure

    • ABCI: AI Bridging Cloud Infrastructure
    • Top-level SC compute & data capability for DNN (550 AI-Petaflops)
    • Open public & dedicated infrastructure for AI & Big Data algorithms, software and applications — OPEN-SOURCING the AI DATACENTER
    • Platform to accelerate joint academic-industry R&D for AI in Japan

    • > 550 AI-Petaflops
    • < 3 MW power
    • < 1.1 avg. PUE
    • Operational summer 2018
    Univ. Tokyo Kashiwa Campus

  • ABCI: AI Bridging Cloud Infrastructure

    • Open, public, and dedicated infrastructure for AI & Big Data algorithms, software, and applications
    • Open Innovation Platform to accelerate joint academic-industry R&D for AI; international collaborations are also welcome
    • Top-level compute capability: 0.55 EFLOPS (DL), 37 PFLOPS (DP)
    • Top-level energy efficiency: lower PUE
    • All designs and implementations will be open-sourced

    Open Innovation Platform for AI R&D: universities, research institutes, companies; manufacturing, autonomous cars, edge AI

  • ABCI Procurement Benchmarks

    • Big Data benchmarks
      – (SPEC CPU Rate)
      – Graph500
      – MinuteSort
      – Node-local storage I/O
      – Parallel FS I/O
    • AI/ML benchmarks
      – Low-precision GEMM
        • CNN kernel; defines "AI-Flops" (a toy measurement sketch follows this list)
      – Single-node CNN
        • AlexNet and GoogLeNet
        • ILSVRC2012 dataset
      – Multi-node scalable CNN
        • Caffe + MPI
      – Large-memory CNN
        • Convnet on Chainer
      – RNN / LSTM
        • Neural machine translation on Torch

    No traditional HPC simulation benchmarks except SPEC CPU. Plan on "open-sourcing".
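    The low-precision GEMM item is what defines "AI-Flops" above; the sketch below shows one naive way to measure FP16 GEMM throughput on a GPU with CuPy. The matrix size, iteration count, and methodology are placeholders, not the actual procurement benchmark rules.

```python
# Rough FP16 GEMM throughput measurement (illustrative only; not the ABCI
# procurement benchmark). Requires a CUDA GPU and CuPy.
import time
import cupy as cp

def fp16_gemm_tflops(n=8192, iters=10):
    a = cp.random.rand(n, n, dtype=cp.float32).astype(cp.float16)
    b = cp.random.rand(n, n, dtype=cp.float32).astype(cp.float16)
    cp.matmul(a, b)                        # warm-up
    cp.cuda.Device().synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        cp.matmul(a, b)
    cp.cuda.Device().synchronize()         # wait for all kernels to finish
    elapsed = time.perf_counter() - t0
    return 2 * n**3 * iters / elapsed / 1e12   # 2*n^3 flops per GEMM

print(f"~{fp16_gemm_tflops():.1f} TFLOPS sustained in FP16 GEMM")
```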

  • ABCI Computing Node

    FUJITSU PRIMERGY server (2 servers in 2U)
    CPU:           Intel Xeon Gold 6148 (27.5 MB cache, 2.40 GHz, 20 cores) x2
    GPU:           NVIDIA Tesla V100 (SXM2) x4
    Memory:        384 GiB
    Local storage: 1.6 TB NVMe SSD (Intel SSD DC P4600 U.2) x1
    Interconnect:  InfiniBand EDR x2

    [Node diagram: two Xeon Gold 6148 sockets linked by UPI x3 (10.4 GT/s x3), each with DDR4-2666 32 GB x6 at 128 GB/s; an NVMe SSD and two IB HCAs (100 Gbps each); each socket connects via PCIe gen3 x16 through a PCIe switch (x48 / x64) to two Tesla V100 SXM2 GPUs, which are interconnected by NVLink2 x2]

  • Software and Services

    ■ Software (natively installed)
    Operating system:  CentOS, RHEL
    Job scheduler:     Univa Grid Engine
    Container engine:  Docker, Singularity
    MPI:               OpenMPI, MVAPICH
    Development tools: Intel Parallel Studio XE Cluster Edition, PGI Professional Edition, Python, Ruby, R, Java, Scala, Perl
    Deep learning:     Caffe, Caffe2, TensorFlow, Theano, Torch, PyTorch, CNTK, MXNet, Chainer, Keras, etc.

    ■ Service types and container support (still under design)
    Service types: On-Demand (interactive use), Spot (batch use), Reserved (advance reservation), Campaign
    Resource allocation: Univa resource allocator
    Job bootstrap: Singularity, Univa + Docker, or direct run
    Applications: native / user-installed software, or system-provided / user-defined containers

  • ABCI Datacenter Overview

    • Single floor, inexpensive build

    • Hard concrete floor with 2 tons/m2 weight tolerance

    • Racks• 144 racks max.• ABCI uses 43 racks

    • Power capacity• 3.25 MW max.• ABCI uses 2.3MW max.

    • Water-Air Hybrid Cooling• Water Block: 60kW/Rack• Fan Coil Unit: 10kW/Rack• Total: 3.2MW min. (summer)

    Commoditizing supercomputer cooling technologies to Cloud (70kW/rack)

  • Comparing TSUBAME3/ABCI to a Classical IDC: AI IDC CAPEX/OPEX acceleration by > x100

    Traditional Xeon IDC: ~10 kW/rack, PUE 1.5~2, 15~20 1U Xeon servers, 2 Tera AI-FLOPS (SFP) / server, 30~40 Tera AI-FLOPS / rack

    TSUBAME3 (+Volta) & ABCI IDC: ~70 kW/rack, PUE 1.0x, ~36 T3-evolution servers, ~500 Tera AI-FLOPS (HFP) / server, ~17 Peta AI-FLOPS / rack

    Performance > 400~600x, power efficiency > 200~300x

  • Japan's Flagship 2020 "Post-K" Supercomputer

    [System diagram: compute nodes and interconnect, login servers, maintenance servers, portal servers, I/O network, hierarchical storage system]

    CPU
    • Many-core, Xeon-class ARM v8 cores + 512-bit SVE (Scalable Vector Extension)
    • Multi-hundred petaflops peak in total
    • Power-knob feature
    Memory
    • 3-D stacked DRAM, Terabyte/s BW per chip
    Interconnect
    • Tofu3 CPU-integrated 6-D torus network

    • I/O acceleration with massive SSDs
    • 30 MW+ power, liquid cooled
    • Riken co-design with Fujitsu
    • ? million cores in the system

    [Photo: Prime Minister Abe visiting the K Computer, 2013]

  • Post-K: The Game Changer

    1. Heritage of the K computer: high performance in simulation via extensive co-design
       • High performance: up to x100 the performance of K in real applications
       • Multitudes of scientific breakthroughs via Post-K application programs
       • Simultaneous high performance and ease of programming
    2. New technology innovations of Post-K
       • High performance, esp. via high memory BW: performance boost by "factors" c.f. mainstream CPUs in many HPC & Society 5.0 apps
       • Very green, e.g. extreme power efficiency: ultra-power-efficient design & various power-control knobs
       • Arm global ecosystem & SVE contribution: Arm ecosystem of 21 billion chips/year; SVE co-design and world's first implementation by Fujitsu, to become a global standard
       • High performance on Society 5.0 apps incl. AI: architectural features for high performance on Society 5.0 apps based on Big Data, AI/ML, CAE/EDA, blockchain security, etc.

    Arm: massive ecosystem from embedded to HPC. Global leadership not just in the machine & apps, but as cutting-edge IT; technology not limited to Post-K, but feeding into societal IT infrastructures, e.g. clouds.

    [Logo: the CPU for the Post-K supercomputer]

  • Post-K CPU New Innovations: Summary

    1. Ultra-high bandwidth using on-package memory & a matching CPU core
       - Recent studies show the majority of apps are memory bound; some are compute bound but can use lower precision, e.g. FP16
       - Comparison w/ mainstream CPUs: much faster FPU, almost an order of magnitude faster memory BW, and ultra-high performance accordingly
       - Memory controller to sustain massive on-package memory (OPM) BW — difficult for a coherent-memory CPU; the first CPU in the world to support OPM
    2. Very green, e.g. extreme power efficiency
       - Power-optimized design, clock gating & power knobs, efficient cooling
       - Power efficiency much better than CPUs, comparable to GPU systems
    3. Arm global ecosystem & SVE contribution
       - Annual processor production: x86 300-400 million, Arm 21 billion (2~3 billion high end)
       - Rapidly maturing HPC & IDC ecosystem (e.g. Cavium, HPE, Sandia, Bristol, ...)
       - SVE (Scalable Vector Extension) -> Arm-Fujitsu co-design, future global standard
    4. High performance on Society 5.0 apps including AI
       - Next-gen AI/ML requires massive speedup => high-performance chips + HPC-scale massive scalability across chips
       - Post-K processor: support for AI/ML acceleration, e.g. INT8/FP16 + fast memory for GPU-class convolution, fast interconnect for massive scaling
       - Top performance in AI as well as other Society 5.0 apps

  • Post-K Processor is… a Many-Core ARM CPU…
    - 48 compute cores + 2 or 4 assistant (OS) cores
    - Near Xeon-class performance per core
    - ARM v8 --- 64-bit ARM ecosystem
    - Tofu3 + PCIe 3 external connections

    …but also a GPU-like processor
    - SVE 512-bit vector extensions: integer (1, 2, 4, 8 bytes) + float (16, 32, 64 bits)
    - Cache + scratchpad local memory (sector cache)
    - Multi-stack 3D memory – TB/s-level memory BW, limited capacity
    - Various features for streaming memory access, strided access, etc.
    - Intra-chip barrier synchronization and other memory-enhancing features

    GPU-like high performance in HPC, AI/Big Data, blockchain… (2018/3/13)

  • Massive-Scale Deep Learning on Post-K

    Post-K processor:
    ◆ High-performance FP16 & INT8
    ◆ High memory BW for convolution
    ◆ Built-in scalable Tofu network
    => Unprecedented DL scalability

    High-performance DNN convolution: low-precision ALUs + high memory bandwidth + advanced combining of convolution algorithms (FFT + Winograd + GEMM).

    High-performance and ultra-scalable network for massive-scale model & data parallelism: Tofu network w/ high injection BW for fast reduction => unprecedented scalability of data/model parallelism.

    Speaker notes: K: medium performance but high-precision arithmetic. GPUs are expensive, and the networks that scale them to large systems are weak (at most 16 in the latest systems).

  • Selecting the Optimal Convolution Kernel

    NEW! Micro-batching: Tokyo Tech. and ETH [Oyama, Tan, Hoefler & Matsuoka]
    - Use the "micro-batch" technique to select the best convolution kernel: direct, GEMM, FFT, Winograd
    - Optimize both speed and memory size
    - On high-end GPUs, Winograd or FFT is in many cases chosen over GEMM: they are faster but use more memory
    - Currently implemented as a cuDNN wrapper, applicable to all frameworks (a toy selection sketch follows below)
    - For Post-K, (1) Winograd/FFT would be selected more often, and (2) performance will be similar to GPUs in such cases

    Evaluation: WD using integer LP
    - A desirable configuration set for AlexNet conv2 (forward), mini-batch size of 256, P100-SXM2
    - [Chart: each bar represents the proportion of micro-batch sizes and algorithms]

    Evaluation: WR using dynamic programming
    - μ-cuDNN achieved a 2.33x speedup on the forward convolution of AlexNet conv2
    - cudnnConvolutionForward of AlexNet conv2 on an NVIDIA Tesla P100-SXM2, workspace size of 64 MiB, mini-batch size of 256
    - [Chart: numbers on the rectangles represent micro-batch sizes]

    [Scatter plot: execution time (ms) vs. workspace size (MiB) for the cuDNN algorithms IMPLICIT_GEMM, IMPLICIT_PRECOMP_GEMM, GEMM, FFT, FFT_TILING, WINOGRAD_NONFUSED]
    [Bar chart: execution time (ms) of cudnnConvolutionForward for AlexNet conv2 under the "undivided", "powerOfTwo" and "all" micro-batching policies; the undivided mini-batch of 256 runs IMPLICIT_PRECOMP_GEMM, while the divided cases use micro-batches of 32 and 48 with FFT_TILING and WINOGRAD_NONFUSED, giving speedups of 2.09x and 2.33x]
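    The selection sketch referenced above: a toy dynamic program that splits a mini-batch into micro-batches and picks, per micro-batch, the fastest convolution algorithm whose workspace fits the limit. μ-cuDNN does this by querying cuDNN for real timings and workspace sizes; here the time/workspace table is fabricated purely for illustration.

```python
# Toy illustration of the micro-batching idea behind mu-cuDNN. The
# COST table is made up; the real tool measures cuDNN algorithms.
import functools

# COST[algo][micro_batch] = (time_ms, workspace_MiB) -- fabricated numbers.
COST = {
    "GEMM":     {32: (1.0, 8),  64: (1.9, 16), 128: (3.8, 32),  256: (7.5, 64)},
    "FFT":      {32: (0.8, 40), 64: (1.4, 80), 128: (2.6, 160), 256: (5.0, 320)},
    "WINOGRAD": {32: (0.6, 24), 64: (1.1, 48), 128: (2.1, 96),  256: (4.2, 192)},
}
MICRO_SIZES = sorted({mb for table in COST.values() for mb in table})

def fastest_algo(mb, limit):
    """Fastest algorithm for a micro-batch of size mb within the workspace limit."""
    options = []
    for algo, table in COST.items():
        if mb in table:
            time_ms, workspace = table[mb]
            if workspace <= limit:
                options.append((time_ms, algo))
    return min(options) if options else None

@functools.lru_cache(maxsize=None)
def plan(remaining, limit):
    """Dynamic program: minimum total time and split covering `remaining` samples."""
    if remaining == 0:
        return 0.0, ()
    best = None
    for mb in MICRO_SIZES:
        choice = fastest_algo(mb, limit)
        if mb > remaining or choice is None:
            continue
        sub_time, sub_split = plan(remaining - mb, limit)
        cand = (choice[0] + sub_time, ((mb, choice[1]),) + sub_split)
        if best is None or cand[0] < best[0]:
            best = cand
    return best

total, split = plan(256, 64)     # mini-batch 256, 64 MiB workspace limit
print(f"total {total:.1f} ms:", split)
# With these made-up numbers, four 64-sample Winograd micro-batches beat a
# single 256-sample GEMM call, mirroring the slide's observation that
# Winograd/FFT win when the workspace allows.
```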

  • Large-Scale Simulation and AI Coming Together [Ichimura et al., Univ. of Tokyo, IEEE/ACM SC17 Best Poster]

    130-billion-degree-of-freedom earthquake simulation of the entire Tokyo area on the K computer (ACM Gordon Bell Prize finalist; SC16 and SC17 Best Posters)

    [Figure: too many instances — earthquake, soft soil]

  • Cutting-Edge AI Research Infrastructures in Japan
    Accelerating BD/AI with HPC (w/ accompanying BYTES) — and my effort to design & build them

    Oct. 2015   TSUBAME-KFC/DL (Tokyo Tech./NEC)       1.4 AI-PF (Petaflops)                              In production
    Mar. 2017   AIST AI Cloud (AIST-AIRC/NEC)           8.2 AI-PF  (x5.8)                                  In production
    Mar. 2017   AI Supercomputer (Riken AIP/Fujitsu)    4.1 AI-PF                                          In production
    Aug. 2017   TSUBAME3.0 (Tokyo Tech./HPE)            47.2 AI-PF (x5.8); 65.8 AI-PF w/ TSUBAME2.5        In production
    Aug. 2018   ABCI (AIST-AIRC)                        550 AI-Petaflops (x11.7) — the AI-Exaflop era
    2020        Post-K                                  multi AI-Exaflops (x4~6?), an order of magnitude over ABCI   In preparation

    R&D investments into world-leading AI/BD HW & SW & algorithms and their co-design for cutting-edge infrastructure are absolutely necessary (just as with Japan's Post-K and the US ECP in HPC).
