Big Data/AI and HPC Convergence Towards Post-K
Satoshi Matsuoka, Director, RIKEN Center for Computational Science / Professor, Tokyo Institute of Technology
ACM HPDC 2018 Keynote Talk, June 14, 2018
Tokyo Tech TSUBAME Supercomputing History: World's Leading Supercomputers
×100,000 speedup in 17 years, developing world-leading use of massively parallel, many-core technology

• 2000: 128 Gigaflops custom supercomputer, 32 cores (Matsuoka's GSIC appointment)
• 2002: "TSUBAME0", 1.3 TeraFlops, the first "TeraScale" Japanese university supercomputer, 800 cores
• 2006: TSUBAME1.0, 80 TeraFlops: No.1 Asia, No.7 World, 10,000 cores
• 2008: TSUBAME1.2, 170 TeraFlops: world's first GPU supercomputer
• 2010: TSUBAME2.0, 2.4 Petaflops: world No.1 Production Green supercomputer; ACM Gordon Bell Prize
• 2013: TSUBAME2.5, upgraded with 4118 GPUs: 5.7 Petaflops, No.2 Japan; AI Flops 17.1 Petaflops
• 2013: TSUBAME-KFC, TSUBAME3 prototype with oil immersion cooling: Green World No.1
• 2015: AI prototype upgrade (KFC/DL)
• 2017: TSUBAME3.0, >10 million cores, 12.1 Petaflops (AI Flops 47.2 Petaflops): Green World No.1; HPC and Big Data / AI convergence
General-purpose CPUs & many-core processors (GPUs), advanced optical networks, non-volatile memory, efficient power control and cooling
JST-CREST "Extreme Big Data" Project (2013-2018)
• Supercomputers: compute- and batch-oriented; more fragile
• Cloud IDC: highly available and resilient, but very low bandwidth & efficiency
Convergent Architecture (Phases 1-4): large-capacity NVM, high-bisection network
[Diagram: PCB with a TSV interposer; a high-powered main CPU plus low-power CPUs, each with DRAM and NVM/Flash stacks; 2 Tbps HBM over 4-6 HBM channels; 1.5 TB/s DRAM & NVM bandwidth]
30 PB/s I/O bandwidth possible: 1 yottabyte / year
EBD System Software, incl. the EBD Object System, co-designed with future non-silo extreme big data scientific apps:
• Large-scale metagenomics
• Massive sensors and data assimilation in weather prediction
• Ultra-large-scale graphs and social infrastructures
Exascale Big Data HPC co-designed components: graph store, EBD Bag, EBD KVS (built from distributed key-value stores), Cartesian plane
Given a top-class supercomputer, how fast can we accelerate next-generation big data, c.f. clouds? Bring HPC rigor in architectural, algorithmic, and system software performance and modeling into big data.
Neglected Tropical Diseases (NTDs)
• Diseases prevalent in tropical areas due to a lack of drugs, whose development is impaired owing to poor economic conditions
• The World Health Organization (WHO) defines 17 NTDs
• More than 1 billion infections, ½ million deaths
• Some are making their way to developed countries due to global warming!
[Figures: Leishmaniasis disease (source: DNDi); Trypanosoma forms in a blood smear; insect vector; world dengue fever distribution (source: WHO)]
Japanese government promises African aid during the 2013 TICAD V meeting in Yokohama
Presenter notes: Neglected Tropical Diseases (NTDs) is the collective term for infectious diseases, prevalent mainly in tropical regions, for which the development of therapeutic drugs is insufficient because of patients' economic circumstances. The World Health Organization (WHO) currently defines 17 NTDs, including dengue fever and Chagas disease; more than 1 billion people worldwide are affected, and more than 500,000 die every year.
Japan has almost no domestic NTD patients. However, in the Yokohama Declaration of the 4th Tokyo International Conference on African Development held in Yokohama in 2008, Japan announced its support for NTD countermeasures in Africa, and at the 5th conference in 2013, Prime Minister Abe declared that this support would continue.
http://en.wikipedia.org/wiki/Blood_film
EBD vs. EBD: Large-Scale Homology Search for Metagenomics [Akiyama et al., Tokyo Tech]
Next-generation sequencers (with ever-increasing throughput) reveal taxonomic composition:
• Revealing uncultured microbiomes and finding novel genes in various environments
• Applied to human health in recent years
O(n) measured data × O(m) reference database -> O(m·n) calculation: correlation and similarity search
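The O(m·n) structure above can be sketched concretely. The toy scan below is our illustration, not the actual GHOSTZ/BLAST code; the k-mer Jaccard score is a hypothetical stand-in for a real alignment score.

```python
def kmer_set(seq, k=3):
    """All overlapping k-mers of a sequence (a toy similarity feature)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets: a cheap stand-in for an alignment score."""
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    return len(sa & sb) / len(sa | sb)

def search(reads, database, threshold=0.2):
    """Naive O(m*n) scan: every read against every reference entry.
    GHOSTZ-style tools avoid most of these comparisons by clustering
    similar database subsequences and pruning whole clusters at once."""
    hits = []
    for read in reads:                      # n measured sequences
        for name, ref in database.items():  # m reference sequences
            s = similarity(read, ref)
            if s >= threshold:
                hits.append((read, name, round(s, 3)))
    return hits

db = {"geneA": "ATGGCGTACGTT", "geneB": "TTTTTTTTTTTT"}
print(search(["ATGGCGTACG"], db))  # -> [('ATGGCGTACG', 'geneA', 0.889)]
```

Subsequence clustering, as in GHOSTZ, attacks exactly this m×n loop: database entries with similar subsequences are grouped so whole clusters can be pruned with a single representative comparison.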
EBD in practice: samples from various environments (human body, sea, soil)
• Metagenomic analysis of periodontitis patients, with Prof. Kazuyuki Ishihara, Tokyo Dental College
• Comparative metagenomic analysis between healthy persons and patients: high-risk microorganisms are detected and mapped onto metabolic pathways
Development of ultra-fast homology search tools: ×100,000 to ×1,000,000 c.f. a high-end BLAST workstation (in both FLOPS and BYTES)
[Chart: computational time for 10,000 sequences (sec., log scale from 1 to 1,000,000; 3.9 GB DB, 1 CPU core): BLAST vs. GHOSTZ]
Suzuki et al., Bioinformatics, 2015
GHOSTZ: subsequence clustering
GHOSTZ-GPU:
[Chart: speed-up ratio relative to 1 core for 1C, 1C+1G, 12C+1G, and 12C+3G configurations]
×70 faster than 1 core using 12 cores + 3 GPUs
Suzuki et al., PLOS ONE, 2016
GHOST-MP: MPI + OpenMP hybrid parallelization with multithreading on GPUs, retaining strong scaling up to 100,000 cores; Kakuta et al. (submitted)
[Chart: GHOST-MP vs. mpi-BLAST on TSUBAME 2.5 thin-node GPUs: ×80 to ×100 faster, and ×240 faster than the conventional algorithm]
Tokyo Tech IT-Drug Discovery MIDL: Simulation & Big Data & AI at Top HPC Scale (Tonomachi, Kawasaki City; planned 2017, PI Yutaka Akiyama)
Tokyo Tech's research seeds:
① Drug target selection system
② Glide-based virtual screening: TSUBAME's GPU environment allows world top-tier virtual screening (Yoshino et al., PLOS ONE, 2015; Chiba et al., Sci Rep, 2015)
③ Novel algorithms for fast virtual screening against huge databases: a fragment-based efficient algorithm designed for 100-million-compound datasets (Yanagisawa et al., GIW, 2016)
A new drug discovery platform, especially for specialty peptides and nucleic acids: plasma binding (ML-based) and membrane penetration (molecular dynamics simulation)
Minister of Health, Labour and Welfare Award of the 11th annual Merit Awards for Industry-Academia-Government Collaboration
Application projects: a drug discovery platform powered by supercomputing and machine learning; investments from the Japanese government, Tokyo Tech (TSUBAME SC), the municipal government of Kawasaki, and Japanese & US pharma
Multi-petaflops compute and peta- to exabytes of data, processed continuously: cutting-edge, large-scale HPC & BD/AI infrastructure is absolutely necessary
Presenter notes: Prof. Akiyama's research group in the School of Computing has been working on computational drug discovery. They developed an original system named "iNTRODB", an intelligent system for supporting drug target selection: it helps users find good drug target proteins. They applied this system in a drug discovery project for tropical diseases and, using iNTRODB, selected four promising target proteins (lower-left figure). They then used our supercomputer TSUBAME for virtual screening, and finally found several promising drug candidate compounds (lower-right figure).
Pioneering the "Big Data Assimilation" Era
EBD App 2: Takemasa Miyoshi Group (weather forecast application)
• Mutual feedback between high-precision simulations and high-precision observations: future-generation technologies available 10 years in advance
• Big data assimilation for severe-weather forecasting: a revolutionary super-rapid 30-second cycle, 120 times more rapid than hourly update cycles
• Goal: pinpoint (100-m resolution) forecasts of severe local weather by updating a 30-minute forecast every 30 seconds, computed in only 10 minutes!
EBD System Software (Matsuoka Group)
• Big data algorithms for accelerators (GPUs and FPGAs; low-level kernels for DNNs & graphs)
  – Fast and memory-saving SpGEMM on GPUs
  – Accelerating SpMV on GPUs by reducing memory access
  – OpenCL-based high-performance 3D stencil computation on FPGAs
  – Evaluating strategies to accelerate applications using FPGAs
  – Accelerating spiking neural networks on FPGAs
  – Directive-based temporal blocking
• Large-scale graph algorithms and sorting
  – No.1 on the Graph500 benchmark, 5 consecutive times (collab. w/ Kyushu U., Riken, etc.)
  – Distributed large-scale dynamic graph data store & large-scale graph colouring (vertex coloring)
  – Dynamic graph data structures using local NVRAM
  – Incremental graph community detection
  – ScaleGraph: a large-scale graph processing framework w/ a user-friendly interface
  – GPU-HykSort: large-scale sorting on massive numbers of GPUs
  – XtrSort: GPU out-of-core sorting
  – Efficient parallel sorting algorithm for variable-length keys
• Big data performance modeling and analysis
  – Co-locating HPC and big data analytics
  – Visualizing traffic of large-scale networks
  – I/O vs. MPI traffic interference on fat-tree networks
  – ibprof: a low-level profiler of MPI network traffic
  – Evaluation of HPC big data applications in clouds
  – Analysis of burst-buffer configurations
• High-performance big data programming middleware
  – mrCUDA: remote-to-local GPU migration middleware
  – A transpiler between Python and Fortran
  – Hamar (Highly Accelerated MapReduce): out-of-core GPU MapReduce for large-scale graph processing
  – DRAGON: extending UVM to NVMe
  – Hierarchical, UseR-level and ON-demand File system (HuronFS)
• Optimizing a traffic simulation app (ex-Suzumura Group)
  – Incremental graph community detection
  – DeepGraph
  – Exact-differential traffic simulation
[Chart: the size of graphs, number of nodes log2(n) vs. number of edges log2(m), spanning 1 billion to 1 trillion nodes and edges: USA road networks (USA-road-d.NY.gr, USA-road-d.LKS.gr, USA-road-d.USA.gr), Human Brain Project, Graph500 problem classes (Toy, Mini, Small, Medium, Large, Huge), symbolic networks, Twitter (tweets/day)]
K computer (65,536 nodes) on Graph500: 17,977 GTEPS
Mobile phone (SONY SO-01F, Snapdragon S4 1.7 GHz 4-core, 2 GB RAM): 1.03 GTEPS, 235.06 MTEPS/W
Sparse BYTES: the Graph500, 2015-2016, world #1 ×4 for the K computer. #1 team: Tokyo Tech [Matsuoka EBD CREST], Univ. Kyushu [Fujisawa Graph CREST], Riken AICS, Fujitsu

List | Rank | GTEPS | Implementation
November 2013 | 4 | 5524.12 | Top-down only
June 2014 | 1 | 17977.05 | Efficient hybrid
November 2014 | 2 | 19585.2 | Efficient hybrid
June & Nov 2015, June & Nov 2016 | 1 | 38621.4 | Hybrid + node compression

A BYTES-rich machine + a superior BYTES algorithm: K computer (88,000 nodes, 660,000 CPU cores, 1.3 petabytes memory, 20 GB/s Tofu network) ≫ LLNL-IBM Sequoia (1.6 million CPUs, 1.6 petabytes memory)
[Chart: elapsed time (ms) per BFS at 64 nodes (Scale 30) vs. 65,536 nodes (Scale 40): 73% of total execution time is spent waiting in communication]
TaihuLight: 10 million CPUs, 1.3 petabytes memory
Effective ×13 performance c.f. Linpack: BYTES, not FLOPS!
• #1: 38621.4 GTEPS (#7 on the Top500 at 10.51 PF)
• #2: 23755.7 GTEPS (#1 on the Top500 at 93.01 PF)
• #3: 23751 GTEPS (#4 on the Top500 at 17.17 PF)
K computer No.1 on Graph500: 5 consecutive times
• What is the Graph500 benchmark? A supercomputer benchmark for data-intensive applications: it ranks supercomputers by the performance of breadth-first search on very large graph data
[Chart: performance (GTEPS, 0-45,000) from June 2012 to June 2016 for the K computer (Japan), Sequoia (U.S.A.), and Sunway TaihuLight (China); the K computer is No.1]
This is achieved by a combination of high machine performance and our software optimizations:
• Efficient sparse matrix representation with bitmaps
• Vertex reordering for bitmap optimization
• Optimized inter-node communications
• Load balancing, etc.
• Koji Ueno, Toyotaro Suzumura, Naoya Maruyama, Katsuki Fujisawa, and Satoshi Matsuoka, "Efficient Breadth-First Search on Massively Parallel and Distributed Memory Machines", in proceedings of 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington D.C., Dec. 5-8, 2016
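The "efficient hybrid" entries in the table refer to direction-optimizing BFS, which switches between top-down frontier expansion and bottom-up parent search depending on frontier size. Below is a single-node toy sketch of that idea (an assumed simplification; the K computer implementation adds bitmap representations, vertex reordering, and distributed communication as listed above):

```python
def hybrid_bfs(adj, source, alpha=4):
    """Direction-optimizing BFS on an adjacency list (vertices 0..n-1).

    Top-down: expand every frontier vertex's edges.
    Bottom-up: every unvisited vertex scans its neighbors for a frontier
    parent; cheaper when the frontier is large. `alpha` is a heuristic
    switching threshold."""
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = {source}
    while frontier:
        frontier_edges = sum(len(adj[v]) for v in frontier)
        unvisited = [v for v in range(n) if parent[v] == -1]
        if frontier_edges * alpha > sum(len(adj[v]) for v in unvisited):
            # Bottom-up step: each unvisited vertex looks for a parent.
            nxt = set()
            for v in unvisited:
                for u in adj[v]:
                    if u in frontier:
                        parent[v] = u
                        nxt.add(v)
                        break
        else:
            # Top-down step: expand the frontier.
            nxt = set()
            for u in frontier:
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        nxt.add(v)
        frontier = nxt
    return parent

# Small undirected example: edges 0-1, 0-2, 1-3, 2-3.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(hybrid_bfs(adj, 0))  # -> [0, 0, 0, 1]
```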
Distributed Large-Scale Dynamic Graph Data Store
Keita Iwabuchi (1, 2), Scott Sallinen (3), Roger Pearce (2), Brian Van Essen (2), Maya Gokhale (2), Satoshi Matsuoka (1)
1. Tokyo Institute of Technology (Tokyo Tech); 2. Lawrence Livermore National Laboratory (LLNL); 3. University of British Columbia

Dynamic graphs (temporal graphs)
• The structure of the graph changes dynamically over time; many real-world graphs fall into this class
• Most studies of large graphs have focused not on dynamic graph data structures but on static ones, as in Graph500
• Even with the large memory capacities of HPC systems, many graph applications require additional out-of-core memory (this part is still at an early stage)

Sparse, large, scale-free graphs: social networks, genome analysis, the WWW, etc.; e.g., Facebook managed 1.39 billion active users as of 2014, with more than 400 billion edges
Source: Jakob Enemark and Kim Sneppen, "Gene duplication models for directed networks with limits on growth", Journal of Statistical Mechanics: Theory and Experiment, 2007

Key questions:
1. How to store dynamic graphs in local memory (high-speed graph update and lookup)?
2. How to extend to distributed-memory platforms (efficient communication for real-time processing)?

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-PRES-726942).
DegAwareRHH
1. Leverages a linear-probing hash table to increase sequential locality while keeping dynamic-graph update performance
2. Adopts an asynchronous communication framework to localize communication

Performance comparison against state-of-the-art work (STINGER, PowerGraph):
[Chart: dynamic graph colouring (in-core) [SC'16], edge inserts/s (millions) on 1-64 compute nodes]
[Chart: speed-up of dynamic graph construction at 6, 12, and 24 parallels (threads/processes): up to 2.1× and up to 200× against STINGER (single-node, in-core) and PowerGraph (static processing only)]
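The "RHH" in DegAwareRHH stands for Robin Hood hashing, a linear-probing scheme. Below is a minimal single-node sketch of the idea (our simplification, not the actual DegAwareRHH code): on collision, the incoming entry steals the slot from any resident entry that is closer to its home bucket, so probe lengths stay short and scans stay sequential and cache-friendly, which is what enables fast dynamic edge insertion.

```python
class RobinHoodHash:
    """Toy open-addressing hash table with Robin Hood probing.
    Each slot holds (key, value, distance-from-home). No resizing:
    purely illustrative, so keep the load factor low."""

    def __init__(self, capacity=16):
        self.slots = [None] * capacity

    def _home(self, key):
        return hash(key) % len(self.slots)

    def insert(self, key, value):
        idx, entry = self._home(key), (key, value, 0)
        while True:
            slot = self.slots[idx]
            if slot is None:
                self.slots[idx] = entry
                return
            if slot[0] == entry[0]:            # key exists: update in place
                self.slots[idx] = (slot[0], entry[1], slot[2])
                return
            if slot[2] < entry[2]:             # resident is "richer": swap
                self.slots[idx], entry = entry, slot
            idx = (idx + 1) % len(self.slots)
            entry = (entry[0], entry[1], entry[2] + 1)

    def get(self, key):
        idx, dist = self._home(key), 0
        while True:
            slot = self.slots[idx]
            if slot is None or slot[2] < dist:  # Robin Hood invariant: absent
                return None
            if slot[0] == key:
                return slot[1]
            idx = (idx + 1) % len(self.slots)
            dist += 1

# Usage: store an edge record keyed by its endpoints.
table = RobinHoodHash()
table.insert(("u", "v"), {"weight": 1.0})
print(table.get(("u", "v")))  # -> {'weight': 1.0}
```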
Large-Scale Graph Processing Framework w/ a User-Friendly Interface
• ScaleGraph: an X10-based, open-source, highly scalable large-scale graph analytics library, beyond the scale of billions of vertices and edges on distributed systems
• XPregel: a Pregel-based bulk-synchronous-parallel graph processing framework
• Built-in graph algorithms (centrality, connected components, clustering, etc.)
• Python interface: allows users to use ScaleGraph with Spark* via a simple Python interface

Software stack: the user program and graph algorithms run on XPregel (the graph processing system) and the ScaleGraph base library (sparse matrix, BLAS, file I/O), built on the X10 standard library and MPI with third-party libraries (ARPACK, METIS) by the X10 & C++ team; on the cluster side, a user Python script ties together Spark (RDD), HDFS, and ScaleGraph.
*Apache Spark: http://spark.apache.org/
Modern AI is enabled by supercomputing
• 25 years of AI winter after the failure of symbolic-logic-based methods (e.g., Prolog, ICOT) -> resurrection by DNNs: the basic algorithms existed in the 1980s but were too expensive -> HPC made machines 10 million times faster in 30 years -> expensive training is now possible
• Recent trends require even more supercomputing power:
  – Deeper, more complex networks (capsule networks)
  – Complex, multidimensional data (e.g., 3-D hi-res images)
  – Increasing training sets (incl. GANs)
  – Coupling with high-fidelity simulations
  – Etc.
Fig. 2: Andrew Ng (Baidu), "What Data Scientists Should Know about Deep Learning"
4 Layers of Parallelism in DNN Training, all well supported in Post-K
• Hyperparameter search: searching for optimal network configurations & parameters; parallel search, massive parallelism required
• Data parallelism: copy the network to compute nodes, feed different batch data, average => network-reduction bound; TOFU: extremely strong reduction, ×6 EDR InfiniBand
• Model parallelism (domain decomposition): split and parallelize the layer calculations in propagation; low latency required (bad for GPUs) -> strongly latency-tolerant cores + the low-latency TOFU network
• Intra-chip ILP, vector, and other low-level parallelism: parallelize the convolution operations etc.; SVE FP16+INT8 vectorization support + extremely high memory bandwidth w/ HBM2
• Post-K could become the world's biggest & fastest platform for DNN training!
Deep Learning is "All about Scale": massive parallelization is the key
• Data-parallel training with (asynchronous) stochastic gradient descent
  – Replicate the network to all the nodes, feed different data, average the gradients periodically
  – The network all-reduce (a reduction of megabytes to gigabytes of gradients) becomes the bottleneck at scale
  – NVIDIA: NVLink hardware + the NCCL library (up to 8 GPUs on DGX-1, 16 on DGX-2 w/ the NVLink Switch)
June 24, 2018, Jens Domke
Fig. 3: Simplified DL workflow with ASGD, per iteration: 1. compute gradients; 2. exchange gradients via all-reduce; 3. update network parameters
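The three-step iteration of Fig. 3 can be sketched in a toy NumPy simulation of data-parallel SGD, with the all-reduce modeled as an in-memory average over replicas (illustrative only; a real system would call MPI_Allreduce or NCCL's all-reduce, and the quadratic loss here is a made-up stand-in for a DNN loss):

```python
import numpy as np

def allreduce_mean(grads):
    """Stand-in for an MPI/NCCL all-reduce: average gradients across replicas."""
    return sum(grads) / len(grads)

rng = np.random.default_rng(0)
n_replicas, dim, lr = 4, 8, 0.1
w = rng.normal(size=dim)                       # replicated model parameters
targets = rng.normal(size=(n_replicas, dim))   # each replica's local data

for step in range(100):
    # 1. Each replica computes its local gradient (toy quadratic loss).
    local_grads = [w - t for t in targets]
    # 2. Exchange gradients via "all-reduce" (average).
    g = allreduce_mean(local_grads)
    # 3. Update the (replicated) network parameters.
    w = w - lr * g

# The replicated parameters converge to the mean of the per-replica targets.
print(np.allclose(w, targets.mean(axis=0), atol=1e-3))  # -> True
```

The averaged gradient is exactly the gradient of the sum of the replicas' losses, which is why periodic averaging reproduces large-batch SGD, and why the all-reduce volume (one message per parameter tensor) becomes the bottleneck at scale.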
Example AI Research: Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers
Background
• In large-scale asynchronous stochastic gradient descent (ASGD), mini-batch size and gradient staleness tend to be large and unpredictable, which increases the error of the trained DNN
[Diagram: in the DNN parameter space with objective function E, a gradient -ηΣi∇Ei computed at W(t) may only be applied at W(t+3) because two asynchronous updates occur during its gradient computation: staleness = 2, versus staleness = 0 for an immediate update; mini-batch size (NSubbatch: number of samples per GPU iteration)]
Proposal
• We propose an empirical performance model for the ASGD deep learning system SPRINT that considers the probability distributions of mini-batch size and staleness
[Charts: measured vs. predicted mini-batch size and staleness for 4, 8, and 16 nodes]
• Yosuke Oyama, Akihiro Nomura, Ikuro Sato, Hiroki Nishimura, Yukimasa Tamatsu, and Satoshi Matsuoka, "Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers", in proceedings of 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington D.C., Dec. 5-8, 2016
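To make "staleness" concrete, here is a toy event-driven simulation (our illustration, not the SPRINT performance model): each worker reads the current parameter version, computes a gradient for a random duration, then applies it; the gradient's staleness is how many other updates landed in between.

```python
import heapq
import random

def simulate_asgd_staleness(n_workers=4, n_updates=1000, seed=1):
    """Toy ASGD: workers repeatedly (a) read the current parameter version,
    (b) compute a gradient for an exponentially distributed time, and
    (c) apply it. Staleness = global updates applied while computing."""
    random.seed(seed)
    version = 0
    events = []  # (finish_time, worker_id, version_read_at_start)
    for w in range(n_workers):
        heapq.heappush(events, (random.expovariate(1.0), w, version))
    staleness = []
    for _ in range(n_updates):
        t, w, v_read = heapq.heappop(events)
        staleness.append(version - v_read)  # updates since this gradient began
        version += 1                        # apply the (possibly stale) gradient
        heapq.heappush(events, (t + random.expovariate(1.0), w, version))
    return staleness

s = simulate_asgd_staleness()
# With i.i.d. compute times, mean staleness hovers near n_workers - 1,
# but the distribution has a long tail: exactly the unpredictability
# the SPRINT model tries to capture statistically.
print(sum(s) / len(s))
```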
Interconnect performance is as important as GPU performance in accelerating DL
• The ASGD DL system SPRINT (by DENSO IT Lab) and DL speedup prediction with the performance model
  – Data measured on TSUBAME2 and TSUBAME-KFC (both FDR InfiniBand) and fitted to formulas
  – All-reduce time (∈ TGPU) depends on #nodes and #DL_parameters
• Other approaches yield similar improvements:
  – CUDA-aware CNTK optimizes the communication pipeline: 15%-23% speedup (Banerjee et al., "Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters")
  – Reduced precision (FP[16|8|1]) to minimize message size with no or minor accuracy loss
Fig. 4: Oyama et al., "Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers"
Fast and cost-effective deep learning algorithm platform for video processing in social infrastructure
Principal investigator: Koichi Shinoda. Collaborators: Satoshi Matsuoka, Tsuyoshi Murata, Rio Yokota, Tokyo Institute of Technology (members of RWBC-OIL 1-1 and 2-1)
JST-CREST "Development and Integration of Artificial Intelligence Technologies for Innovation Acceleration"

Our deep learning scaling achievements in 2017:

Component          | Speed        | Memory
Compute node       | ×7.4 (×50)   | 1/15 (1/10)
Parallelization    | ×11.6* (×10) | 2* (1/10)
Learning algorithm | ×11.6* (×10) | 2* (1/10)
Downsizing         |              | 1/90 (1/100)
Total              | > ×1000      | < 1/1000

*: achievement obtained by the joint work of the two groups
Deep Learning for Science: Fusion Energy Science [Prof. William Tang, Princeton Univ.]
• Most critical problem for fusion energy: avoid/mitigate large-scale major disruptions
• Approach: use big-data-driven machine learning (ML) to predict the occurrence of disruptions in the EUROfusion Joint European Torus (JET), DIII-D (US), and other tokamaks worldwide
• Recent status: first-principles simulation is not possible. ~8 years of R&D (led by JET) using support vector machine (SVM) ML on zero-D time-trace data on CPU clusters yielded reported success rates in the mid-80% range for JET 30 ms before disruptions, BUT >95% with a false-alarm rate <5% is actually needed for ITER (P. de Vries et al., 2015)
• Princeton team machine learning goals: (i) improve physics fidelity via development of new multi-D, time-dependent ML software, including better classifiers; (ii) develop "portable" (cross-machine) predictive software beyond JET to other devices and eventually ITER; and (iii) enhance the accuracy & speed of disruption analysis for very large datasets via HPC development & deployment of advanced ML software using deep learning / AI neural networks (both convolutional & recurrent) in Princeton's Fusion Recurrent Neural Net (FRNN) code
• Recent results on the TSUBAME 3.0 supercomputer (Tokyo Tech): the FRNN DL/AI software reliably scales to 1K P100 GPUs, with associated production runs contributing strongly to hyperparameter-tuning-enabled physics advances!
• TSUBAME 3.0 "Grand Challenge Runs" (A. Svyatkovskiy, Princeton U.): on the order of a thousand Tesla P100 SXM2 GPUs, 4 GPUs per node, NVLink; TensorFlow+MPI, CUDA 8, cuDNN 6, OpenMPI 2.1.1, GPU Direct
• Cross-machine prediction (train on DIII-D, test on JET): RNN 0D & RNN 1D ~0.80; XGBoost (shallow) 0.62
Overview of TSUBAME3.0: a BYTES-centric architecture, with scalability to all 2160 GPUs, all nodes, and the entire memory hierarchy
• Full-bisection-bandwidth Intel Omni-Path interconnect, 4 ports/node: full bisection at 432 Terabits/s bidirectional, ~×2 the bandwidth of the entire Internet backbone traffic
• DDN storage (Lustre FS, 15.9 PB + Home 45 TB)
• 540 compute nodes, SGI ICE XA + new blade: Intel Xeon CPU ×2 + NVIDIA Pascal GPU ×4 (NVLink), 256 GB memory, 2 TB Intel NVMe SSD; 47.2 AI-Petaflops, 12.1 Petaflops
• Full operations August 2017
• TSUBAME3.0 co-designed SGI ICE-XA blade (new): no exterior cable mess (power, network, water); planned to become a future HPE product
TSUBAME3: A Massively BYTES-Centric Architecture for Converged BD/AI and HPC
• Intra-node GPU-to-GPU via NVLink: 20-40 GB/s
• Inter-node GPU-to-GPU via Omni-Path: 12.5 GB/s, fully switched
• HBM2: 64 GB, 2.5 TB/s
• DDR4: 256 GB, 150 GB/s
• Intel Optane: 1.5 TB, 12 GB/s (planned)
• NVMe flash: 2 TB, 3 GB/s
• 16 GB/s PCIe, fully switched
• ~4 terabytes/node of hierarchical memory for big data / AI (c.f. the K computer: 16 GB/node); over 2 petabytes in TSUBAME3 overall, which can be moved at 54 terabytes/s, or 1.7 zettabytes/year
• Terabit-class network per node: 800 Gbps (400+400), full bisection
• Any "big" data in the system can be moved anywhere at RDMA speeds (minimum 12.5 GB/s), also with stream processing
• Scalable to all 2160 GPUs, not just 8
Green500 #1: award ceremony at ISC 2017 in Frankfurt, with the Tokyo Tech / HPE benchmarking team and the power meters used for measurement
TSUBAME3.0 became the first large production petaflops-scale supercomputer in the world to be #1 on the "Green500" power-efficiency world ranking of supercomputers
14.1 Gigaflops/W is more than ×10 more efficient than PCs and smartphones!
[Chart: site comparisons of AI-FP performance (PFLOPS, 0-70) for Riken (K computer), U-Tokyo (Oakforest-PACS (JCAHPC), Reedbush (U&H)), and Tokyo Tech (TSUBAME3.0, TSUBAME2.5, TSUBAME-KFC), broken down into DFP 64-bit, SFP 32-bit, and HFP 16-bit: 64-bit serves simulation; 32-bit computer graphics and gaming; 16-bit big data and machine learning / AI]
Tokyo Tech GSIC leads Japan in aggregated AI-capable FLOPS: 65.8 Petaflops with TSUBAME3 + 2.5 + KFC, versus all supercomputers and clouds; ~6700 GPUs + ~4000 CPUs
[Chart: NVIDIA Pascal P100 GEMM performance, GFLOPS (0-16,000) vs. matrix dimension (m=n=k, 0-4500), for P100-fp16, P100, and K40]
Tremendous Recent Rise in Interest by the Japanese Government in Big Data, DL, AI, and IoT
• Three national centers on Big Data and AI launched by three competing ministries for FY 2016 (April 2015-):
  – MEXT: AIP (Artificial Intelligence Platform) at Riken and other institutions (~$50 million), April 2016; a separate Post-K-related AI funding as well; focused on basic research and machine learning algorithms
  – METI: AIRC (Artificial Intelligence Research Center) at AIST (AIST internal budget + >$200 million in FY 2017), April 2015; broad AI/BD/IoT, industry focus
  – MOST: Universal Communication Lab at NICT ($50-55 million); brain-related AI
• $1 billion commitment to inter-ministry AI research over 10 years
• All 3 centers now invest in or work with top-tier supercomputing
[Photo: Vice Minister Tsuchiya (MEXT) announcing the establishment of AIP]
The current status of AI & Big Data in Japan
We need the triad of advanced algorithms, infrastructure, and data, but we lack cutting-edge infrastructure dedicated to AI & Big Data (c.f. HPC).
• R&D of ML algorithms & software: AI/BD centers & labs in national labs & universities (Riken-AIP, AIST-AIRC, NICT-UCRI, and the joint RWBC Open Innovation Lab (OIL), director: Matsuoka), backed by over $1B of government AI investment over 10 years; AI venture startups; big companies' AI/BD R&D (also science)
• AI & data infrastructures: massive rise in computing requirements (1 AI-PF/person?); as in HPC, the cloud continues to be insufficient for cutting-edge research => dedicated supercomputers dominate & race to exascale
• "Big" data: IoT communication, location & other data; petabytes of drive-recorder video; FA & robots; web access and merchandise data; massive "big" data in training. Use of massive-scale data is now wasted while innovative applications of AI & data are being sought.
METI AIST-AIRC ABCI as the world's first large-scale OPEN AI infrastructure
• ABCI: AI Bridging Cloud Infrastructure, at the Univ. of Tokyo Kashiwa Campus; operational summer 2018
• >550 AI-Petaflops, <3 MW power, <1.1 average PUE
• Top-level SC compute & data capability for DNN (550 AI-Petaflops)
• Open public & dedicated infrastructure for AI & big data algorithms, software, and applications: OPEN-SOURCING the AI DATACENTER
• Platform to accelerate joint academic-industry R&D for AI in Japan
ABCI: AI Bridging Cloud Infrastructure
• Open, public, and dedicated infrastructure for AI & big data algorithms, software, and applications
• Open innovation platform to accelerate joint academic-industry R&D for AI (universities, research institutes, and companies; manufacturing, autonomous cars, edge AI); international collaborations are also welcome
• Top-level compute capability: 0.55 EFLOPS (DL), 37 PFLOPS (DP)
• Top-level energy efficiency: low PUE
• All designs and implementations will be open-sourced
ABCI Procurement Benchmarks
• Big data benchmarks:
  – (SPEC CPU Rate)
  – Graph500
  – MinuteSort
  – Node-local storage I/O
  – Parallel FS I/O
• AI/ML benchmarks:
  – Low-precision GEMM (the CNN kernel; defines "AI-FLOPS")
  – Single-node CNN: AlexNet and GoogLeNet on the ILSVRC2012 dataset
  – Multi-node scalable CNN: Caffe+MPI
  – Large-memory CNN: ConvNet on Chainer
  – RNN / LSTM: neural machine translation on Torch
No traditional HPC simulation benchmarks except SPEC CPU. Plan to "open-source" the benchmarks.
ABCI Computing Node
FUJITSU PRIMERGY server (2 servers in 2U)
• CPU: Intel Xeon Gold 6148 (27.5 MB cache, 2.40 GHz, 20 cores) ×2
• GPU: NVIDIA Tesla V100 (SXM2) ×4
• Memory: 384 GiB (DDR4-2666 32 GB ×6 per socket, 128 GB/s each)
• Local storage: 1.6 TB NVMe SSD (Intel SSD DC P4600 U.2) ×1
• Interconnect: InfiniBand EDR ×2 (one 100 Gbps IB HCA per PCIe switch)
[Diagram: two Xeon Gold 6148 sockets linked by UPI ×3 (10.4 GT/s); each socket attaches via PCIe gen3 x16 to a PCIe switch (x48 and x64) that fans out to two Tesla V100 SXM2 GPUs and an IB HCA, with the NVMe SSD on one side; the four GPUs are interconnected with NVLink2 ×2]
Software and Services
• Operating system: CentOS, RHEL
• Job scheduler: Univa Grid Engine
• Container engine: Docker, Singularity
• MPI: OpenMPI, MVAPICH
• Development tools: Intel Parallel Studio XE Cluster Edition, PGI Professional Edition, Python, Ruby, R, Java, Scala, Perl
• Deep learning: Caffe, Caffe2, TensorFlow, Theano, Torch, PyTorch, CNTK, MXNet, Chainer, Keras, etc.

Service types and container support (still under design):
• Resource allocation via the Univa resource allocator; job bootstrap via Univa, Docker, or Singularity; applications run as natively installed / user-installed software or as system-provided / user-defined containers
• Service types: On-Demand (interactive use), Spot (batch use), Reserved (advance reservation, direct run)
ABCI Datacenter Overview
• Single floor, inexpensive build: hard concrete floor with 2 tons/m² weight tolerance
• Racks: 144 max.; ABCI uses 43
• Power capacity: 3.25 MW max.; ABCI uses 2.3 MW max.
• Water-air hybrid cooling: water block 60 kW/rack, fan coil unit 10 kW/rack; total 3.2 MW min. (summer)
Commoditizing supercomputer cooling technologies for the cloud (70 kW/rack)
Comparing TSUBAME3/ABCI to a Classical IDC: AI IDC CAPEX/OPEX acceleration by >×100
• Traditional Xeon IDC: ~10 kW/rack, PUE 1.5-2; 15-20 1U Xeon servers per rack; 2 Tera AI-FLOPS (SFP) per server, 30-40 Tera AI-FLOPS per rack
• TSUBAME3 (+Volta) & ABCI IDC: ~70 kW/rack, PUE 1.0x; ~36 T3-evolution servers per rack; ~500 Tera AI-FLOPS (HFP) per server, ~17 Peta AI-FLOPS per rack
• Performance > 400-600×; power efficiency > 200-300×
Japan's Flagship 2020 "Post-K" Supercomputer
[Diagram: compute nodes and interconnect; login, maintenance, and portal servers; I/O network to a hierarchical storage system]
• CPU: many-core, Xeon-class Arm v8 cores + 512-bit SVE (Scalable Vector Extension); multi-hundred petaflops peak in total; power-knob feature
• Memory: 3-D stacked DRAM, terabyte/s bandwidth per chip
• Interconnect: TOFU3 CPU-integrated 6-D torus network
• I/O acceleration with massive SSDs
• 30 MW+ power, liquid cooled
• Riken co-design with Fujitsu; ? million cores in the system
[Photo: Prime Minister Abe visiting the K computer, 2013]
Post-K: The Game Changer
1. Heritage of the K computer: high performance in simulation via extensive co-design
• High performance: up to ×100 the performance of K in real applications
• Multitudes of scientific breakthroughs via Post-K application programs
• Simultaneous high performance and ease of programming
2. New technology innovations of Post-K
• High performance, especially via high memory bandwidth: performance boost by "factors" c.f. mainstream CPUs in many HPC & Society 5.0 apps
• Very green, i.e., extreme power efficiency: ultra power-efficient design & various power-control knobs
• Arm global ecosystem & SVE contribution: an ecosystem of 21 billion chips/year; SVE co-designed with, and first implemented in the world by, Fujitsu, to become a global standard
• High performance on Society 5.0 apps, incl. AI: architectural features for high performance on Society 5.0 apps based on big data, AI/ML, CAE/EDA, blockchain security, etc.
Arm: a massive ecosystem from embedded to HPC. Global leadership not just in the machine & apps, but as cutting-edge IT; the technology is not limited to Post-K, but flows into societal IT infrastructures, e.g., clouds.
[Logo: the CPU for the Post-K supercomputer]
Post-K CPU New Innovations: Summary
1. Ultra-high bandwidth using on-package memory & a matching CPU core
• Recent studies show that the majority of apps are memory bound; some are compute bound but can use lower precision, e.g., FP16
• Comparison w/ mainstream CPUs: much faster FPU, almost an order of magnitude faster memory bandwidth, and ultra-high performance accordingly
• A memory controller that sustains massive on-package memory (OPM) bandwidth is difficult for a coherent-memory CPU; this is the first CPU in the world to support OPM
2. Very green, i.e., extreme power efficiency
• Power-optimized design, clock gating & power knobs, efficient cooling
• Power efficiency much better than CPUs, comparable to GPU systems
3. Arm global ecosystem & SVE contribution
• Annual processor production: x86 300-400 million, Arm 21 billion (2-3 billion high end)
• Rapid build-up of the HPC & IDC ecosystem (e.g., Cavium, HPE, Sandia, Bristol, ...)
• SVE (Scalable Vector Extension) -> Arm-Fujitsu co-design, future global standard
4. High performance on Society 5.0 apps including AI
• Next-gen AI/ML requires massive speedup => high-performance chips + HPC-style massive scalability across chips
• Post-K processor: support for AI/ML acceleration, e.g., INT8/FP16 + fast memory for GPU-class convolution, and a fast interconnect for massive scaling
• Top performance in AI as well as other Society 5.0 apps
The Post-K processor is a many-core Arm CPU...
• 48 compute cores + 2 or 4 assistant (OS) cores
• Near-Xeon-class performance per core
• Arm v8: the 64-bit Arm ecosystem
• Tofu 3 + PCIe 3 external connections
...but also a GPU-like processor
• SVE 512-bit vector extensions: integer (1, 2, 4, 8 bytes) + floating point (16, 32, 64 bits)
• Cache + scratchpad local memory (sector cache)
• Multi-stack 3D memory: TB/s-level memory bandwidth, limited capacity; various features for streaming memory access, strided access, etc.
• Intra-chip barrier synchronization and other memory-enhancing features
GPU-like high performance in HPC, AI / big data, blockchain... (2018/3/13)
46
Massive Scale Deep Learning on Post-K

Post-K Processor
- High-performance FP16 & Int8
- High memory BW for convolution
- Built-in scalable Tofu network
=> Unprecedented DL scalability

High-Performance DNN Convolution
- Low-precision ALUs + high memory bandwidth + advanced combining of convolution algorithms (FFT + Winograd + GEMM)

High-Performance and Ultra-Scalable Network
- For massively scaling model & data parallelism
=> Unprecedented scalability of data/model parallelism

[Figure: four Post-K CPUs connected by the Tofu network, with high injection BW for fast reduction]

Presenter note (translated from Japanese): K computer: moderate performance but high-precision arithmetic. GPUs are expensive, and their networks scale poorly to large systems (at most 16 in the latest generation).
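To make the data-parallelism point concrete, here is a minimal sketch of synchronous data-parallel SGD, with the allreduce (the operation the high-injection-BW Tofu network accelerates) simulated in-process. A real run would use MPI or a framework collective; all names here are illustrative.

```python
import numpy as np

def allreduce_mean(local_grads):
    # Stand-in for MPI_Allreduce(..., MPI_SUM) / nranks: every worker
    # ends up holding the same averaged gradient.
    return sum(local_grads) / len(local_grads)

rng = np.random.default_rng(0)
w = np.zeros(4)                      # replicated model parameters
workers, lr = 4, 0.1
for step in range(10):
    # Each worker computes a gradient on its own shard of the mini-batch
    grads = [rng.normal(size=4) for _ in range(workers)]
    g = allreduce_mean(grads)        # one network reduction per step
    w -= lr * g                      # identical update on every replica
```

Scaling this pattern is bandwidth bound in the reduction, which is why the slide pairs the low-precision ALUs with a fast interconnect.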
Selecting the Optimal Convolution Kernel

NEW! Micro-batching: Tokyo Tech. and ETH [Oyama, Tan, Hoefler & Matsuoka]
- Use the "micro-batch" technique to select the best convolution kernel (direct, GEMM, FFT, Winograd), optimizing both speed and memory size
- On high-end GPUs, Winograd or FFT is in many cases chosen over GEMM: they are faster but use more memory
- Currently implemented as a cuDNN wrapper, applicable to all frameworks
- For Post-K, (1) Winograd/FFT are selected more often, and (2) performance will be similar to GPUs in such cases
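The WR selection evaluated below can be sketched as a small dynamic program: split the mini-batch into micro-batches, picking for each a size/algorithm pair that fits within the workspace cap while minimizing total time. The cost table here is invented purely for illustration; μ-cuDNN obtains the real numbers by benchmarking cuDNN.

```python
# Hypothetical per-micro-batch costs: (size, algorithm) -> (time_ms, workspace_MiB)
costs = {
    (256, "GEMM"):     (6.0, 16),
    (128, "FFT"):      (2.0, 40),
    (64,  "WINOGRAD"): (0.8, 30),
    (32,  "WINOGRAD"): (0.5, 20),
}

def best_split(batch, ws_cap_mib):
    """best[b] = minimal total time to process b samples, where every
    micro-batch must individually fit within the workspace cap."""
    INF = float("inf")
    best = [0.0] + [INF] * batch
    choice = [None] * (batch + 1)
    for b in range(1, batch + 1):
        for (mb, algo), (t, ws) in costs.items():
            if mb <= b and ws <= ws_cap_mib and best[b - mb] + t < best[b]:
                best[b] = best[b - mb] + t
                choice[b] = (mb, algo)
    plan, b = [], batch               # reconstruct the chosen split
    while b > 0 and choice[b] is not None:
        plan.append(choice[b])
        b -= choice[b][0]
    return best[batch], plan

time_ms, plan = best_split(256, ws_cap_mib=64)
# With these made-up costs, four 64-sample Winograd micro-batches win
```

In the real wrapper the same idea is applied per convolution call, which is how Winograd/FFT end up chosen over GEMM whenever the memory budget allows.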
Evaluation: WD using Integer LP
- A desirable configuration set for AlexNet conv2 (forward); mini-batch size of 256, P100-SXM2
- Each bar represents the proportion of micro-batch sizes and algorithms
[Figure: execution time [ms] vs. workspace size [MiB] for the cuDNN algorithms IMPLICIT_GEMM, IMPLICIT_PRECOMP_GEMM, GEMM, FFT, FFT_TILING, and WINOGRAD_NONFUSED]
Evaluation: WR using Dynamic Programming
- μ-cuDNN achieved a 2.33x speedup on the forward convolution of AlexNet conv2
- cudnnConvolutionForward of AlexNet conv2 on an NVIDIA Tesla P100-SXM2; workspace size of 64 MiB, mini-batch size of 256
- Numbers on each rectangle represent micro-batch sizes
[Figure: execution time [ms] of the undivided / powerOfTwo / all policies, broken down into IMPLICIT_PRECOMP_GEMM, FFT_TILING, and WINOGRAD_NONFUSED kernels; 2.33x and 2.09x speedups over the undivided baseline, with micro-batch sizes annotated on each rectangle]
Large-scale simulation and AI coming together [Ichimura et al., Univ. of Tokyo, IEEE/ACM SC17 Best Poster]
- A 130-billion-degree-of-freedom earthquake simulation of the entire Tokyo area on the K computer (ACM Gordon Bell Prize finalist SC16; SC17 Best Poster)
- Too many instances: earthquake scenarios x soft-soil ground models
Cutting-Edge Research AI Infrastructures in Japan
Accelerating BD/AI with HPC (w/accompanying BYTES) (and my effort to design & build them)
- Oct. 2015: TSUBAME-KFC/DL (Tokyo Tech./NEC), 1.4 AI-PF (Petaflops) (in production)
- Mar. 2017: AIST AI Cloud (AIST-AIRC/NEC), 8.2 AI-PF (x5.8) (in production)
- Mar. 2017: AI Supercomputer (Riken AIP/Fujitsu), 4.1 AI-PF (in production)
- Aug. 2017: TSUBAME3.0 (Tokyo Tech./HPE), 47.2 AI-PF (65.8 AI-PF w/TSUBAME2.5) (x5.8) (in production)
- Aug. 2018: ABCI (AIST-AIRC), 550 AI-Petaflops (x11.7): the AI-ExaFlop era
- 2020: Post-K, multi AI-Exaflops (x4~6?), an order of magnitude over ABCI (in preparation)
R&D investment into world-leading AI/BD hardware, software, and algorithms, and their co-design for cutting-edge infrastructure, is absolutely necessary (just as with Japan's Post-K and the US ECP in HPC).
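The generational multipliers on the slide can be checked with quick arithmetic on the AI-Petaflops figures (the main TSUBAME/ABCI chain; the computed ratios land near the quoted x5.8 and x11.7):

```python
# AI-PF figures from the timeline above; quoted ratios are approximate
systems = [("TSUBAME-KFC/DL", 1.4),
           ("AIST AI Cloud", 8.2),
           ("TSUBAME3.0", 47.2),
           ("ABCI", 550.0)]
for (a, pa), (b, pb) in zip(systems, systems[1:]):
    print(f"{a} -> {b}: x{pb / pa:.2f}")
# A further x4~6 over ABCI would put Post-K at multiple AI-Exaflops
```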