
  • Big Data/AI and HPC Convergence Towards Post-K

    Satoshi Matsuoka
    Director, Riken Center for Computational Science / Professor, Tokyo Institute of Technology

    ACM HPDC 2018 Keynote Talk, June 14, 2018

  • Tokyo Tech. TSUBAME Supercomputing History

    World's leading supercomputers: x100,000 speedup in 17 years, developing world-leading use of massively parallel, many-core technology (general-purpose CPUs & many-core processors (GPUs), advanced optical networks, non-volatile memory, efficient power control and cooling).

    2000  Custom supercomputer, 128 Gigaflops, 32 cores (Matsuoka's GSIC appointment)
    2002  "TSUBAME0", 1.3 Teraflops: first "TeraScale" Japanese university supercomputer, 800 cores
    2006  TSUBAME1.0, 80 Teraflops: No.1 Asia, No.7 World, 10,000 cores
    2008  TSUBAME1.2, 170 Teraflops: world's first GPU supercomputer
    2010  TSUBAME2.0, 2.4 Petaflops: World No.1 Production Green, ACM Gordon Bell Prize
    2013  TSUBAME2.5, 4118 GPUs upgraded: 5.7 Petaflops, No.2 Japan (AI Flops 17.1 Petaflops)
    2013  TSUBAME-KFC: TSUBAME3 prototype, oil immersion cooling, Green World No.1
    2015  AI prototype upgrade (KFC/DL)
    2017  TSUBAME3.0, > 10 million cores, 12.1 Petaflops (AI Flops 47.2 Petaflops), Green World No.1 — HPC and Big Data / AI convergence

  • JST-CREST "Extreme Big Data" Project (2013-2018)

    Supercomputers: compute & batch oriented, but more fragile. Cloud IDCs: highly available and resilient, but very low bandwidth & efficiency.

    Convergent Architecture (Phases 1~4): large-capacity NVM, high-bisection network.
    [Architecture diagram: a high-powered main CPU and low-power CPUs on a TSV interposer/PCB, each with DRAM and NVM/Flash stacks; 2 Tbps HBM, 4~6 HBM channels, 1.5 TB/s DRAM & NVM BW; 30 PB/s I/O BW possible, 1 Yottabyte / year]

    EBD System Software, incl. the EBD Object System (Graph Store, EBD Bag, EBD KVS, Cartesian Plane), co-designed with future non-silo extreme big data scientific apps:
    - Large-scale metagenomics
    - Massive sensors and data assimilation in weather prediction
    - Ultra-large-scale graphs and social infrastructures

    Exascale Big Data HPC co-design.

    Given a top-class supercomputer, how fast can we accelerate next-generation big data, c.f. clouds? Bring HPC rigor in architectural, algorithmic, and system software performance and modeling into big data.

  • Neglected Tropical Diseases (NTDs)

    • Diseases prevalent in tropical areas due to a lack of drugs, whose development is impaired owing to poor economic conditions
    • The World Health Organization (WHO) defines 17 NTDs
    • More than 1 billion infections, ½ million deaths
    • Some are making their way to developed countries due to global warming!

    [Figures: Leishmaniasis disease (source: DNDi); Trypanosoma forms in a blood smear; insect vector; world dengue fever distribution (source: WHO)]

    Japanese government promises African aid during the 2013 TICAD V meeting in Yokohama

    Speaker notes: Neglected Tropical Diseases (NTDs) is the collective term for infectious diseases prevalent mainly in tropical regions, for which drug development is insufficient because of patients' economic circumstances. The World Health Organization (WHO) currently defines 17 NTDs, including dengue fever and Chagas disease; more than 1 billion people worldwide are infected and more than 500,000 die every year.

    There are almost no NTD patients in Japan. However, in the Yokohama Declaration of the 4th Tokyo International Conference on African Development held in Yokohama in 2008, Japan pledged support for NTD countermeasures in Africa, and at the 5th TICAD in 2013 Prime Minister Abe declared that this support would continue.

    http://en.wikipedia.org/wiki/Blood_film

  • EBD vs. EBD: Large-Scale Homology Search for Metagenomics [Akiyama et al., Tokyo Tech]

    Metagenomics with next-generation sequencers:
    - Revealing uncultured microbiomes and finding novel genes in various environments (human body, sea, soil)
    - Applied to human health in recent years
    - Outputs: taxonomic composition, metabolic pathways

    The core computation is an EBD-vs-EBD correlation / similarity search: O(n) measurement data against an O(m) reference database, an O(m n) calculation, with both sides increasing rapidly.

    Example: metagenomic analysis of periodontitis patients (with Prof. Kazuyuki Ishihara, Tokyo Dental College) — comparative metagenomic analysis between healthy persons and patients; high-risk microorganisms are detected.
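    As an illustration of how the O(m n) all-vs-all comparison is tamed in practice, the sketch below shows plain k-mer seed filtering: candidate references are selected by shared k-mers before any expensive alignment. This is a minimal, hypothetical example, not the GHOSTZ algorithm (which additionally clusters subsequences and performs gapped extension); the sequences and thresholds are made up.

```python
# Minimal sketch of k-mer seed filtering for homology search.
# Illustrative only: real tools (BLAST, GHOSTZ) add seed clustering,
# gapped extension, and statistical scoring.
from collections import defaultdict

def build_kmer_index(reference_seqs, k=4):
    """Map every k-mer to the reference sequences containing it."""
    index = defaultdict(set)
    for ref_id, seq in reference_seqs.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(ref_id)
    return index

def candidate_hits(query, index, k=4, min_shared_kmers=2):
    """Return reference ids sharing enough k-mer seeds with the query,
    so that expensive alignment is run only on these candidates."""
    counts = defaultdict(int)
    for i in range(len(query) - k + 1):
        for ref_id in index.get(query[i:i + k], ()):
            counts[ref_id] += 1
    return [r for r, c in counts.items() if c >= min_shared_kmers]

refs = {"refA": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "refB": "MSERKVLALQARKKRTK"}
idx = build_kmer_index(refs)
print(candidate_hits("AYIAKQRQISFV", idx))   # expected: ['refA']
```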

  • Development of Ultra-fast Homology Search Tools
    x100,000 ~ x1,000,000 c.f. a high-end BLAST workstation (both FLOPS and BYTES)

    GHOSTZ: subsequence clustering
    [Chart: computational time for 10,000 sequences (sec.), 3.9 GB DB, 1 CPU core; BLAST vs. GHOSTZ on a log scale from 1 to 1,000,000]
    Suzuki, et al., Bioinformatics, 2015.

    GHOSTZ-GPU: multithreading on GPUs; ×70 faster than 1 core using 12 cores + 3 GPUs
    [Chart: speed-up ratio over 1 core for the 1C, 1C+1G, 12C+1G, 12C+3G configurations]
    Suzuki, et al., PLOS ONE, 2016.

    GHOST-MP: MPI + OpenMP hybrid parallelization; ×240 faster than the conventional algorithm, retaining strong scaling up to 100,000 cores; ×80~×100 faster than mpi-BLAST on TSUBAME 2.5 (thin node GPU)
    Kakuta, et al. (submitted)

  • Tokyo Tech IT-Drug Discovery MIDL: Simulation & Big Data & AI at Top HPC Scale
    (Tonomachi, Kawasaki city: planned 2017, PI Yutaka Akiyama)

    Tokyo Tech's research seeds:
    ① Drug target selection system
    ② Glide-based virtual screening: TSUBAME's GPU environment allows world's top-tier virtual screening
       • Yoshino et al., PLOS ONE (2015)  • Chiba et al., Sci Rep (2015)
    ③ Novel algorithms for fast virtual screening against huge databases: fragment-based efficient algorithm designed for 100-million-compound data
       • Yanagisawa et al., GIW (2016)

    Application projects: new drug discovery platform especially for specialty peptides and nucleic acids — plasma binding (ML-based), membrane penetration (molecular dynamics simulation).

    Minister of Health, Labour and Welfare Award of the 11th annual Merit Awards for Industry-Academia-Government Collaboration.

    Drug discovery platform powered by supercomputing and machine learning. Investments from the Japanese government, Tokyo Tech (TSUBAME SC), municipal government (Kawasaki), and Japanese & US pharma.

    Multi-petaflops compute and peta~exabytes of data processed continuously: cutting-edge, large-scale HPC & BD/AI infrastructure is absolutely necessary.

    Speaker notes: Prof. Akiyama's research group in the School of Computing has been working on computational drug discovery. They developed an original system named "iNTRODB", an intelligent system for supporting drug target selection; it helps users find good drug target proteins. They applied this system in a drug discovery project for tropical diseases, and using iNTRODB they selected four promising target proteins (lower-left figure). They then used our supercomputer TSUBAME for virtual screening, and finally found several promising drug candidate compounds (lower-right figure).

  • Pioneering “Big Data Assimilation” Era

    Mutual feedback

    High-precision Simulations

    High-precision observations

    Future-generation technologies available 10 years in advance

  • EBD App 2: Takemasa Miyoshi Group (Weather Forecast Application)

    Big Data Assimilation for severe weather forecasting

    Revolutionary super-rapid 30-sec. cycle: 120 times more rapid than hourly update cycles

    Goal: pinpoint (100-m resolution) forecast of severe local weather by updating a 30-min forecast every 30 sec!

    Only in 10 minutes!
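    The assimilation step that such a 30-second cycle repeats can be illustrated with a generic stochastic ensemble Kalman filter update; the sketch below is a toy NumPy version on a made-up 3-variable state, not the group's LETKF implementation.

```python
# Toy stochastic Ensemble Kalman Filter analysis step (illustrative of a
# data-assimilation update; not the SCALE/LETKF code used by the group).
import numpy as np

def enkf_update(ensemble, obs, H, obs_err_std, rng):
    """ensemble: (state_dim, members); obs: (obs_dim,); H: linear obs operator."""
    state_dim, m = ensemble.shape
    mean = ensemble.mean(axis=1, keepdims=True)
    A = ensemble - mean                         # forecast perturbations
    HA = H @ A
    P_HT = A @ HA.T / (m - 1)                   # P H^T estimated from the ensemble
    S = HA @ HA.T / (m - 1) + obs_err_std**2 * np.eye(len(obs))
    K = P_HT @ np.linalg.inv(S)                 # Kalman gain
    # Perturbed observations (stochastic EnKF variant)
    obs_pert = obs[:, None] + obs_err_std * rng.standard_normal((len(obs), m))
    return ensemble + K @ (obs_pert - H @ ensemble)

rng = np.random.default_rng(0)
truth = np.array([1.0, -2.0, 0.5])
H = np.eye(3)[:2]                               # observe the first two state variables
ens = truth[:, None] + rng.standard_normal((3, 40))   # 40-member prior ensemble
obs = H @ truth + 0.1 * rng.standard_normal(2)
analysis = enkf_update(ens, obs, H, 0.1, rng)
print("prior mean:   ", ens.mean(axis=1).round(2))
print("analysis mean:", analysis.mean(axis=1).round(2), " truth:", truth)
```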

  • EBD System Software (Matsuoka-G)

    • Big Data algorithms for accelerators (GPUs and FPGAs, low-level kernels for DNN & graphs) — a minimal SpMV sketch follows this list
      • Fast and memory-saving SpGEMM on GPUs
      • Accelerating SpMV on GPUs by reducing memory access
      • OpenCL-based high-performance 3D stencil computation on FPGAs
      • Evaluating strategies to accelerate applications using FPGAs
      • Accelerating spiking neural networks on FPGAs
      • Directive-based temporal-blocking application

    • Large-scale graph algorithms and sorting
      • No.1 on the Graph500 benchmark, 5 consecutive times (collab. w/ Kyushu-U, Riken, etc.)
      • Distributed large-scale dynamic graph data store & large-scale graph colouring (vertex coloring)
      • Dynamic graph data structure using local NVRAM
      • Incremental graph community detection
      • ScaleGraph: large-scale graph processing framework w/ user-friendly interface
      • GPU-HykSort: large-scale sorting on massive GPUs
      • XtrSort: GPU out-of-core sorting
      • Efficient parallel sorting algorithm for variable-length keys

    • Big-data performance modeling and analysis
      • Co-locating HPC and big data analytics
      • Visualizing traffic of large-scale networks
      • I/O vs. MPI traffic interference on fat-tree networks
      • ibprof: low-level profiler of MPI network traffic
      • Evaluation of HPC-big data applications in clouds
      • Analysis on configurations of burst buffers

    • High-performance big-data programming middleware
      • mrCUDA: remote-to-local GPU migration middleware
      • Transpiler between Python and Fortran
      • Hamar (Highly Accelerated Map Reduce)
      • Out-of-core GPU MapReduce for large-scale graph processing
      • DRAGON: extending UVM to NVMe
      • Hierarchical, UseR-level and ON-demand File system (HuronFS)

    • Optimizing traffic simulation app (ex-Suzumura Group)
      • Incremental graph community detection
      • DeepGraph
      • Exact-differential traffic simulation
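    The SpMV sketch referenced above: a minimal CSR sparse matrix-vector product in NumPy/SciPy, included only to make explicit the irregular gather (x[indices[...]]) that the GPU SpMV/SpGEMM work reduces. It is an illustration, not the group's GPU kernel.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def spmv_csr(indptr, indices, data, x):
    """y = A @ x for A in CSR form. The gather x[indices[...]] is the
    irregular memory access that GPU SpMV kernels try to minimize."""
    y = np.zeros(len(indptr) - 1, dtype=data.dtype)
    for row in range(len(y)):
        start, end = indptr[row], indptr[row + 1]
        y[row] = np.dot(data[start:end], x[indices[start:end]])
    return y

A = sparse_random(1000, 1000, density=0.01, format="csr", dtype=np.float64)
x = np.random.rand(1000)
y = spmv_csr(A.indptr, A.indices, A.data, x)
assert np.allclose(y, A @ x)   # check against SciPy's own SpMV
```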

  • The size of graphs

    [Scatter plot: number of edges (log2(m)) vs. number of vertices (log2(n)) for USA road networks (NY, LKS, USA), the Human Brain Project, Twitter (tweets/day), symbolic networks, and the Graph500 problem classes Toy, Mini, Small, Medium, Large, Huge; reference lines at 1 billion / 1 trillion nodes and edges]

    K computer: 65,536 nodes, Graph500: 17,977 GTEPS
    Mobile phone (SONY SO-01F, Snapdragon S4 1.7 GHz 4-core, 2 GB RAM): 1.03 GTEPS, 235.06 MTEPS/W

  • Sparse BYTES: The Graph500 – 2015~2016 – World #1 x 4
    K computer #1: Tokyo Tech [Matsuoka EBD CREST], Univ. Kyushu [Fujisawa Graph CREST], Riken AICS, Fujitsu

    List                            | Rank | GTEPS    | Implementation
    November 2013                   | 4    | 5524.12  | Top-down only
    June 2014                       | 1    | 17977.05 | Efficient hybrid
    November 2014                   | 2    | 19585.2  | Efficient hybrid
    June, Nov 2015; June, Nov 2016  | 1    | 38621.4  | Hybrid + node compression

    A BYTES-rich machine + a superior BYTES algorithm:
    - K computer: 88,000 nodes, 660,000 CPU cores, 1.3 Petabytes memory, 20 GB/s Tofu network — #1, 38,621.4 GTEPS (#7 on Top500, 10.51 PF)
    - Sunway TaihuLight: 10 million CPUs, 1.3 Petabytes memory — #2, 23,755.7 GTEPS (#1 on Top500, 93.01 PF)
    - LLNL-IBM Sequoia: 1.6 million CPUs, 1.6 Petabytes memory — #3, 23,751 GTEPS (#4 on Top500, 17.17 PF)

    Effectively x13 performance relative to the Linpack ranking: BYTES, not FLOPS!

    [Chart: elapsed time (ms) per BFS at 64 nodes (Scale 30) vs. 65,536 nodes (Scale 40); 73% of total execution time is spent waiting in communication]

  • K computer No.1 on Graph500: 5 Consecutive Times

    • What is the Graph500 benchmark?
      • A supercomputer benchmark for data-intensive applications.
      • Ranks supercomputers by the performance of Breadth-First Search on very large graph data.

    [Chart: Graph500 performance (GTEPS) from June 2012 to June 2016 for the K computer (Japan), Sequoia (U.S.A.), and Sunway TaihuLight (China); K computer No.1]

    This is achieved by a combination of high machine performance and our software optimization:
    • Efficient sparse matrix representation with bitmaps
    • Vertex reordering for bitmap optimization
    • Optimizing inter-node communications
    • Load balancing
    etc. (a sketch of the hybrid top-down/bottom-up BFS idea follows below)

    • Koji Ueno, Toyotaro Suzumura, Naoya Maruyama, Katsuki Fujisawa, and Satoshi Matsuoka, "Efficient Breadth-First Search on Massively Parallel and Distributed Memory Machines", in proceedings of 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington D.C., Dec. 5-8, 2016 (to appear)
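    The "efficient hybrid" entries above refer to direction-optimizing BFS, which switches between top-down and bottom-up sweeps. The sketch below shows the single-node idea with a simplified switching heuristic; the actual K computer implementation adds bitmap representations, vertex reordering, and distributed-memory communication across tens of thousands of nodes.

```python
# Sketch of direction-optimizing BFS (top-down/bottom-up switching by
# frontier size). Single-node, adjacency-list graph; the Graph500 code
# distributes this with bitmaps and MPI.

def hybrid_bfs(adj, source, alpha=4):
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = {source}
    while frontier:
        frontier_edges = sum(len(adj[v]) for v in frontier)
        unvisited_edges = sum(len(adj[v]) for v in range(n) if parent[v] == -1)
        next_frontier = set()
        if frontier_edges * alpha > unvisited_edges:
            # Bottom-up: every unvisited vertex scans its neighbours for a parent.
            for v in range(n):
                if parent[v] == -1:
                    for u in adj[v]:
                        if u in frontier:
                            parent[v] = u
                            next_frontier.add(v)
                            break
        else:
            # Top-down: expand outward from the frontier.
            for u in frontier:
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        next_frontier.add(v)
        frontier = next_frontier
    return parent

adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
print(hybrid_bfs(adj, 0))   # e.g. [0, 0, 0, 1, 3]
```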

  • Distributed Large-Scale Dynamic Graph Data Store
    Keita Iwabuchi (1,2), Scott Sallinen (3), Roger Pearce (2), Brian Van Essen (2), Maya Gokhale (2), Satoshi Matsuoka (1)
    1. Tokyo Institute of Technology (Tokyo Tech)  2. Lawrence Livermore National Laboratory (LLNL)  3. University of British Columbia

    Dynamic graphs (temporal graphs)
    • The structure of the graph changes dynamically over time
    • Many real-world graphs are classified as dynamic graphs

    Sparse, large, scale-free graphs: social networks, genome analysis, WWW, etc. — e.g., Facebook managed 1.39 billion active users as of 2014, with more than 400 billion edges.

    • Most studies of large graphs have focused not on dynamic graph data structures but rather on static ones, such as Graph500
    • Even with the large memory capacities of HPC systems, many graph applications require additional out-of-core memory (this part is still at an early stage)

    Source: Jakob Enemark and Kim Sneppen, "Gene duplication models for directed networks with limits on growth", Journal of Statistical Mechanics: Theory and Experiment 2007

  • Distributed Large-Scale Dynamic Graph Data Store (cont.)

    1. How to store dynamic graphs in local memory? (high-speed graph update and lookup)
    2. How to extend to distributed-memory platforms? (efficient communication for real-time processing)

    DegAwareRHH:
    1. Leverages a linear-probing hash table to increase sequential locality while keeping dynamic graph update performance (see the hashing sketch after this slide)
    2. Adopts an asynchronous communication framework to localize communication

    Performance comparison against state-of-the-art works (STINGER, PowerGraph):
    [Charts: dynamic graph construction against STINGER (single node, in-core) and PowerGraph (static processing only), speed-up over 6/12/24 threads/processes; and dynamic graph colouring (in-core) [SC'16], edge inserts/s (millions) on 1 to 64 compute nodes; speedups of 2.1X and 200X shown]

    TOKYO INSTITUTE OF TECHNOLOGY. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-PRES-726942).
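    Assuming RHH stands for Robin Hood hashing, the sketch below illustrates the linear-probing scheme referred to in point 1: on a collision, the entry that has probed furthest from its home slot keeps the slot, so lookups stay within short, cache-friendly scans. It is an illustration of the probing idea only, not the DegAwareRHH data structure.

```python
# Minimal open-addressing hash table with Robin Hood insertion.
class RobinHoodTable:
    def __init__(self, capacity=16):
        self.slots = [None] * capacity          # each slot: (key, value, distance)

    def insert(self, key, value):
        cap = len(self.slots)
        idx = hash(key) % cap
        entry = (key, value, 0)                 # distance from its home slot
        while True:
            cur = self.slots[idx]
            if cur is None:
                self.slots[idx] = entry
                return
            if cur[0] == entry[0]:              # update an existing key
                self.slots[idx] = (cur[0], entry[1], cur[2])
                return
            if cur[2] < entry[2]:               # steal from the "richer" entry
                self.slots[idx], entry = entry, cur
            idx = (idx + 1) % cap
            entry = (entry[0], entry[1], entry[2] + 1)

    def get(self, key):
        cap = len(self.slots)
        idx, dist = hash(key) % cap, 0
        while True:
            cur = self.slots[idx]
            if cur is None or cur[2] < dist:    # would have been placed earlier
                return None
            if cur[0] == key:
                return cur[1]
            idx, dist = (idx + 1) % cap, dist + 1

t = RobinHoodTable()
t.insert("a", 1); t.insert("b", 2)
print(t.get("a"), t.get("b"), t.get("missing"))   # 1 2 None
```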

  • Large-scale Graph Processing Framework w/ User-Friendly Interface

    • ScaleGraph
      • X10-based open-source, highly scalable large-scale graph analytics library, beyond the scale of billions of vertices and edges, on distributed systems
      • XPregel: Pregel-based bulk-synchronous parallel graph processing framework
      • Built-in graph algorithms (centrality, connected components, clustering, etc.)
    • Python interface
      • Allows users to use ScaleGraph together with Spark* via an easy Python interface (user Python script → Spark (RDD) / HDFS / ScaleGraph on the cluster)

    Software stack: User Program → Graph Algorithm → XPregel (graph processing system) → ScaleGraph Base Library → X10 Standard Lib, X10 Sparse Matrix, BLAS, File IO, third-party libraries (ARPACK, METIS) → MPI (X10 & C++ team)

    *Apache Spark: http://spark.apache.org/

  • Modern AI is Enabled by Supercomputing

    • 25 years of AI winter after the failure of symbolic-logic-based methods (e.g., Prolog, ICOT) -> resurrection by DNNs: the basic algorithms existed in the 1980s but were too expensive -> HPC made machines 10 million times faster in 30 years -> expensive training is now possible

    • Recent trends require even more supercomputing power
      – Deeper, more complex networks (Capsule Networks)
      – Complex, multidimensional data (e.g., 3-D hi-res images)
      – Increasing training sets (incl. GANs)
      – Coupling with high-fidelity simulations
      – Etc.

    Fig. 2: Andrew Ng (Baidu), "What Data Scientists Should Know about Deep Learning"

  • 4 Layers of Parallelism in DNN Training, Well Supported in Post-K

    • Hyperparameter search
      • Searching optimal network configurations & parameters
      • Parallel search; massive parallelism required
    • Data parallelism
      • Copy the network to compute nodes, feed different batch data, average => bound by the network reduction
      • TOFU: extremely strong reduction, x6 EDR InfiniBand
    • Model parallelism (domain decomposition)
      • Split and parallelize the layer calculations in propagation
      • Low latency required (bad for GPUs) -> strongly latency-tolerant cores + low-latency TOFU network
    • Intra-chip ILP, vector and other low-level parallelism
      • Parallelize the convolution operations, etc.
      • SVE FP16+INT8 vectorization support + extremely high memory bandwidth w/ HBM2

    • Post-K could become the world's biggest & fastest platform for DNN training!

  • Deep Learning is "All about Scale": Massive Parallelization is the Key

    • Data-parallel training with (asynchronous) Stochastic Gradient Descent
      – Replicate the network to all the nodes, feed different data, average the gradients periodically
      – The network all-reduce of megabytes~gigabytes of gradients becomes the bottleneck at scale
      – NVIDIA: NVLink hardware + NCCL library (up to 8 GPUs on DGX-1, 16 on DGX-2 w/ NVSwitch)

    June 24, 2018, Jens Domke

    Fig. 2: Andrew Ng (Baidu), "What Data Scientists Should Know about Deep Learning"
    Fig. 3: Simplified DL workflow with ASGD, per iteration: 1. compute gradients; 2. exchange gradients via all-reduce; 3. update network parameters
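    A minimal sketch of the synchronous flavour of this data-parallel scheme, using mpi4py's Allreduce to average gradients (the reduction that NVLink/NCCL accelerate in hardware). The model and data are toys; this is illustrative, not a production training loop.

```python
# Data-parallel SGD sketch: every rank holds a full replica of the
# parameters, computes a gradient on its own shard of the batch, and the
# gradients are averaged with MPI_Allreduce before the update.
# Run with e.g.: mpirun -np 4 python this_script.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(seed=rank)          # each rank sees different data
w = np.zeros(10)                                 # replicated model parameters
lr = 0.1

for step in range(100):
    # Toy local gradient: least squares on this rank's random mini-batch.
    X = rng.standard_normal((32, 10))
    y = X @ np.arange(10.0) + 0.1 * rng.standard_normal(32)
    grad_local = X.T @ (X @ w - y) / len(y)

    # All-reduce: sum gradients across ranks, then divide by the number of
    # replicas. This reduction is the scaling bottleneck at large node counts.
    grad_global = np.empty_like(grad_local)
    comm.Allreduce(grad_local, grad_global, op=MPI.SUM)
    w -= lr * grad_global / size

if rank == 0:
    print("learned weights ~ 0..9:", np.round(w, 2))
```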

  • Example AI Research: Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers

    Background
    • In large-scale Asynchronous Stochastic Gradient Descent (ASGD), mini-batch size and gradient staleness tend to be large and unpredictable, which increases the error of the trained DNN.

    [Diagram: in the DNN parameter space, two asynchronous updates W(t) → W(t+1) → W(t+2) are applied while a worker is still computing its gradient -ηΣi∇Ei against the objective function E, so its update to W(t+3) has staleness = 2, whereas an immediately applied update has staleness = 0. Mini-batch size is governed by NSubbatch, the number of samples per GPU iteration.]

    Proposal
    • We propose an empirical performance model for the ASGD deep learning system SPRINT which considers the probability distributions of mini-batch size and staleness.
    [Plots: measured vs. predicted mini-batch size and staleness for 4, 8, and 16 nodes]

    • Yosuke Oyama, Akihiro Nomura, Ikuro Sato, Hiroki Nishimura, Yukimasa Tamatsu, and Satoshi Matsuoka, "Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers", in proceedings of 2016 IEEE International Conference on Big Data (IEEE BigData 2016), Washington D.C., Dec. 5-8, 2016
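    The quantities that such a model predicts can be illustrated with a toy event-driven simulation of asynchronous updates; the sketch below, with arbitrary timing constants, only shows how a gradient-staleness distribution arises when workers push at different times. It is not the published SPRINT performance model.

```python
# Toy simulation of gradient staleness in asynchronous SGD: each worker
# repeatedly (1) pulls the current parameter version, (2) computes for a
# random time, (3) pushes its gradient, which is applied against a possibly
# newer version. Staleness = updates applied in between. Timings are arbitrary.
import heapq, random
from collections import Counter

def simulate_staleness(num_workers=16, num_updates=2000, seed=0):
    random.seed(seed)
    version = 0
    staleness = Counter()
    # Event queue of (finish_time, worker_id, version_seen_at_pull).
    events = [(random.uniform(0.9, 1.5), w, 0) for w in range(num_workers)]
    heapq.heapify(events)
    for _ in range(num_updates):
        t, w, seen = heapq.heappop(events)
        staleness[version - seen] += 1      # updates applied since this pull
        version += 1                         # apply this worker's gradient
        heapq.heappush(events, (t + random.uniform(0.9, 1.5), w, version))
    return staleness

hist = simulate_staleness()
for s in sorted(hist):
    print(f"staleness {s}: {hist[s]} updates")
```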

  • Interconnect Performance Is as Important as GPU Performance to Accelerate DL

    • ASGD DL system SPRINT (by DENSO IT Lab) and DL speedup prediction with the performance model
      – Data measured on TSUBAME2 and TSUBAME-KFC (both FDR InfiniBand) fitted to the formulas
      – Allreduce time (∈ TGPU) depends on #nodes and #DL_parameters
    • Other approaches show similar improvements:
      – CUDA-aware CNTK optimizes the communication pipeline: 15%–23% speedup (Banerjee et al., "Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters")
      – Reduced precision (FP[16|8|1]) to minimize message size with no or minor accuracy loss

    Fig. 4: Oyama et al., "Predicting Statistics of Asynchronous SGD Parameters for a Large-Scale Distributed Deep Learning System on GPU Supercomputers"

  • Fast and Cost-Effective Deep Learning Algorithm Platform for Video Processing in Social Infrastructure

    Principal Investigator: Koichi Shinoda
    Collaborators: Satoshi Matsuoka, Tsuyoshi Murata, Rio Yokota
    Tokyo Institute of Technology (members of RWBC-OIL 1-1 and 2-1)

    JST-CREST "Development and Integration of Artificial Intelligence Technologies for Innovation Acceleration"

  • Our Deep Learning Scaling Achievements in 2017

    Component          | Speed        | Memory
    Compute node       | x7.4 (x50)   | 1/15 (1/10)
    Parallelization    | x11.6* (x10) | 2* (1/10)
    Learning algorithm | x11.6* (x10) | 2* (1/10)
    Downsizing         |              | 1/90 (1/100)
    Total              | > x1000      | < 1/1000

    *: Achievement obtained by the joint work of the two groups

  • Deep Learning for Science: Fusion Energy Science [Prof. William Tang, Princeton Univ.]

    The most critical problem for fusion energy: avoid/mitigate large-scale major disruptions.
    • Approach: use big-data-driven machine-learning (ML) predictions of the occurrence of disruptions in the EUROfusion Joint European Torus (JET), DIII-D (US) & other tokamaks worldwide.
    • Recent status: first-principles simulation is not possible. ~8 years of R&D (led by JET) using Support Vector Machine (SVM) ML on zero-D time-trace data on CPU clusters yielded reported success rates in the mid-80% range for JET, 30 ms before disruptions, BUT > 95% with a false-alarm rate < 5% is actually needed for ITER (P. de Vries, et al., 2015).
    • Princeton team machine-learning goals: (i) improve physics fidelity via development of new ML multi-dimensional, time-dependent software including better classifiers; (ii) develop "portable" (cross-machine) predictive software beyond JET to other devices and eventually ITER; and (iii) enhance the accuracy & speed of disruption analysis for very large datasets via HPC development & deployment of advanced ML software via Deep Learning / AI neural networks (both convolutional & recurrent) in Princeton's Fusion Recurrent Neural Net (FRNN) code.

  • FRNN DL/AI Software Reliably Scales to 1K P100 GPUs on TSUBAME 3.0, with Associated Production Runs Contributing Strongly to Hyperparameter-Tuning-Enabled Physics Advances!

    Recent results: TSUBAME 3.0 supercomputer (Tokyo Tech)
    TSUBAME 3.0 "Grand Challenge" runs (A. Svyatkovskii, Princeton U.)
    – On the order of a thousand Tesla P100 SXM2 GPUs, 4 GPUs per node, NVLink
    – TensorFlow + MPI, CUDA 8, cuDNN 6, OpenMPI 2.1.1, GPUDirect

    Cross-machine prediction (train on DIII-D, test on JET):
    RNN 0D & RNN 1D   ~0.80
    XGBoost (shallow)  0.62
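    FRNN itself is Princeton's TensorFlow-based code; the sketch below is only a generic PyTorch illustration of the underlying idea — classifying multichannel 0-D time traces with a recurrent network — using synthetic data in place of tokamak diagnostics.

```python
# Minimal recurrent classifier over multichannel 0-D time traces
# (synthetic data; illustrative of the FRNN idea, not the FRNN code).
import torch
import torch.nn as nn

class TraceClassifier(nn.Module):
    def __init__(self, channels=4, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, time, channels)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # disruption score from the last step

# Synthetic "shots": disruptive ones get a growing ramp on channel 0.
torch.manual_seed(0)
n, T, C = 256, 100, 4
x = torch.randn(n, T, C)
labels = (torch.rand(n) < 0.5).float()
x[labels.bool(), :, 0] += torch.linspace(0, 3, T)   # precursor-like signal

model = TraceClassifier(C)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(model(x).squeeze(1), labels)
    loss.backward()
    opt.step()
print("final training loss:", float(loss))
```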

  • Overview of TSUBAME3.0
    BYTES-centric architecture, scalability to all 2160 GPUs, all nodes, the entire memory hierarchy

    Full-bisection-bandwidth Intel Omni-Path interconnect, 4 ports/node: full bisection / 432 Terabits/s bidirectional, ~x2 the bandwidth of the entire Internet backbone traffic

    DDN storage (Lustre FS 15.9 PB + Home 45 TB)

    540 compute nodes: SGI ICE XA + new blade, Intel Xeon CPU x2 + NVIDIA Pascal GPU x4 (NVLink), 256 GB memory, 2 TB Intel NVMe SSD
    47.2 AI-Petaflops, 12.1 Petaflops

    Full operations August 2017

  • TSUBAME3.0 Co-Designed SGI ICE-XA Blade (new)- No exterior cable mess (power, NW, water)- Plan to become a future HPE product

  • TSUBAME3: A Massively BYTES-Centric Architecture for Converged BD/AI and HPC

    Per node:
    - HBM2: 64 GB, 2.5 TB/s; intra-node GPU via NVLink, 20~40 GB/s
    - DDR4: 256 GB, 150 GB/s
    - Intel Optane: 1.5 TB, 12 GB/s (planned)
    - NVMe Flash: 2 TB, 3 GB/s
    - 16 GB/s PCIe, fully switched; inter-node GPU via Omni-Path, 12.5 GB/s, fully switched

    ~4 Terabytes/node of hierarchical memory for Big Data / AI (c.f. K computer: 16 GB/node). Over 2 Petabytes in TSUBAME3, which can be moved at 54 Terabytes/s, or 1.7 Zettabytes / year.

    Terabit-class network per node: 800 Gbps (400+400), full bisection. Any "big" data in the system can be moved anywhere at RDMA speeds, minimum 12.5 GBytes/s, also with stream processing. Scalable to all 2160 GPUs, not just 8.


  • [Photos: power meters; Tokyo Tech / HPE benchmarking team; award ceremony at ISC 2017 in Frankfurt]

    TSUBAME3.0 became the first large production petaflops-scale supercomputer in the world to be #1 on the "Green500" power-efficiency world ranking of supercomputers.

    14.1 Gigaflops/W is more than x10 more efficient than PCs and smartphones!

  • Precision of Floating-Point Arithmetic

    [Chart: site comparison of AI-FP performance (PFLOPS, 0-70) across Riken (K), U-Tokyo (Oakforest-PACS (JCAHPC), Reedbush (U&H)), and Tokyo Tech (TSUBAME3.0, TSUBAME2.5, TSUBAME-KFC), broken down into DFP 64-bit (simulation), SFP 32-bit (computer graphics, gaming, big data), and HFP 16-bit (machine learning / AI)]

    Tokyo Tech GSIC leads Japan in aggregate AI-capable FLOPS: 65.8 Petaflops with TSUBAME3 + 2.5 + KFC (~6700 GPUs + ~4000 CPUs), across all supercomputers and clouds.

    [Chart: NVIDIA Pascal P100 DGEMM performance, GFLOPS (0-16,000) vs. matrix dimension m=n=k (0-4500), for P100-fp16, P100, and K40]

  • Tremendous Recent Rise in Interest by the Japanese Government in Big Data, DL, AI, and IoT

    • Three national centers on Big Data and AI launched by three competing ministries for FY 2016 (Apr 2015-)
      • MEXT – AIP (Artificial Intelligence Platform): Riken and other institutions (~$50 million), April 2016
        • A separate Post-K-related AI funding as well
        • Focused on basic research, machine learning algorithms
      • METI – AIRC (Artificial Intelligence Research Center): AIST (AIST internal budget + > $200 million FY 2017), April 2015
        • Broad AI/BD/IoT, industry focus
      • MOST – Universal Communication Lab: NICT ($50~55 million)
        • Brain-related AI
    • $1 billion commitment to inter-ministry AI research over 10 years
    • All 3 centers now invest in or work with top-tier supercomputing

    [Photo: Vice Minister Tsuchiya @ MEXT announcing the AIP establishment]

  • The Current Status of AI & Big Data in Japan
    We need the triad of advanced algorithms / infrastructure / data, but we lack the cutting-edge infrastructure dedicated to AI & Big Data (c.f. HPC).

    • R&D of ML algorithms & SW: AI venture startups, big companies' AI/BD R&D (also science), and AI/BD centers & labs in national labs & universities (Riken-AIP, AIST-AIRC, NICT-UCRI, the joint RWBC Open Innovation Lab (OIL, Director: Matsuoka)); over $1B government AI investment over 10 years; all seeking innovative applications of AI & data.
    • AI & data infrastructures: massive rise in computing requirements (1 AI-PF/person?). In HPC, the cloud continues to be insufficient for cutting-edge research => dedicated supercomputers dominate & race to exascale.
    • "Big" data: IoT communication, location & other data; petabytes of drive-recording video; FA & robots; web access and merchandise data — massive-scale data now largely wasted, yet massive "big" data is needed in training.

  • METI AIST-AIRC ABCI: the World's First Large-Scale OPEN AI Infrastructure

    • ABCI: AI Bridging Cloud Infrastructure
    • Top-level SC compute & data capability for DNN (550 AI-Petaflops)
    • Open public & dedicated infrastructure for AI & Big Data algorithms, software and applications — OPEN-SOURCING the AI DATACENTER
    • Platform to accelerate joint academic-industry R&D for AI in Japan

    • > 550 AI-Petaflops
    • < 3 MW power
    • < 1.1 avg. PUE
    • Operational summer 2018
    Univ. Tokyo Kashiwa Campus

  • ABCI: AI Bridging Cloud Infrastructure

    • Open, public, and dedicated infrastructure for AI & Big Data algorithms, software, and applications
    • Open Innovation Platform to accelerate joint academic-industry R&D for AI; international collaborations are also welcome
    • Top-level compute capability: 0.55 EFLOPS (DL), 37 PFLOPS (DP)
    • Top-level energy efficiency: lower PUE
    • All designs and implementations will be open-sourced

    Open Innovation Platform for AI R&D: universities, research institutes, companies; manufacturing, autonomous cars, edge AI

  • ABCI Procurement Benchmarks

    • Big Data benchmarks
      – (SPEC CPU Rate)
      – Graph500
      – MinuteSort
      – Node-local storage I/O
      – Parallel FS I/O
    • AI/ML benchmarks
      – Low-precision GEMM
        • CNN kernel; defines "AI-Flops" (a toy measurement sketch follows this list)
      – Single-node CNN
        • AlexNet and GoogLeNet
        • ILSVRC2012 dataset
      – Multi-node scalable CNN
        • Caffe + MPI
      – Large-memory CNN
        • Convnet on Chainer
      – RNN / LSTM
        • Neural machine translation on Torch

    No traditional HPC simulation benchmarks except SPEC CPU. Plan on "open-sourcing".
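    The low-precision GEMM item is what defines "AI-Flops" above; the sketch below shows one naive way to measure FP16 GEMM throughput on a GPU with CuPy. The matrix size, iteration count, and methodology are placeholders, not the actual procurement benchmark rules.

```python
# Rough FP16 GEMM throughput measurement (illustrative only; not the ABCI
# procurement benchmark). Requires a CUDA GPU and CuPy.
import time
import cupy as cp

def fp16_gemm_tflops(n=8192, iters=10):
    a = cp.random.rand(n, n, dtype=cp.float32).astype(cp.float16)
    b = cp.random.rand(n, n, dtype=cp.float32).astype(cp.float16)
    cp.matmul(a, b)                        # warm-up
    cp.cuda.Device().synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        cp.matmul(a, b)
    cp.cuda.Device().synchronize()         # wait for all kernels to finish
    elapsed = time.perf_counter() - t0
    return 2 * n**3 * iters / elapsed / 1e12   # 2*n^3 flops per GEMM

print(f"~{fp16_gemm_tflops():.1f} TFLOPS sustained in FP16 GEMM")
```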

  • ABCI Computing Node

    FUJITSU PRIMERGY server (2 servers in 2U)
    CPU:           Intel Xeon Gold 6148 (27.5 MB cache, 2.40 GHz, 20 cores) x2
    GPU:           NVIDIA Tesla V100 (SXM2) x4
    Memory:        384 GiB
    Local storage: 1.6 TB NVMe SSD (Intel SSD DC P4600 U.2) x1
    Interconnect:  InfiniBand EDR x2

    [Node diagram: two Xeon Gold 6148 sockets linked by UPI x3 (10.4 GT/s x3), each with DDR4-2666 32 GB x6 at 128 GB/s; an NVMe SSD and two IB HCAs (100 Gbps each); each socket connects via PCIe gen3 x16 through a PCIe switch (x48 / x64) to two Tesla V100 SXM2 GPUs, which are interconnected by NVLink2 x2]

  • Software and Services

    ■ Software (natively installed)
    Operating system:  CentOS, RHEL
    Job scheduler:     Univa Grid Engine
    Container engine:  Docker, Singularity
    MPI:               OpenMPI, MVAPICH
    Development tools: Intel Parallel Studio XE Cluster Edition, PGI Professional Edition, Python, Ruby, R, Java, Scala, Perl
    Deep learning:     Caffe, Caffe2, TensorFlow, Theano, Torch, PyTorch, CNTK, MXNet, Chainer, Keras, etc.

    ■ Service types and container support (still under design)
    Service types: On-Demand (interactive use), Spot (batch use), Reserved (advance reservation), Campaign
    Resource allocation: Univa resource allocator
    Job bootstrap: Singularity, Univa + Docker, or direct run
    Applications: native / user-installed software, or system-provided / user-defined containers

  • ABCI Datacenter Overview

    • Single floor, inexpensive build

    • Hard concrete floor with 2 tons/m2 weight tolerance

    • Racks• 144 racks max.• ABCI uses 43 racks

    • Power capacity• 3.25 MW max.• ABCI uses 2.3MW max.

    • Water-Air Hybrid Cooling• Water Block: 60kW/Rack• Fan Coil Unit: 10kW/Rack• Total: 3.2MW min. (summer)

    Commoditizing supercomputer cooling technologies to Cloud (70kW/rack)

  • Comparing TSUBAME3/ABCI to a Classical IDC: AI IDC CAPEX/OPEX acceleration by > x100

    Traditional Xeon IDC: ~10 kW/rack, PUE 1.5~2, 15~20 1U Xeon servers, 2 Tera AI-FLOPS (SFP) / server, 30~40 Tera AI-FLOPS / rack

    TSUBAME3 (+Volta) & ABCI IDC: ~70 kW/rack, PUE 1.0x, ~36 T3-evolution servers, ~500 Tera AI-FLOPS (HFP) / server, ~17 Peta AI-FLOPS / rack

    Performance > 400~600x, power efficiency > 200~300x

  • Japan's Flagship 2020 "Post-K" Supercomputer

    [System diagram: compute nodes and interconnect, login servers, maintenance servers, portal servers, I/O network, hierarchical storage system]

    CPU
    • Many-core, Xeon-class ARM v8 cores + 512-bit SVE (Scalable Vector Extension)
    • Multi-hundred petaflops peak in total
    • Power-knob feature
    Memory
    • 3-D stacked DRAM, Terabyte/s BW per chip
    Interconnect
    • Tofu3 CPU-integrated 6-D torus network

    • I/O acceleration with massive SSDs
    • 30 MW+ power, liquid cooled
    • Riken co-design with Fujitsu
    • ? million cores in the system

    [Photo: Prime Minister Abe visiting the K Computer, 2013]

  • Post-K: The Game Changer

    1. Heritage of the K computer: high performance in simulation via extensive co-design
       • High performance: up to x100 the performance of K in real applications
       • Multitudes of scientific breakthroughs via Post-K application programs
       • Simultaneous high performance and ease of programming
    2. New technology innovations of Post-K
       • High performance, esp. via high memory BW: performance boost by "factors" c.f. mainstream CPUs in many HPC & Society 5.0 apps
       • Very green, e.g. extreme power efficiency: ultra-power-efficient design & various power-control knobs
       • Arm global ecosystem & SVE contribution: Arm ecosystem of 21 billion chips/year; SVE co-design and world's first implementation by Fujitsu, to become a global standard
       • High performance on Society 5.0 apps incl. AI: architectural features for high performance on Society 5.0 apps based on Big Data, AI/ML, CAE/EDA, blockchain security, etc.

    Arm: massive ecosystem from embedded to HPC. Global leadership not just in the machine & apps, but as cutting-edge IT; technology not limited to Post-K, but feeding into societal IT infrastructures, e.g. clouds.

    [Logo: the CPU for the Post-K supercomputer]

  • Post-K CPU New Innovations: Summary

    1. Ultra-high bandwidth using on-package memory & a matching CPU core
       - Recent studies show the majority of apps are memory bound; some are compute bound but can use lower precision, e.g. FP16
       - Comparison w/ mainstream CPUs: much faster FPU, almost an order of magnitude faster memory BW, and ultra-high performance accordingly
       - Memory controller to sustain massive on-package memory (OPM) BW — difficult for a coherent-memory CPU; the first CPU in the world to support OPM
    2. Very green, e.g. extreme power efficiency
       - Power-optimized design, clock gating & power knobs, efficient cooling
       - Power efficiency much better than CPUs, comparable to GPU systems
    3. Arm global ecosystem & SVE contribution
       - Annual processor production: x86 300-400 million, Arm 21 billion (2~3 billion high end)
       - Rapidly maturing HPC & IDC ecosystem (e.g. Cavium, HPE, Sandia, Bristol, ...)
       - SVE (Scalable Vector Extension) -> Arm-Fujitsu co-design, future global standard
    4. High performance on Society 5.0 apps including AI
       - Next-gen AI/ML requires massive speedup => high-performance chips + HPC-scale massive scalability across chips
       - Post-K processor: support for AI/ML acceleration, e.g. INT8/FP16 + fast memory for GPU-class convolution, fast interconnect for massive scaling
       - Top performance in AI as well as other Society 5.0 apps

  • Post-K Processor is… a Many-Core ARM CPU…
    - 48 compute cores + 2 or 4 assistant (OS) cores
    - Near Xeon-class performance per core
    - ARM v8 --- 64-bit ARM ecosystem
    - Tofu3 + PCIe 3 external connections

    …but also a GPU-like processor
    - SVE 512-bit vector extensions: integer (1, 2, 4, 8 bytes) + float (16, 32, 64 bits)
    - Cache + scratchpad local memory (sector cache)
    - Multi-stack 3D memory – TB/s-level memory BW, limited capacity
    - Various features for streaming memory access, strided access, etc.
    - Intra-chip barrier synchronization and other memory-enhancing features

    GPU-like high performance in HPC, AI/Big Data, blockchain… (2018/3/13)

  • Massive-Scale Deep Learning on Post-K

    Post-K processor:
    ◆ High-performance FP16 & INT8
    ◆ High memory BW for convolution
    ◆ Built-in scalable Tofu network
    => Unprecedented DL scalability

    High-performance DNN convolution: low-precision ALUs + high memory bandwidth + advanced combining of convolution algorithms (FFT + Winograd + GEMM).

    High-performance and ultra-scalable network for massive-scale model & data parallelism: Tofu network w/ high injection BW for fast reduction => unprecedented scalability of data/model parallelism.

    Speaker notes: K: medium performance but high-precision arithmetic. GPUs are expensive, and the networks that scale them to large systems are weak (at most 16 in the latest systems).

  • Selecting the Optimal Convolution Kernel

    NEW! Micro-batching: Tokyo Tech. and ETH [Oyama, Tan, Hoefler & Matsuoka]
    - Use the "micro-batch" technique to select the best convolution kernel: direct, GEMM, FFT, Winograd
    - Optimize both speed and memory size
    - On high-end GPUs, Winograd or FFT is in many cases chosen over GEMM: they are faster but use more memory
    - Currently implemented as a cuDNN wrapper, applicable to all frameworks (a toy selection sketch follows below)
    - For Post-K, (1) Winograd/FFT would be selected more often, and (2) performance will be similar to GPUs in such cases

    Evaluation: WD using integer LP
    - A desirable configuration set for AlexNet conv2 (forward), mini-batch size of 256, P100-SXM2
    - [Chart: each bar represents the proportion of micro-batch sizes and algorithms]

    Evaluation: WR using dynamic programming
    - μ-cuDNN achieved a 2.33x speedup on the forward convolution of AlexNet conv2
    - cudnnConvolutionForward of AlexNet conv2 on an NVIDIA Tesla P100-SXM2, workspace size of 64 MiB, mini-batch size of 256
    - [Chart: numbers on the rectangles represent micro-batch sizes]

    [Scatter plot: execution time (ms) vs. workspace size (MiB) for the cuDNN algorithms IMPLICIT_GEMM, IMPLICIT_PRECOMP_GEMM, GEMM, FFT, FFT_TILING, WINOGRAD_NONFUSED]
    [Bar chart: execution time (ms) of cudnnConvolutionForward for AlexNet conv2 under the "undivided", "powerOfTwo" and "all" micro-batching policies; the undivided mini-batch of 256 runs IMPLICIT_PRECOMP_GEMM, while the divided cases use micro-batches of 32 and 48 with FFT_TILING and WINOGRAD_NONFUSED, giving speedups of 2.09x and 2.33x]
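    The selection sketch referenced above: a toy dynamic program that splits a mini-batch into micro-batches and picks, per micro-batch, the fastest convolution algorithm whose workspace fits the limit. μ-cuDNN does this by querying cuDNN for real timings and workspace sizes; here the time/workspace table is fabricated purely for illustration.

```python
# Toy illustration of the micro-batching idea behind mu-cuDNN. The
# COST table is made up; the real tool measures cuDNN algorithms.
import functools

# COST[algo][micro_batch] = (time_ms, workspace_MiB) -- fabricated numbers.
COST = {
    "GEMM":     {32: (1.0, 8),  64: (1.9, 16), 128: (3.8, 32),  256: (7.5, 64)},
    "FFT":      {32: (0.8, 40), 64: (1.4, 80), 128: (2.6, 160), 256: (5.0, 320)},
    "WINOGRAD": {32: (0.6, 24), 64: (1.1, 48), 128: (2.1, 96),  256: (4.2, 192)},
}
MICRO_SIZES = sorted({mb for table in COST.values() for mb in table})

def fastest_algo(mb, limit):
    """Fastest algorithm for a micro-batch of size mb within the workspace limit."""
    options = []
    for algo, table in COST.items():
        if mb in table:
            time_ms, workspace = table[mb]
            if workspace <= limit:
                options.append((time_ms, algo))
    return min(options) if options else None

@functools.lru_cache(maxsize=None)
def plan(remaining, limit):
    """Dynamic program: minimum total time and split covering `remaining` samples."""
    if remaining == 0:
        return 0.0, ()
    best = None
    for mb in MICRO_SIZES:
        choice = fastest_algo(mb, limit)
        if mb > remaining or choice is None:
            continue
        sub_time, sub_split = plan(remaining - mb, limit)
        cand = (choice[0] + sub_time, ((mb, choice[1]),) + sub_split)
        if best is None or cand[0] < best[0]:
            best = cand
    return best

total, split = plan(256, 64)     # mini-batch 256, 64 MiB workspace limit
print(f"total {total:.1f} ms:", split)
# With these made-up numbers, four 64-sample Winograd micro-batches beat a
# single 256-sample GEMM call, mirroring the slide's observation that
# Winograd/FFT win when the workspace allows.
```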

  • Large-Scale Simulation and AI Coming Together [Ichimura et al., Univ. of Tokyo, IEEE/ACM SC17 Best Poster]

    130-billion-degree-of-freedom earthquake simulation of the entire Tokyo area on the K computer (ACM Gordon Bell Prize finalist; SC16 and SC17 Best Posters)

    [Figure: too many instances — earthquake, soft soil]

  • Cutting-Edge AI Research Infrastructures in Japan
    Accelerating BD/AI with HPC (w/ accompanying BYTES) — and my effort to design & build them

    Oct. 2015   TSUBAME-KFC/DL (Tokyo Tech./NEC)       1.4 AI-PF (Petaflops)                              In production
    Mar. 2017   AIST AI Cloud (AIST-AIRC/NEC)           8.2 AI-PF  (x5.8)                                  In production
    Mar. 2017   AI Supercomputer (Riken AIP/Fujitsu)    4.1 AI-PF                                          In production
    Aug. 2017   TSUBAME3.0 (Tokyo Tech./HPE)            47.2 AI-PF (x5.8); 65.8 AI-PF w/ TSUBAME2.5        In production
    Aug. 2018   ABCI (AIST-AIRC)                        550 AI-Petaflops (x11.7) — the AI-Exaflop era
    2020        Post-K                                  multi AI-Exaflops (x4~6?), an order of magnitude over ABCI   In preparation

    R&D investments into world-leading AI/BD HW & SW & algorithms and their co-design for cutting-edge infrastructure are absolutely necessary (just as with Japan's Post-K and the US ECP in HPC).
