Scale-out Computing Model on Massive Core System: From HPC to Fabric-Based SoC Dr. Fu Li [email protected] Quantum Cloud Future (Beijing) Technologies Co., Ltd.

  • Scale-out Computing Model on Massive Core System: From HPC to Fabric-Based SoC

    Dr. Fu Li [email protected]

    Quantum Cloud Future (Beijing) Technologies Co., Ltd.

  • Quantum Cloud Future (Beijing) Technology Co. Ltd.

    Cook Book

    1. What is a Massive Core System (MCS)?
       1.1. HPC system
       1.2. GPU system
       1.3. Microslides: Fabric-based SoC

    2. Why is scale-out computing important in MCS?

    3. How to make MCS faster?
       3.1. MPI and OpenMP in HPC
       3.2. Memory coalescing and cudaDMA in GPU computing

    4. QCF’s scale-out computing model for Microslides (new)
       4.1. The hardware (Socionext)
       4.2. The architecture
       4.3. The result (ARM vs x86 vs GPU)

  • Introduction to Quantum Cloud

    Coming from a background in quantum calculation, we (1) ran large-scale molecular dynamics simulations on HPC clusters using Amber and Gromacs, and (2) optimized Fourier transforms and matrix operations on multicore systems.

    Topics shown on the slide: Quantum Theory and Spectroscopy, Molecular Dynamics, Statistical Mechanics, Fast Fourier Transform, HPC, MPI, OpenMP, CUDA, GPU switch, PacketShader, Content-Centric Networking, Cloud Storage, Doppler ASIC, Boba FPGA.

  • Introduction to Quantum Cloud (continued)

    Then we found that the GPU is a great tool for both molecular dynamics and matrix operations.

    Later we found similar systems with massive numbers of CPU cores.

    Today we will show some practical examples of our scale-out algorithm on these systems.

  • System and Cores: Communication Matters

    [Figure, QCF & SOCIONEXT, built up over several slides: number of cores (1 to 100,000) plotted against system power consumption in watts (10 to 1M). Systems shown: PC Server, Blade Server, Super Computer, GPU, GPU Cluster, Traditional ARM Server, ARM SoC, Microslides of ARM CPU, Microslides of ARM SoC. Group labels: General-purpose (twice) and Special-purpose. Further annotations: the years 2006, 2012 and 2018, and the connection types intra-CPU, inter-CPU and cluster.]

  • Data Communication Between Systems Is the Obstacle

    Hierarchical structure is critical for the Von Neumann architecture.

    [Diagram, built up over several slides: two systems, each labeled with cores, Intra CPU Fabric, Sockets Bus, Memory, Networking, Cache L2/L3 and Cache L1, with Cache/Storage and I/O labels.]

    Levels of parallelism: instruction-level, OS-level, algorithm-level.

    Techniques listed on the slide:
    • batch, share-nothing, stateless computing
    • big RAM, avoid context switching, TLB, cache-conscious
    • big.LITTLE, GPU, FPGA
    • fast cache, cache prefetch, vector processing, SIMD/AVX

    Consolidation will be the next wave of innovation in chip design and system optimization:
    • I/O consolidation: networking, bus, fabric
    • storage consolidation: memory, cache, networking buffer

  • Parallel and Scaling

  • Fabric-Based ARM SoC

    From SOCIONEXT

    • PCIe fabric for networking
    • 768 cores
    • c2c 10 Gbps, 36 microseconds latency
    • 1 TB DDR4 RAM
    • 700 watts TDP per chassis

    Watts per core:
    ARM SoC   1
    x86       16 to 25
    GPU       0.3 to 0.5
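    The chassis figures above can be turned into two quick estimates. The sketch below is not from the slides: it assumes the 700 W TDP is spread evenly over the 768 cores, and it models a c2c transfer with a simple latency-plus-bandwidth formula. The function names are hypothetical.

```python
# Figures quoted on the slide; the model below is an assumption, not vendor data.
CORES = 768
CHASSIS_TDP_WATTS = 700.0
LINK_BANDWIDTH_BPS = 10e9      # 10 Gbps link
LINK_LATENCY_S = 36e-6         # 36 microseconds


def watts_per_core(tdp_watts=CHASSIS_TDP_WATTS, cores=CORES):
    """Chassis TDP divided evenly over all cores (an assumed split)."""
    return tdp_watts / cores


def transfer_time(message_bytes, latency_s=LINK_LATENCY_S,
                  bandwidth_bps=LINK_BANDWIDTH_BPS):
    """Latency-plus-bandwidth estimate of one message transfer, in seconds."""
    return latency_s + message_bytes * 8 / bandwidth_bps


def latency_bound_size(latency_s=LINK_LATENCY_S,
                       bandwidth_bps=LINK_BANDWIDTH_BPS):
    """Message size in bytes at which wire time equals the fixed latency."""
    return latency_s * bandwidth_bps / 8


if __name__ == "__main__":
    print(f"watts per core: {watts_per_core():.2f}")
    print(f"latency-bound below about {latency_bound_size():.0f} bytes")
    for size in (1_000, 45_000, 1_000_000):
        print(f"{size:>9} bytes: {transfer_time(size) * 1e6:.1f} microseconds")
```

    Under these assumptions the chassis works out to roughly 0.91 W per core, in line with the "1" quoted for the ARM SoC, and messages smaller than about 45 kB spend more time in the fixed 36 microsecond latency than on the wire, which is one reason to batch small messages.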

  • Cluster Management Tools

    PBS
    • basic: batch process
    • pro: very fast, very flexible, normally used with MPI
    • cons: no isolation
    • scenario: scientific calculation

    OpenStack
    • basic: KVM
    • pro: very secure, very stable, system-level isolation
    • cons: high overhead, slow
    • scenario: private cloud

    Kubernetes
    • basic: container
    • pro: fast, secure, production ready
    • cons: container app, not flexible enough
    • scenario: application CI

    Mesos
    • basic: container/non-container
    • pro: fast, compatible with processes and containers, production ready, can be secure
    • cons: complexity
    • scenario: datacenter OS

  • Share-Nothing + Message Queue Architecture

    Stateless computing architecture

    [Diagram: a host containing compute cores and one I/O core.]

    Use a dedicated ("individual") core to do I/O for the host to increase throughput.
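    The pattern on this slide can be sketched in a few lines. This is a minimal illustration, not QCF's implementation: threads stand in for cores, `queue.Queue` stands in for the message queue, and all names (`compute_worker`, `io_worker`, `run_pipeline`) are hypothetical. Each compute worker is stateless and touches only the messages it receives, while a single dedicated I/O worker is the only place where output is written.

```python
import queue
import threading

STOP = object()  # sentinel that tells a thread to shut down


def compute_worker(inbox, outbox):
    """Stateless worker: each message carries everything needed to process it."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            break
        task_id, value = msg
        outbox.put((task_id, value * value))  # stand-in for real computation


def io_worker(outbox, sink):
    """Dedicated I/O 'core': the only thread that writes to the sink."""
    while True:
        msg = outbox.get()
        if msg is STOP:
            break
        sink.append(msg)


def run_pipeline(values, n_workers=3):
    """Fan tasks out to the compute workers and collect results via the I/O worker."""
    inbox, outbox, sink = queue.Queue(), queue.Queue(), []
    workers = [threading.Thread(target=compute_worker, args=(inbox, outbox))
               for _ in range(n_workers)]
    io_thread = threading.Thread(target=io_worker, args=(outbox, sink))
    for t in workers + [io_thread]:
        t.start()
    for task_id, value in enumerate(values):
        inbox.put((task_id, value))
    for _ in workers:
        inbox.put(STOP)          # one stop message per compute worker
    for t in workers:
        t.join()
    outbox.put(STOP)             # all results are queued; now stop the I/O worker
    io_thread.join()
    return sorted(sink)
```

    Because the workers keep no state between messages, any of them can pick up any task, and adding workers does not require coordination beyond the two queues. On a real multi-core host the threads would be replaced by processes or containers pinned to cores.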

  • Example: PacketShader on GPU

  • Example: Rendering on Arm

    [Chart "Render@Baremetal": Intel vs ARM on the scenes buggy, fishy cat, bmps, teeglasFX, splash and poked; vertical axis 0 to 4.]

    [Chart "Render@Container": bmw27 and classroom benchmarks on bare metal and in 1, 2 and 4 containers; vertical axis 0 to 2.]

    Annotations on the charts: a 3x improvement under concurrency, and a 1.8x improvement with multiple concurrent instances.

  • Example: Rendering on Arm (continued)

    [Chart: Intel vs ARM SoC for three measures (performance, scaled 1, scaled 2); vertical axis 0 to 30.]

    scaled 1: performance scaled by frequency and core count
    scaled 2: performance scaled by frequency, core count and watts
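    The slide does not give the formula behind "scaled 1" and "scaled 2". One plausible reading, shown here purely as an assumption, is to divide the raw score by clock frequency times core count, and then additionally by power draw. The function name and the input numbers are hypothetical, not measurements from the talk.

```python
def scaled_performance(raw_score, freq_ghz, cores, watts=None):
    """Normalize a raw benchmark score.

    Assumed definitions (not stated on the slide):
      scaled 1 = raw_score / (freq_ghz * cores)
      scaled 2 = scaled 1 / watts
    """
    per_core_clock = raw_score / (freq_ghz * cores)
    if watts is None:
        return per_core_clock          # "scaled 1"
    return per_core_clock / watts      # "scaled 2"


if __name__ == "__main__":
    # Made-up inputs, chosen only to show how the function is called.
    print(scaled_performance(24.0, freq_ghz=2.0, cores=4))
    print(scaled_performance(24.0, freq_ghz=2.0, cores=4, watts=1.5))
```

    Normalizing this way lets systems with very different clock rates, core counts and power budgets be compared on one chart, at the cost of hiding absolute throughput.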

  • Example: AI on Arm

    [Chart "Caffe@Container, ARM vs Intel vs GPU (scaled)": Intel, ARM and GPU 1070 on three runs labeled CIFAR 10 - 1, CIFAR 10 - 2 and CIFAR 10 - 3; vertical axis 0 to 1.6.]

  • Example: AI on Arm SoC

    [Two charts, Training and Inference: Intel vs SoC for caffe, scaled caffe, darknet and scaled darknet; vertical axes 0 to 16 and 0 to 9.]

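  • Quantum Cloud Future (Beijing) Technology Co. Ltd. (hereafter "Quantum Cloud") is a vertical-industry cloud computing company focused on the film and television industry.

    Quantum Cloud specializes in moving the film and television industry to the cloud. Working with internationally known film studios and visual-effects companies, it provides film and television customers with one-stop solutions covering production software, graphics workstations, high-performance storage, and rendering services.

    ADDRESS Room 2101, Office Tower A, Sanlitun SOHO, No. 8 Gongti North Road, Chaoyang District, Beijing

    NUMBER 010-53518265

    EMAIL [email protected]

    WEBSITE www.lzyco.com

    THANKS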