35

2019 HotChips Tutorial: Cloud Acceleration at Microsoft ......GRU h=1536, t=375 BW 0.951 11.17 23.3 GRU h=1536, t=375 Titan Xp 31.73 0.33 2.8 GRU h=1024, t=1500 BW 3.792 4.98 10.4

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 2

  • EFFICIENCYFLEXIBILITY

    FPGAsControl Unit (CU)

    Registers

    Arithmetic Logic

    Unit (ALU)

    CPUs GPUsASICsNPUs

    2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 3

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 4

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 5

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 6

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 7

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 8

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 9

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 10

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 11

  • https://www.usenix.org/conference/nsdi18/presentation/firestone

    https://www.microsoft.com/en-us/research/uploads/prod/2018/06/ISCA18-Brainwave-CameraReady.pdf

    https://github.com/opencomputeproject/Project-Zipline2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 12

  • Project Brainwave

  • BrainwaveA Scalable DNN Serving PlatformFast:FlexibleFriendly:

    F F F

    L0

    L1

    F F F

    L0

    Pretrained DNN Modelin ONNX, etc.

    Scalable DNN Hardware Microservice

    BrainwaveSoft NPU

    Instr Decoder & Control

    BW NPU

    Network switches

    FPGAs

    2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 14

  • 15

    Sub-millisecond FPGA compute latencies at batch 1

    2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft.

  • Brainwave Soft NPU

    FPGA

    NeuralFunctional

    Unit

    2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 16

  • IN0: O(1)IN1: O(1)

    IN0: O(N)IN1: O(N)

    IN0: O(N)IN1: O(1)

    Batched FCsBatched RNNs

    2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 17

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 18

  • 192019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft.

  • 1. void LSTM(int steps) {2. for (int t = 0; t < steps; t++) {3. v_rd(NetQ);4. v_wr(InitialVrf, xt);5. v_rd(InitialVrf, xt);6. mv_mul(Wf);7. vv_add(bf);8. v_wr(AddSubVrf, xWf);9. v_rd(InitialVrf, h_prev);10. mv_mul(Uf);11. vv_add(xWf);12. v_sigm();13. vv_mul(c_prev);14. v_wr(AddSubVrf, ft_mod);15. v_rd(InitialVrf, h_prev);

    16. mv_mul(Uc);17. vv_add(xWc);18. v_tanh();19. vv_mul(it);20. vv_add(ft_mod);21. v_wr(MultiplyVrf, c_prev);22. v_wr(InitialVrf, ct);23. v_rd(InitialVrf, ct);24. v_tanh();25. vv_mul(ot);26. v_wr(InitialVrf, h_prev);27. v_wr(NetQ);28. }29. }

    2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 20

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 21

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft.

    16X more efficient (area & energy)

    Matrix-VectorMultiplier

    Tile Engine

    VRF

    ...

    Accumulator

    Vector Add-Reduction

    Sche

    dule

    r

    Float16 to MSFPConverter

    MSFP to Float16Converter

    ...

    Float16 Vector Input

    Float16 Vector Output

    Tile Engine

    VRF

    Tile Engine

    VRF

    Accumulator Accumulator

    MRF MRF MRF

    22

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft.

    https://www.microsoft.com/en-us/research/uploads/prod/2018/06/ISCA18-Brainwave-CameraReady.pdf

    Device Logic SRAM DSPs MHz Peak TFLOPsStratix V D5 87% 59% 66% 200 2.4Arria 10 1150 51% 80% 100% 300 9.8Stratix 10 280 91% 69% 91% 250 48

    Benchmark Parameters Device Latency (ms) TFLOPs % UtilizationGRU h=2816, t=750 BW 1.987 35.92 74.8GRU h=2816, t=750 Titan Xp 178.6 0.4 3.3GRU h=2560, t=375 BW 0.993 29.69 61.8GRU h=2560, t=375 Titan Xp 74.62 0.4 3.3GRU h=2048,t=375 BW 0.954 19.79 41.2GRU h=2048,t=375 Titan Xp 51.59 0.37 3GRU h=1536, t=375 BW 0.951 11.17 23.3GRU h=1536, t=375 Titan Xp 31.73 0.33 2.8GRU h=1024, t=1500 BW 3.792 4.98 10.4GRU h=1024, t=1500 Titan Xp 59.51 0.32 2.6GRU h=512, t=1 BW 0.013 0.25 0.5GRU h=512, t=1 Titan Xp 0.06 0.05 0.4

    Titan Xp BW_S10Numerical Type Float32 MSFP8Peak TFLOPs 12.1 48TDP (W) 250 125Process (nm) TSMC 16 Intel 14

    23

  • Stratix V RNN-optimized NPUs in Scale Production

    https://blogs.bing.com/search/2017-12/search-2017-12-december-ai-update

    https://www.microsoft.com/en-us/research/uploads/prod/2018/03/mi0218_Chung-2018Mar25.pdf

    2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 24

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 25

  • Corsica/Zipline

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 27

  • Corsica

    Sys Mgmt

    PCIe Iface

    SR-IOV

    Application Interface

    SR-IOV, Security, Scheduling/QoSCmd Processing

    DMA Engine

    Completions & Notifications

    Key Management

    and Key Derivation

    CfgKey Vault

    Security and Management

    Processor

    Fast Path Slow Path

    4 Compression & 4 Decompression

    PipelinesCompress / Decompress

    Encrypt / DecryptAuthentication

    Zipline Fast Path

    Clo

    ud S

    ocke

    t

    2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 28

  • Compress Data: XP10, Gzip, Zlib Encrypt Data: AES-GSM, XTS, XEX Authenticate Data: CRC, SHA2, HMAC-SHA2

    Decrypt Data Decompress Data Authenticate Data

    KeyGeneration

    Key Vault

    CRC-64(Data)

    CRC-64(Data)

    CRCon

    Raw Data

    =?

    command

    dataCRCon

    Raw Data

    Dup KeyGeneration

    PerformThe Work

    CheckThe Work

    Redundancy

    2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 29

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 30

  • Tota

    l Thr

    ough

    put

    20 Gbps

    40 Gbps

    60 Gbps

    80 Gbps

    100 Gbps

    Compression Ratio

    FPGA XP91 engine

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 32

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 33

  • 2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft. 34