2019 HotChips Tutorial: Cloud Acceleration at Microsoft © 2019 Microsoft.
[Figure: the flexibility vs. efficiency spectrum of compute devices. CPUs (control unit, ALU, registers) are the most flexible; GPUs, FPGAs, and NPUs sit in between; ASICs are the most efficient.]
https://www.usenix.org/conference/nsdi18/presentation/firestone
https://www.microsoft.com/en-us/research/uploads/prod/2018/06/ISCA18-Brainwave-CameraReady.pdf
https://github.com/opencomputeproject/Project-Zipline
Project Brainwave
Brainwave: a scalable DNN serving platform. Fast, flexible, friendly.

[Figure: a pretrained DNN model (in ONNX, etc.) is mapped onto a scalable DNN hardware microservice. The model's layers (L0, L1, ...) are partitioned across FPGAs (F) connected through network switches, each FPGA running the Brainwave soft NPU (instruction decoder & control driving the BW NPU).]
Sub-millisecond FPGA compute latencies at batch 1
Brainwave Soft NPU
[Figure: the Brainwave soft NPU instantiated on an FPGA, built around a Neural Functional Unit.]
[Figure: operand-size scaling for batched FCs and batched RNNs. The two input operands scale differently with problem size N: combinations of IN0 and IN1 scaling as O(1) or O(N) distinguish vector-vector, matrix-vector, and batched cases.]
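Batching matters because it amortizes reads of the large O(N)-scaling operand (the weight matrix) across many inputs. A rough arithmetic-intensity sketch for a batched fully connected layer (illustrative numbers and function name, not from the tutorial):

```python
# Rough arithmetic-intensity sketch for a fully connected layer (illustrative).
# At batch size B, an N x N matrix-vector product does 2*N*N*B FLOPs but still
# reads the N*N weight matrix only once, so FLOPs per weight byte grow with B.
def flops_per_weight_byte(n, batch, bytes_per_weight=2):
    flops = 2 * n * n * batch              # one multiply-accumulate per weight per input
    weight_bytes = n * n * bytes_per_weight
    return flops / weight_bytes

print(flops_per_weight_byte(2048, 1))    # 1.0: batch 1, one FLOP per fp16 weight byte
print(flops_per_weight_byte(2048, 64))   # 64.0: batching gives 64x weight reuse
```

This is why batch-1 serving on a DRAM-bound accelerator is weight-bandwidth limited, and why pinning weights in on-chip memory (as Brainwave does) restores utilization without batching.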
void LSTM(int steps) {
  for (int t = 0; t < steps; t++) {
    v_rd(NetQ);
    v_wr(InitialVrf, xt);
    v_rd(InitialVrf, xt);
    mv_mul(Wf);
    vv_add(bf);
    v_wr(AddSubVrf, xWf);
    v_rd(InitialVrf, h_prev);
    mv_mul(Uf);
    vv_add(xWf);
    v_sigm();
    vv_mul(c_prev);
    v_wr(AddSubVrf, ft_mod);
    v_rd(InitialVrf, h_prev);
    mv_mul(Uc);
    vv_add(xWc);
    v_tanh();
    vv_mul(it);
    vv_add(ft_mod);
    v_wr(MultiplyVrf, c_prev);
    v_wr(InitialVrf, ct);
    v_rd(InitialVrf, ct);
    v_tanh();
    vv_mul(ot);
    v_wr(InitialVrf, h_prev);
    v_wr(NetQ);
  }
}
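The instruction stream above is the standard LSTM cell recurrence expressed in the soft NPU ISA: mv_mul/vv_add chains form each gate's pre-activation, v_sigm/v_tanh apply the nonlinearities, and vv_mul/vv_add combine them into the new cell and hidden state. As a cross-check of what that dataflow computes, here is a scalar (hidden-size-1) Python sketch of the conventional LSTM equations (my reconstruction, not code from the tutorial):

```python
import math

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # Scalar stand-in for the vector/matrix ops: each product below is an
    # mv_mul on the NPU, each sum a vv_add, each gate nonlinearity a v_sigm/v_tanh.
    f = sigm(W["f"] * x_t + U["f"] * h_prev + b["f"])        # forget gate (Wf, Uf, bf)
    i = sigm(W["i"] * x_t + U["i"] * h_prev + b["i"])        # input gate
    g = math.tanh(W["c"] * x_t + U["c"] * h_prev + b["c"])   # candidate (the Uc/xWc/v_tanh chain)
    o = sigm(W["o"] * x_t + U["o"] * h_prev + b["o"])        # output gate
    c = f * c_prev + i * g       # ft_mod = f*c_prev, then vv_add with i*g
    h = o * math.tanh(c)         # v_tanh(ct) then vv_mul(ot)
    return h, c
```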
16X more efficient (area & energy)
Matrix-Vector Multiplier

[Figure: matrix-vector multiplier microarchitecture. A scheduler drives a Float16-to-MSFP converter on the Float16 vector input; parallel tile engines, each with an MRF (matrix register file) and VRF (vector register file), feed per-tile accumulators; a vector add-reduction and an MSFP-to-Float16 converter produce the Float16 vector output.]
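MSFP is a block floating-point format: a block of values shares one exponent while each value keeps only a narrow mantissa, which is what makes the dot-product datapath roughly 16x cheaper than float16 in area and energy. A minimal sketch of shared-exponent quantization (my illustration of the general block-floating-point idea; the actual MSFP block sizes and mantissa widths are not given here):

```python
import math

def quantize_block(values, mantissa_bits=4):
    """Quantize a block of floats to one shared exponent + narrow integer mantissas.

    Illustrative block-floating-point scheme: take the exponent of the largest
    magnitude in the block, then round every value to mantissa_bits of fraction
    at that scale. Small values lose precision; the large ones set the range.
    """
    max_mag = max(abs(v) for v in values)
    if max_mag == 0.0:
        return 0, [0] * len(values)
    shared_exp = math.frexp(max_mag)[1]            # e such that max_mag = m * 2^e, m in [0.5, 1)
    scale = 2.0 ** (shared_exp - mantissa_bits)    # weight of one mantissa LSB for this block
    mantissas = [round(v / scale) for v in values]
    return shared_exp, mantissas

def dequantize_block(shared_exp, mantissas, mantissa_bits=4):
    scale = 2.0 ** (shared_exp - mantissa_bits)
    return [m * scale for m in mantissas]
```

With a shared exponent, each multiply in a dot product becomes a small integer multiply, and exponent handling is paid once per block rather than once per element, which is where the area and energy savings come from.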
https://www.microsoft.com/en-us/research/uploads/prod/2018/06/ISCA18-Brainwave-CameraReady.pdf
Device         | Logic | SRAM | DSPs | MHz | Peak TFLOPs
Stratix V D5   | 87%   | 59%  | 66%  | 200 | 2.4
Arria 10 1150  | 51%   | 80%  | 100% | 300 | 9.8
Stratix 10 280 | 91%   | 69%  | 91%  | 250 | 48
Benchmark | Parameters     | Device   | Latency (ms) | TFLOPs | % Utilization
GRU       | h=2816, t=750  | BW       | 1.987        | 35.92  | 74.8
GRU       | h=2816, t=750  | Titan Xp | 178.6        | 0.4    | 3.3
GRU       | h=2560, t=375  | BW       | 0.993        | 29.69  | 61.8
GRU       | h=2560, t=375  | Titan Xp | 74.62        | 0.4    | 3.3
GRU       | h=2048, t=375  | BW       | 0.954        | 19.79  | 41.2
GRU       | h=2048, t=375  | Titan Xp | 51.59        | 0.37   | 3.0
GRU       | h=1536, t=375  | BW       | 0.951        | 11.17  | 23.3
GRU       | h=1536, t=375  | Titan Xp | 31.73        | 0.33   | 2.8
GRU       | h=1024, t=1500 | BW       | 3.792        | 4.98   | 10.4
GRU       | h=1024, t=1500 | Titan Xp | 59.51        | 0.32   | 2.6
GRU       | h=512, t=1     | BW       | 0.013        | 0.25   | 0.5
GRU       | h=512, t=1     | Titan Xp | 0.06         | 0.05   | 0.4
               | Titan Xp | BW_S10
Numerical Type | Float32  | MSFP8
Peak TFLOPs    | 12.1     | 48
TDP (W)        | 250      | 125
Process (nm)   | TSMC 16  | Intel 14
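From the quoted peak TFLOPs and TDP figures, peak throughput per watt follows directly (simple arithmetic on the table's numbers; achieved efficiency also depends on utilization, shown in the benchmark table):

```python
# Peak TFLOPs per watt from the quoted figures (peak, not achieved).
bw_s10 = 48 / 125       # Brainwave on Stratix 10 (MSFP8): 0.384 TFLOPs/W
titan_xp = 12.1 / 250   # Titan Xp (Float32): 0.0484 TFLOPs/W
print(bw_s10 / titan_xp)  # ~7.93x peak efficiency advantage for BW_S10
```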
Stratix V RNN-optimized NPUs in Scale Production
https://blogs.bing.com/search/2017-12/search-2017-12-december-ai-update
https://www.microsoft.com/en-us/research/uploads/prod/2018/03/mi0218_Chung-2018Mar25.pdf
Corsica/Zipline
Corsica

[Figure: Corsica block diagram. The application interface (PCIe interface with SR-IOV; SR-IOV, security, and scheduling/QoS; command processing; DMA engine; completions & notifications) fronts a fast path of 4 compression and 4 decompression pipelines (compress/decompress, encrypt/decrypt, authentication) plus a slow path. A security and management processor handles system management, key management and key derivation, and a config key vault. The device sits in the cloud socket.]

Zipline Fast Path
The fast path transforms data and verifies its own output:
  Compress data: XP10, Gzip, Zlib
  Encrypt data: AES-GCM, XTS, XEX
  Authenticate data: CRC, SHA2, HMAC-SHA2

[Figure: redundancy in the pipeline. Alongside the command/data path (key generation from the key vault, then compress, encrypt, and authenticate), a duplicate key generation feeds a checker that decrypts, decompresses, and authenticates the output, comparing the CRC-64 of the reconstructed raw data against the CRC-64 of the original: "perform the work" vs. "check the work".]
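The perform-the-work / check-the-work pattern can be sketched end to end: checksum the raw data up front, run the transform, then independently invert it and compare checksums before committing the result. A stdlib-only Python sketch, with zlib's Deflate and CRC-32 standing in for the hardware XP10 engines and CRC-64 (assumptions for illustration, not the Corsica implementation):

```python
import zlib

def compress_checked(raw: bytes) -> bytes:
    """Compress, then self-check by decompressing and comparing raw-data CRCs.

    Mirrors the Zipline-style redundancy flow: the checker recomputes the CRC
    over independently reconstructed data, so a fault anywhere in the pipeline
    is caught before corrupt output is committed.
    """
    crc_before = zlib.crc32(raw)             # CRC over raw data (CRC-64 in hardware)
    compressed = zlib.compress(raw)          # "perform the work"
    restored = zlib.decompress(compressed)   # independent inverse path
    if zlib.crc32(restored) != crc_before:   # "check the work": CRCs must match
        raise RuntimeError("compression self-check failed")
    return compressed

data = b"hotchips " * 100
assert zlib.decompress(compress_checked(data)) == data
```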
[Chart: total throughput (20 Gbps to 100 Gbps) vs. compression ratio, with an FPGA XP9 1-engine design as the reference point.]