53
A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms Yu (Emma) Wang, Gu-Yeon Wei, David Brooks Harvard University 3/3/2020 Contact: [email protected] ParaDnn github.com/Emma926/paradnn

P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Yu (Emma) Wang, Gu-Yeon Wei, David BrooksHarvard University

3/3/2020Contact: [email protected]

ParaDnngithub.com/Emma926/paradnn

Page 2: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Acknowledgement

Frank Chen, Glenn Holloway, Dan Janni, Peter Mattson, Lifeng Nai, David Patterson, Francesco Pontiggia, Parthasarathy Ranganathan, Vijay Reddi, Brennan Saeta, Zak Stone, Anitha Vijayakumar, Shibo Wang,Qiumin Xu, Doe Hyun Yoon, Cliff Young

Page 3: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Challenges with ML Benchmarking

● Diversity in deep learning models used○ Problem Domains, Models, Datasets

● Pace of field○ State-of-the-art models evolve every few months

● Varying evaluation metrics○ Accuracy, Time to train, Latency of inference

● Multi-disciplinary field○ Algorithms, Systems, Hardware, ML Software Stacks

Page 4: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

State of the art: MLPerf 0.6

Area Benchmark Dataset Model Reference Implementation

Vision Image classification ImageNet ResNet-50 TensorFlow

Object detection COCO 2017 Mask R-CNN Pytorch

Object detection COCO 2017 SSD-ResNet34 Pytorch

Language/Audio

Translation WMT Eng-Germ Transformer TensorFlow

Speech recognition WMT Eng-Germ GNMT PyTorch

Commerce Recommendation MovieLens-20M NCF PyTorch

Action Reinforcement Learning Go Mini-go TensorFlow

Page 5: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

State of the art: MLPerf 0.6

Area Benchmark Dataset Model Reference Implementation

Vision Image classification ImageNet ResNet-50 TensorFlow

Object detection COCO 2017 Mask R-CNN Pytorch

Object detection COCO 2017 SSD-ResNet34 Pytorch

Language/Audio

Translation WMT Eng-Germ Transformer TensorFlow

Speech recognition WMT Eng-Germ GNMT PyTorch

Commerce Recommendation MovieLens-20M NCF PyTorch

Action Reinforcement Learning Go Mini-go TensorFlow

Page 6: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Our Methodology

ParaDnn

Page 7: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Our Methodology

ParaDnn

Page 8: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

ParaDnn vs MLPerf

- Avoid drawing conclusions based on several arbitrary models

- Generate thousands of parameterized, end-to-end models

- Prepare hardware designs for future models

- Complement the use of existing real-world models, i.e. MLPerf

- Good for studying accuracy or convergence with real datasets

- Represent the specific models some people care about

ParaDnn

Page 9: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

ParaDnn Canonical Models

Fully Connected (FC)

CNNs: Residual, Bottleneck

RNNs: RNN, LSTM, GRU

# of Nodes # of NodesInput Output# of Layers

# of Res/Bottleneck Blocks (filter size)Input OutputFC Layerx 4

RNN or LSTM or GRU cell (size)Input Output# of Layers

RNN or LSTM or GRU cell

Page 10: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Models

Page 11: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Models

- ParaDnn covers a larger range than the real models- from 10k to ~1 billion parameters

Page 12: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Analysis Enabled by ParaDnn

- Roofline analysis of TPU v2- Homogenous Platform Comparison: TPU v2 vs v3- Heterogeneous Platform Comparison: TPU vs GPU

Page 13: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

The Roofline Model

13David Brooks, Gu-Yeon Wei

Page 14: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

The Roofline Model

14David Brooks, Gu-Yeon Wei

Peak FLOPS

Page 15: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

The Roofline Model

15David Brooks, Gu-Yeon Wei

Peak FLOPS

Memory Bandwidth

Page 16: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

The Roofline Model

16David Brooks, Gu-Yeon Weicompute-intensive

Page 17: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

The Roofline Model

17David Brooks, Gu-Yeon Weicompute-intensivememory-intensive

Page 18: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Transformer

18David Brooks, Gu-Yeon Wei

Page 19: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

FC Models

19David Brooks, Gu-Yeon Wei

ParaDnn sweeps a large range of models, from memory-bound to compute-bound.

Page 20: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

FC Models

20David Brooks, Gu-Yeon Wei

Compute-bound

Page 21: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

FC Models

21David Brooks, Gu-Yeon Wei

Memory-bound

Page 22: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

TPU v2 vs v3?

22

Page 23: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

How to upgrade to TPU v3?

23

TPU v2

Page 24: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

How to upgrade to TPU v3?

24

TPU v2TPU v3 (FLOPS )

Page 25: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

How to upgrade to TPU v3?

25

TPU v2TPU v3 (FLOPS )

TPU v3 (Mem BW )

Page 26: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

How to upgrade to TPU v3?

26

TPU v2TPU v3 (Mem BW )

TPU v3 (FLOPS )

TPU v3 (FLOPS Mem BW )

Page 27: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

How to upgrade to TPU v3?

27

TPU v2? x

? x

TPU v3 (FLOPS Mem BW )

Page 28: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Architecture of TPU v2 vs v3

28Figure is from https://cloud.google.com/tpu/docs/system-architecture

180 TFLOPS / Board

420 TFLOPS / Board

Page 29: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Google’s Choice of TPU v3

29

TPU v2

TPU v32.3 x

? x

Page 30: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

TPU v3 vs v2: FC Operation Breakdown

30

Page 31: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

TPU v3 vs v2: FC Operation Breakdown

31

Compute-bound: 2.3x speedup

Page 32: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

TPU v3 vs v2: FC Operation Breakdown

32

Memory-bound: 1.5x speedup

Page 33: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

TPU v3 vs v2: FC Operation Breakdown

33

Memory-bound, but benefit from 2x memory capacity:

3x speedup

Page 34: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Google’s Choice of TPU v3

34

TPU v2

TPU v32.3 x

1.5 x

Page 35: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

TPU v3 vs v2: FC Operation Breakdown

35

ParaDnn provides diverse set of operations, and shows different operations are sensitive to different system component upgrades.

Page 36: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

TPU vs GPU?

Page 37: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Hardware Platforms

37

Page 38: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Hardware Platforms

38

300 GB/s per core

Page 39: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

FC and CNNFC

FC

W

A

FCGradient

Weighted Sum

G

Page 40: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

FC and CNNFC CNN

FC

W

A

FCGradient

Weighted Sum

G

ConvA

ConvGradient

Weighted Sum

G

W Fewer Weights

Larger Conv ops

Page 41: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Hardware Platforms

41

300 GB/s per core

Page 42: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

FC TPU/GPU Speedup colored with Batch Size

9

0.35

42

Page 43: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

FC TPU/GPU Speedup colored with Batch Size

9

0.35

TPU is better

GPU is better

43

Page 44: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

FC TPU/GPU Speedup colored with Batch Size

9

0.35

TPU is better

GPU is better

44

Page 45: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

FC TPU/GPU Speedup colored with Node Size

9

45

More nodes More weights More memory-bound

Page 46: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Hardware Platforms

46

300 GB/s per core

1.44x

Page 47: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

CNN TPU/GPU Speedup colored with Batch Size

47

Page 48: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

CNN TPU/GPU Speedup colored with Batch Size

- Up to 6x speedup- TPU architecture and software

is highly optimized for CNNs

48

Page 49: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

CNN TPU/GPU Speedup colored with Batch Size

- All models runs faster on TPU.- Larger batch sizes lead to

higher speedups.

49

Page 50: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

CNN TPU/GPU Speedup colored with Filters

- More filters have higher speedup lower bounds

50

Page 51: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Conclusion

- Parameterized methodology: ParaDnn + a set of analysis methods- Single platform analysis: TPU v2- Homogenous platform comparison: TPU v2 vs v3- Heterogeneous platform comparison: TPU vs GPU

Page 52: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Limitations of this Work- Does not include:

- Inference- Multi-node system: multi-GPU, or TPU pods- Accuracy, convergence- Cloud overhead

- Tractability- Limit the range of hyperparameters and datasets

- Small batch sizes (<16) and large batch sizes (> 2k) are not studied- Synthetic datasets do not include data infeed overhead

- Iterations of TPU loop is 100. Larger numbers can slightly increase the performance.

Page 53: P A Systematic Methodology for Analysis of Deep DLearning Hardware …03-16-30)-03-16-55... · A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Questions?

ParaDnnAvailable: github.com/Emma926/paradnn