SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects
for DNN Training
ERIC QIN*, ANANDA SAMAJDAR*, HYOUKJUN KWON*, VINEET NADELLA*, SUDARSHAN SRINIVASAN#, DIPANKAR DAS#,
BHARAT KAUL#, TUSHAR KRISHNA*
http://synergy.ece.gatech.edu
* GEORGIA TECH   # INTEL
Outline
• Motivation
  ○ GEMMs in Deep Learning
  ○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
  ○ Interconnect Implementations
  ○ Full System Design
• Evaluation
• Conclusion
Deep Learning Applications
Speech Recognition, Language Understanding, and Recommender Systems
What is the key computation for these Deep Learning applications?
Runtime breakdown on V100 GPU
[Figure: GEMM share of training runtime — 77% for Transformer (Language Understanding), 64% for GNMT (Machine Translation)]
Matrix multiplications (GEMMs) consume around 70% of the total runtime when training modern deep learning workloads.
GEMMs in Deep Learning
Training a layer involves three GEMMs:
• Forward Pass (Inference and Training): Input Feature Map [M x K] x Weights [K x N] = Output Feature Map [M x N]
• Backward Pass (Training): Output Feature Error [M x N] x Weights^T [N x K] = Input Feature Error [M x K]
• Gradient Computation (Training): Input Feature Map^T [K x M] x Output Feature Error [M x N] = 𝚫Weights [K x N]
GEMM MNK Dimension Representation:
• M dim: batch size
• N dim: number of channels in the next layer
• K dim: [H * W * C]
Figure derived from HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, HPCA 2019, Song et al.
GEMM is a key compute primitive to accelerate in hardware to speed up training.
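As a concrete reference, here is a minimal NumPy sketch of the three training GEMMs above. The toy shapes and the variable names (x, w, dy) are our illustrative assumptions, not code from the talk:

```python
import numpy as np

M, K, N = 4, 6, 8           # batch, H*W*C, next-layer channels (toy sizes)
x = np.random.randn(M, K)   # input feature map
w = np.random.randn(K, N)   # weights

y = x @ w                   # forward pass: [M,K] x [K,N] -> [M,N]

dy = np.random.randn(M, N)  # output feature error from the next layer
dx = dy @ w.T               # backward pass: [M,N] x [N,K] -> [M,K]
dw = x.T @ dy               # gradient computation: [K,M] x [M,N] -> [K,N]
```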
Hardware for Accelerating GEMMs
• SIMT Architectures: Nvidia GTX GPUs
• SIMD Architectures: Microsoft Brainwave, ARM Trillium, Tesla FSDC
• Systolic Architectures: Google TPU, Xilinx xDNN, Nvidia Tensor Cores
Recently, systolic-array-based architectures have become popular for accelerating GEMMs.
Target comparison: Google TPU
TPU figure from https://cloud.google.com/tpu/docs/system-architecture
Our target comparison is with the Google TPU, which uses 128 x 128 systolic arrays.
Outline
• Motivation
  ○ GEMMs in Deep Learning
  ○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
  ○ Interconnect Implementations
  ○ Full System Design
• Evaluation
• Conclusion
TPU (Systolic Array)
Systolic Array Architectures
[Figure: 128 x 128 grid of MAC PEs, rows and columns indexed 0 to 127]
MAC PE microarchitecture:
• Weight buffer, loaded from the top input
• Streaming input from the left; streaming output to the right
• Multiplier (✕) and adder (+) with a Load (0) / Accumulate (1) select
• Partial sums flow to the bottom output
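The PE's behavior can be summarized with a small Python sketch; this is an illustrative toy model of a weight-stationary MAC PE, not the TPU's actual RTL:

```python
class MacPE:
    """Toy weight-stationary MAC PE (a sketch, not the TPU's design)."""
    def __init__(self):
        self.weight = 0.0

    def load(self, w_in):
        # Load phase: latch the stationary weight arriving from the top input.
        self.weight = w_in

    def step(self, act_in, psum_in):
        # Accumulate phase: multiply the streaming activation by the held
        # weight, add the partial sum from the PE above, and pass both on.
        psum_out = psum_in + act_in * self.weight
        act_out = act_in  # activation is forwarded to the right neighbor
        return act_out, psum_out
```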
TPU (Systolic Array)
Mapping GEMMs onto TPUs

GEMMs used for evaluation:

Workload     Application              Example Dimensions (M, N, K)
GNMT         Machine Translation      128, 2048, 4096 | 320, 3072, 4096 | 1632, 36548, 1024 | 2048, 4096, 32
DeepBench    General Workload         1024, 16, 500000 | 35, 8457, 2560
Transformer  Language Understanding   31999, 1024, 84 | 84, 1024, 4096
NCF          Collaborative Filtering  2048, 1, 128 | 256, 256, 2048

Let's map this GEMM: M = 256, N = 256, K = 2048 (the NCF GEMM highlighted above)!
TPU (Systolic Array)
Mapping GEMMs onto TPUs
** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).
• A 128 x 128 tile of the stationary matrix (K = 2048, N = 256) is loaded into the array.
• The streaming matrix (M = 256, K = 2048) is then streamed in, 128 columns at a time.
• The streaming elements get multiplied with the stationary elements, and the partial sums get accumulated down each column.
• The partial sum is reduced down each column; the process repeats for each 128 x 128 tile of the stationary matrix.
Systolic Arrays are popular because they enable efficient data reuse and are very simple to implement.
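To make the tiling concrete, here is a rough Python estimate of how many 128 x 128 folds a weight-stationary mapping needs. The per-fold cycle model (load + stream + drain) is our simplifying assumption, not the talk's exact timing:

```python
import math

def systolic_folds(M, N, K, dim=128):
    # Each fold maps one dim x dim tile of the stationary K x N matrix.
    return math.ceil(K / dim) * math.ceil(N / dim)

def systolic_cycles(M, N, K, dim=128):
    # Per fold: ~dim cycles to load weights, ~M cycles to stream the
    # MK matrix through, ~dim cycles for the last column to drain.
    return systolic_folds(M, N, K, dim) * (dim + M + dim)

# NCF GEMM from the evaluation table: M=256, N=256, K=2048
print(systolic_folds(256, 256, 2048))   # 32 folds
print(systolic_cycles(256, 256, 2048))  # ~16K cycles under this toy model
```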
Mapping GEMMs onto TPUs - Irregularity
Let's map another GEMM from the evaluation table: the GNMT GEMM with M = 2048, N = 4096, K = 32!
** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).
• The stationary matrix now has K = 32 and N = 4096: only 32 of the array's 128 rows hold useful weights, and a 128-column tile covers only a sliver of N.
• The streaming matrix (M = 2048, K = 32) streams in, and partial sums are reduced down each column.
• 75% of the PEs are not utilized for this GEMM.
The rigid structure of Systolic Arrays causes PE underutilization. How can we enable the remaining PEs?
Can we map another section of the stationary matrix onto the TPU (duplicating the streaming data) to increase utilization?
No. The systolic reduction will accumulate the dark blue and light purple partial product clusters together, which is functionally incorrect. This is due to the rigid aspect ratio of a systolic array.
Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.
Observation 2: Sparse weights cause underutilization of the PEs.
Observation 3: Large systolic arrays have large load and reduction latency.
Takeaway: Systolic Arrays have limitations on emerging GEMM workloads that are both sparse and irregular.
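A quick sanity check of the 75% figure, as a hedged sketch: utilization is modeled purely as the fraction of the 128 x 128 array covered by the stationary tile.

```python
def weight_stationary_utilization(K, N, dim=128):
    # Fraction of PEs holding useful stationary weights in one fold.
    mapped = min(K, dim) * min(N, dim)
    return mapped / (dim * dim)

# GNMT GEMM with K = 32, N = 4096: only 32 of 128 rows are occupied.
u = weight_stationary_utilization(K=32, N=4096)
print(f"{u:.0%} utilized, {1 - u:.0%} idle")  # 25% utilized, 75% idle
```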
Sparsity in DNN Models
[Figure: GNMT Pruning - Temporal Sparsity (https://www.intel.ai/compressing-gnmt-models)]
[Figure: Transformer Sparsity - Impact on BLEU (The State of Sparsity in Deep Neural Networks, Gale et al., arXiv)]
Weight sparsity ranges from 40% to 90%. Activation sparsity is approximately 30% to 70%, from ReLU, dropout, etc.
Mapping GEMMs onto TPUs - Sparsity
Usually these GEMMs are sparse!
** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).
What happens if half of the stationary matrix is zeros?
• The zeros are still mapped onto PEs, and multiplication with an operand that is zero is considered underutilized.
• Combined with the irregular shape (K = 32, N = 4096), 87.5% of the PEs are not utilized for this GEMM.
Weight stationary systolic arrays are underutilized for sparse GEMMs because they have to map zeros. How can we map only nonzeros stationary?
Can we map other nonzero elements where the idle PEs used to be?
No. The dark blue and light purple clusters accumulate together, which is incorrect. Systolic reduction only allows fixed-size dot products, i.e., the size of a column.
Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.
Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.
Observation 3: Large systolic arrays have large load and reduction latency.
Takeaway: Systolic Arrays have limitations on emerging GEMM workloads that are both sparse and irregular.
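Extending the earlier utilization sketch with a density factor reproduces the 87.5% figure. The uniform-sparsity assumption is ours, for illustration:

```python
def sparse_utilization(K, N, density, dim=128):
    # Fraction of PEs doing useful (nonzero) multiplies in one fold.
    mapped = min(K, dim) * min(N, dim)
    return (mapped / (dim * dim)) * density

# K = 32, N = 4096, half of the stationary weights are zero.
u = sparse_utilization(K=32, N=4096, density=0.5)
print(f"{u:.1%} utilized, {1 - u:.1%} idle")  # 12.5% utilized, 87.5% idle
```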
TPU (Systolic Array)
Mapping GEMMs onto TPUs - Scalability
** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).
Cycles 1, 2, 3, ..., 128: loading the 128 x 128 stationary tile takes 128 cycles, which scales at O(sqrt(N)) in the number of PEs N.
Reduction will take 128 cycles for the last accumulate to finish before loading a new portion of the stationary matrix.
Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.
Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.
Observation 3: Large systolic arrays have significant load and reduction latency.
Takeaway: Systolic Arrays are underutilized on emerging GEMM workloads that are both sparse and irregular.
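For reference, a small sketch of how these latencies scale with PE count; the formulas follow the O(...) claims above, with illustrative constants:

```python
import math

def systolic_latencies(num_pes):
    side = int(math.isqrt(num_pes))
    return {"load": side, "reduce": side}             # O(sqrt(N)) each

def sigma_latencies(num_pes):
    return {"load": 1,                                # O(1) distribution
            "reduce": math.ceil(math.log2(num_pes))}  # O(log N) reduction

print(systolic_latencies(128 * 128))  # {'load': 128, 'reduce': 128}
print(sigma_latencies(128 * 128))     # {'load': 1, 'reduce': 14}
```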
Outline
• Motivation
  ○ GEMMs in Deep Learning
  ○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
  ○ Interconnect Implementations
  ○ Full System Design
• Evaluation
• Conclusion
Key Accelerator Requirements

Requirement       Systolic Array Limitation                      SIGMA Desired Traits
Flexibility       rigid aspect ratio                             ability to mimic any 2D aspect ratio
Sparsity Support  data forwarding every cycle requires the       map only nonzeros stationary; ability to create
                  systolic array to map zeros                    simultaneous variable sized dot products
Scalability       O(sqrt(N)) distribution, O(sqrt(N)) reduction  O(1) distribution, O(log N) reduction
With flexible and scalable interconnects between all PEs, SIGMA can meet all three requirements.
Systolic vs SIGMA High Level Interconnects
** Microarchitecture details on the networks will be discussed later.

4 x 4 Systolic Array (PEs A-P):
• rigid aspect ratio
• fixed size dot product
• O(sqrt(N)) distribution and reduction

16 PE SIGMA (PEs A-P, connected by a Distribution Network and a Reduction Network):
• Distribution network allows SIGMA to mimic any aspect ratio to address irregular GEMMs
• Ability to send any streaming element to any PE
• O(1) distribution
• Reduction network allows SIGMA to reduce variable sized dot products
• Addresses sparsity and irregularity
• O(log N) reduction
Irregular GEMMs on SIGMA
** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).
• Set up the stationary values. The 4 x 4 systolic array can hold only half of the stationary elements at a time (A-D and I-L), leaving rows idle, while the 16 PE SIGMA maps all sixteen stationary elements (A-P) at once through its distribution network.
• Next cycle: multicast the first row of the MK matrix (a, b) to the corresponding stationary elements.
• Next cycle: multicast the second row (c, d); then the third (e, f); then the fourth (g, h).
• After accumulation, SIGMA is done. The systolic array, however, has to map the other half of the stationary matrix (E-H and M-P) and stream in the MK matrix again (referred to as folding).
• SIGMA reduces the number of folds, which in turn reduces the number of memory references on the streaming matrix.
Final cycle count — Systolic Array total runtime: 24 cycles. SIGMA total runtime: 13 cycles.
SIGMA maximizes PE utilization with its flexible interconnects for irregular GEMMs.
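As a rough illustration of why fewer folds means fewer cycles, here is a toy Python cost model. The per-fold constants (load, stream, drain) are our simplifying assumptions; the model reproduces the trend rather than the exact 24-vs-13 cycle counts above:

```python
import math

def systolic_runtime(M, K, N, rows=4, cols=4):
    # The rigid rows x cols grid tiles the stationary K x N matrix,
    # so the fold count depends on the aspect ratio.
    folds = math.ceil(K / rows) * math.ceil(N / cols)
    return folds * (rows + M + rows)  # load + stream + drain per fold

def sigma_runtime(M, K, N, num_pes=16):
    # SIGMA can place stationary elements on any PE, so the fold count
    # depends only on the element count, not on the aspect ratio.
    folds = math.ceil((K * N) / num_pes)
    # O(1) distribution + stream + O(log N) reduction per fold
    return folds * (1 + M + math.ceil(math.log2(num_pes)))

# The 2 x 8 stationary / 4 x 2 streaming example above (M=4, K=2, N=8):
print(systolic_runtime(4, 2, 8))  # 24 cycles (2 folds)
print(sigma_runtime(4, 2, 8))     # 9 cycles (1 fold) under this toy model
```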
Sparse Irregular GEMMs on SIGMA
** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).
• Set up the stationary values. The stationary matrix is now sparse: SIGMA packs only the nonzero stationary elements onto its 16 PEs, while the systolic array must map the zeros too and can still only fit part of the matrix per fold.
• Each cycle, multicast the next row of nonzero streaming elements (a-g) to the corresponding stationary elements, and reduce.
• After accumulation, SIGMA is done. The systolic array has to map the remaining parts of the stationary matrix and stream in the MK matrix again, fold after fold.
Final cycle count — Systolic Array total runtime: 34 cycles. SIGMA total runtime: 13 cycles.
SIGMA maps only nonzeros stationary, thereby reducing the number of folds needed.
Outline
• Motivation
  ○ GEMMs in Deep Learning
  ○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
  ○ Interconnect Implementations
  ○ Full System Design
• Evaluation
• Conclusion
O(1) Distribution Topology
• Unicast (loading the stationary matrix): source to destination through either a Crossbar or a Benes network.
• Multicast (sending the streaming matrix): a single source drives multiple destinations through either topology.
SIGMA's distribution network can be either a Crossbar or a Benes network. We chose Benes because its switch count scales as O(N log N), versus O(N^2) for a crossbar.
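A quick check of the O(N^2) vs O(N log N) claim, using the standard counts of crossbar crosspoints and Benes 2x2 switches:

```python
import math

def crossbar_switches(n):
    # A full crossbar needs one crosspoint per (input, output) pair.
    return n * n

def benes_switches(n):
    # A Benes network has 2*log2(n) - 1 stages of n/2 2x2 switches.
    return (2 * int(math.log2(n)) - 1) * (n // 2)

for n in (16, 128, 1024):
    print(n, crossbar_switches(n), benes_switches(n))
# 16 256 56 | 128 16384 832 | 1024 1048576 9728
```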
O(log N) Reduction (Limitation of Adder Tree)
[Figure: a 32-input binary adder tree over multiplier outputs 0-31; the inputs hold different clusters of partial sums: a a a | b b b b b b | c c c c | d d d d d d d d d d d | e e e e e e e e. Switch types: 2-input BF16 Mult Switch, 2-input BF32 Adder Switch.]
• A binary adder tree achieves Log2(N) reduction time.
• However, a regular adder tree will accumulate a + b across a cluster boundary, which is incorrect functionality for variable sized dot products.

Forwarding Adder Network (FAN)
• FAN augments the adder tree with N-to-2 muxes and forwarding FFs, so a partial sum can bypass an adder and be forwarded to the next level.
• FAN is optimized for floating point reductions, commonly used during DNN training, and is pipelined at 1 cycle per adder level.
• Cycle 1: bypass conflicting partial sums (forward them past the adder to the next level).
• Cycle 2: the red partial sum completes and is sent to the output buffer.
• Cycle 3: the green and orange partial sums complete.
• Cycle 4: the blue partial sum completes.
• Cycle 5: the purple partial sum completes.
• Note: the output buffer has FFs to maintain correct timing, since different clusters may complete at different times.

Our novel FAN topology is both lightweight and flexible. It can replace regular adder trees in other hardware accelerators. More details, such as the routing algorithm and overhead analysis, can be found in the paper.
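To make the behavior concrete, here is a hedged Python sketch of a segmented tree reduction in the spirit of FAN: values whose neighbor belongs to a different cluster are forwarded instead of added, so each variable sized dot product completes in about log2(N) levels. The greedy adjacent pairing approximates the tree levels; this is not the actual FAN routing algorithm or RTL:

```python
def fan_like_reduce(values, cluster_ids):
    """Level-by-level segmented reduction: at each tree level, adjacent
    items from the same cluster are added (adder switch); items whose
    neighbor belongs to a different cluster are forwarded unchanged
    (forwarding FF). Returns one sum per cluster, left to right."""
    items = list(zip(values, cluster_ids))
    while True:
        nxt, i, merged = [], 0, False
        while i < len(items):
            if i + 1 < len(items) and items[i][1] == items[i + 1][1]:
                nxt.append((items[i][0] + items[i + 1][0], items[i][1]))
                i += 2
                merged = True
            else:
                nxt.append(items[i])  # bypass this adder level
                i += 1
        items = nxt
        if not merged:
            break
    return [v for v, _ in items]

# 8 products in three clusters of sizes 3, 2, 3:
print(fan_like_reduce([1, 1, 1, 2, 2, 5, 5, 5], list("aaabbccc")))
# [3, 4, 15]
```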
Outline
• Motivation
  ○ GEMMs in Deep Learning
  ○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
  ○ Interconnect Implementations
  ○ Full System Design
• Evaluation
• Conclusion
SIGMA High Level Diagram
Note: the SIGMA Engine contains multiple SIGMA units called Flex-DPEs.
• Data and Bitmap SRAM Banks: contain the bitmap compression format of the GEMM matrices.
• Global Controller: logic comparisons on bitmaps to determine which nonzero stationary elements are required.
• Sparsity Filter && Input Data Arbiter: reorganizes data for loading stationary elements and sending streaming elements.
• Accumulation SRAM: buffer for partial sum accumulations.
• SIGMA Engine: the compute engine. Each Flex-DPE consists of a Benes distribution network (distribution buses and switches), buffers, multipliers, and a FAN reduction network.
Outline
• Motivation
  ○ GEMMs in Deep Learning
  ○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
  ○ Interconnect Implementations
  ○ Full System Design
• Evaluation
• Conclusion
Methodology
• Hardware components are written in Verilog.
• Post-layout area and power numbers are on a 28nm process.
• An analytical model for cycle counts assumes uniform random sparsity.

GEMMs used for evaluation:

Workload     Application              Example Dimensions (M, N, K)
GNMT         Machine Translation      128, 2048, 4096 | 320, 3072, 4096 | 1632, 36548, 1024 | 2048, 4096, 32
DeepBench    General Workload         1024, 16, 500000 | 35, 8457, 2560
Transformer  Language Understanding   31999, 1024, 84 | 84, 1024, 4096
NCF          Collaborative Filtering  2048, 1, 128 | 256, 256, 2048
SIGMA vs TPU - Dense GEMMs
• For GEMMs with dimensions that fit perfectly on the TPU, SIGMA offers a slight speedup from its O(1) distribution and O(log N) reduction.
• For the DeepBench GEMM with N = 16 and K = 500000, SIGMA offers 8x speedup: the N dimension of 16 cannot fill the TPU, and the scalable O(log N) FAN reduction is more effective since K = 500000.
SIGMA performs on average 1.8x better than systolic array architectures for irregular GEMMs.
SIGMA vs TPU - Sparse GEMMs
SIGMA enables speedup by mapping only nonzero elements stationary.
SIGMA performs on average 5.7x better than systolic array architectures for sparse and irregular GEMMs.
SIGMA vs Sparse Accelerators - Qualitative Analysis
SIGMA performs on average 3x better than state-of-the-art sparse accelerators. In-depth analysis can be found in the paper.
Systolic Array vs SIGMA Comparison
** Effective TFLOPS is calculated by multiplying the base dense TFLOPS with the average efficiency computed across GEMMs.
• SIGMA consumes 38% more area and 82% more power than the Systolic Array.
• SIGMA achieves 5.7x higher effective TFLOPS, for 3.2x higher effective TFLOPS/W.
SIGMA consumes more resources but achieves higher effective TFLOPS/W.
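A worked instance of the footnote's formula; the dense-TFLOPS and efficiency values below are made-up placeholders for illustration, not the paper's measurements:

```python
def effective_tflops(dense_tflops, avg_efficiency):
    # Footnote formula: effective TFLOPS = dense TFLOPS x average efficiency.
    return dense_tflops * avg_efficiency

# Hypothetical numbers purely for illustration:
systolic = effective_tflops(dense_tflops=100.0, avg_efficiency=0.10)
sigma    = effective_tflops(dense_tflops=100.0, avg_efficiency=0.57)
print(sigma / systolic)  # ~5.7x higher effective TFLOPS
```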
Outline
• Motivation
  ○ GEMMs in Deep Learning
  ○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
  ○ Interconnect Implementations
  ○ Full System Design
• Evaluation
• Conclusion
Conclusion
● GEMM is a key component of deep learning workloads, but GEMMs are often irregular and sparse.
● Achieving high utilization on systolic arrays is challenging because of their rigid structure.
● SIGMA enables high compute utilization on sparse, irregular GEMMs.
● SIGMA performs 5.7x better than systolic arrays and 3x better than other state-of-the-art sparse accelerators, at the cost of extra hardware for the O(1) distribution and the novel FAN reduction interconnects.
● SIGMA achieves 3.2x higher effective TFLOPS/W than systolic arrays.
● Future work: optimizations such as power gating and software stack design.
Thank you! 😊
156
Backup Material
157
Walkthrough
Example GEMM (compressed bitmap equivalent)

Dim. Registers:
M-dim: 3
N-dim: 4
K-dim: 4

Stationary Bitmap (M x K):
0 1 1 1
1 1 1 0
1 1 1 0
Stationary nonzero data: A B C D E F G H I

Streaming Bitmap (K x N):
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Streaming nonzero data: a b c d e f g
158
Walkthrough steps:
i) Get bitmaps from memory
ii) Row-wise OR on streaming matrix
iii) Element-wise AND on stationary matrix
iv) Generate stationary' bitmap
v) Unicast stationary values
vi) Get source-destination pairs
vii) Multicast streaming values and reduce
Walkthrough (steps ii-iv)

Streaming Bitmap:
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Row-wise OR of the streaming bitmap (one bit per k-row): 1 1 1 0

Stationary Bitmap:
0 1 1 1
1 1 1 0
1 1 1 0
Stationary nonzero data: A B C D E F G H I

Stationary' Bitmap (stationary AND the OR vector):
0 1 1 0
1 1 1 0
1 1 1 0
165
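A minimal Python/NumPy sketch of preprocessing steps ii-iv, using the example bitmaps above (illustrative only; the hardware computes this with bitwise logic, not NumPy):

```python
import numpy as np

stationary = np.array([[0, 1, 1, 1],
                       [1, 1, 1, 0],
                       [1, 1, 1, 0]])      # M x K bitmap
streaming  = np.array([[1, 1, 0, 1],
                       [0, 1, 1, 0],
                       [1, 0, 0, 1],
                       [0, 0, 0, 0]])      # K x N bitmap

# Step ii: row-wise OR marks which k-rows of the streaming matrix
# contain at least one nonzero.
k_active = streaming.any(axis=1).astype(int)     # -> [1, 1, 1, 0]

# Steps iii-iv: AND each stationary row with that vector; stationary
# values whose streaming k-row is all zero never need to be mapped.
stationary_prime = stationary & k_active         # broadcasts across M
print(stationary_prime)
# [[0 1 1 0]
#  [1 1 1 0]
#  [1 1 1 0]]  -> element C (row 0, k = 3) is dropped
```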
Walkthrough
The main idea is to find the necessary nonzero elements to map stationary.
[Figure: two Flex-DPEs (Flex-DPE 0 and Flex-DPE 1), each with a Benes distribution network of distribution buses and switches, buffers, four multipliers, and a FAN reduction network; the filtered stationary values A, B, D, E and F, G, H, I are ready to be loaded.]
166
Walkthrough
Cycle 1: unicast stationary elements to Flex-DPE 0 (A, B, D, E land in its multipliers).
167
Walkthrough
Cycle 2: unicast stationary elements to Flex-DPE 1 (F, G, H, I land in its multipliers).
168
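A small sketch of the unicast step (cycles 1 and 2): the stationary' nonzeros fill consecutive multiplier slots, one Flex-DPE per cycle. The 4-wide Flex-DPE matches the example figure; the loop is illustrative, not the controller's RTL.

```python
# Row-major nonzeros of the stationary' bitmap (C was filtered out).
stationary_prime_values = list("ABDEFGHI")
FLEX_DPE_WIDTH = 4   # multipliers per Flex-DPE in this example

for cycle, start in enumerate(
        range(0, len(stationary_prime_values), FLEX_DPE_WIDTH), start=1):
    chunk = stationary_prime_values[start:start + FLEX_DPE_WIDTH]
    print(f"cycle {cycle}: unicast {chunk} into Flex-DPE {cycle - 1}")
# cycle 1: unicast ['A', 'B', 'D', 'E'] into Flex-DPE 0
# cycle 2: unicast ['F', 'G', 'H', 'I'] into Flex-DPE 1
```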
Walkthrough (step vi)
Using the stationary' and streaming bitmaps, an output bitmap is built: each stationary' row ID (0, 1, 2) is matched against the streaming column IDs to generate source-destination pairs for distribution and reduction. (Parts of the output-bitmap table are elided in the source.)
170
The main idea is to find the matching source and destination indices.
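A hypothetical sketch of step vi: matching k-indices between streaming columns and stationary' rows yields the multicast destinations. It reproduces the example's pairs but is not the paper's exact controller logic.

```python
import numpy as np

stationary_prime = np.array([[0, 1, 1, 0],
                             [1, 1, 1, 0],
                             [1, 1, 1, 0]])   # M x K
streaming        = np.array([[1, 1, 0, 1],
                             [0, 1, 1, 0],
                             [1, 0, 0, 1],
                             [0, 0, 0, 0]])   # K x N

# Multiplier slots are filled row-major with stationary' nonzeros.
slot_of = {(m, k): slot for slot, (m, k) in
           enumerate(zip(*np.nonzero(stationary_prime)))}

# Each nonzero (k, n) of the streaming matrix is multicast to every
# multiplier slot holding a stationary value with the same k-index.
for n in range(streaming.shape[1]):
    dests = sorted(slot_of[(m, k)]
                   for k in np.nonzero(streaming[:, n])[0]
                   for m in np.nonzero(stationary_prime[:, k])[0])
    print(f"streaming column {n} -> multiplier slots {dests}")
# column 0 -> slots [1, 2, 4, 5, 7]  (a reaches D, G; f reaches B, F, I)
```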
Walkthrough
Cycle 3: multicast the 1st column of the streaming matrix and reduce. Values a and f are multicast to the multipliers whose stationary element shares the same k-index (a reaches D and G; f reaches B, F, and I), producing partial products Da, Ga, Bf, Ff, If.
172
Walkthrough
Cycle 4: multicast the 2nd column of the streaming matrix and reduce. Values b and d are multicast, producing Db, Gb, Ad, Ed, Hd, while cycle 3's products (Bf; Da and Ff; Ga and If) enter the FAN reduction tree.
174
Walkthrough
Cycle 5: multicast the 3rd column of the streaming matrix and reduce. Value e (the only nonzero in this column) is multicast, producing Ae, Ee, He; the partial sums Da + Ff and Ga + If form in the FAN tree.
175
Walkthrough
Cycle 6: multicast the last column of the streaming matrix and reduce. Values c and g are multicast, producing Dc, Gc, Bg, Fg, Ig; the partial sums Db + Ed and Gb + Hd form in the FAN tree.
176
Walkthrough
Cycle 7: FAN reduction. The remaining partial products (Ae, Bg, Dc, Ee, Fg, Gc, He, Ig) drain through the reduction tree, while Da + Ff, Db + Ed, and Gb + Hd are already complete.
177
Walkthrough
Cycle 8: FAN reduction continues; Bg, Dc, Fg, Ee, He, Gc, and Ig move down the tree.
178
Walkthrough
Cycle 9: FAN reduction. Gc + Ig is produced; Ee passes through; Dc and Fg advance toward their adder.
179
Walkthrough
Cycle 10: FAN reduction. Dc + Fg is produced, completing the outputs of the example GEMM.
180
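Finally, a minimal sketch of why tree-style reduction such as FAN finishes in O(log N) cycles, assuming a balanced binary adder tree with one level per cycle (the real FAN network adds forwarding links so that differently sized output groups can reduce without conflicts):

```python
# Reduce a list pairwise, one "cycle" per tree level.
def tree_reduce(values):
    cycle = 0
    while len(values) > 1:
        cycle += 1
        values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
        print(f"cycle {cycle}: {values}")
    return values[0]

# Reducing N = 8 partial products takes log2(8) = 3 cycles, versus
# N - 1 = 7 cycles on a linear (systolic-style) accumulation chain.
tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])   # -> 36 after 3 cycles
```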