SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects
for DNN Training
ERIC QIN*, ANANDA SAMAJDAR*, HYOUKJUN KWON*, VINEET NADELLA*, SUDARSHAN SRINIVASAN#, DIPANKAR DAS#,
BHARAT KAUL#, TUSHAR KRISHNA*
http://synergy.ece.gatech.edu
* GEORGIA TECH   # INTEL
Outline
• Motivation
  ○ GEMMs in Deep Learning
  ○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
  ○ Interconnect Implementations
  ○ Full System Design
• Evaluation
• Conclusion
Deep Learning Applications
Speech Recognition, Language Understanding, and Recommender Systems
What is the key computation for these Deep Learning applications?
Runtime breakdown on V100 GPU
[Figure: GEMM share of training runtime — 77% for Transformer (Language Understanding), 64% for GNMT (Machine Translation)]
Matrix multiplications (GEMMs) consume around 70% of the total runtime when training modern deep learning workloads.
GEMMs in Deep Learning
Training a layer involves three GEMMs:
• Forward Pass (Inference and Training): Input Feature Map [M x K] x Weights [K x N] = Output Feature Map [M x N]
• Backward Pass (Training): Output Feature Error [M x N] x Weights^T [N x K] = Input Feature Error [M x K]
• Gradient Computation (Training): Input Feature Map^T [K x M] x Output Feature Error [M x N] = 𝚫Weights [K x N]
GEMM MNK Dimension Representation:
• M dim: batch size
• N dim: number of channels in the next layer
• K dim: [H * W * C]
Figure derived from HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array, HPCA 2019, Song et al.
GEMM is a key compute primitive to accelerate in hardware to speed up training.
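As a concrete reference, here is a minimal NumPy sketch of the three training GEMMs above. The toy shapes and the variable names (x, w, dy) are our illustrative assumptions, not code from the talk:

```python
import numpy as np

M, K, N = 4, 6, 8           # batch, H*W*C, next-layer channels (toy sizes)
x = np.random.randn(M, K)   # input feature map
w = np.random.randn(K, N)   # weights

y = x @ w                   # forward pass: [M,K] x [K,N] -> [M,N]

dy = np.random.randn(M, N)  # output feature error from the next layer
dx = dy @ w.T               # backward pass: [M,N] x [N,K] -> [M,K]
dw = x.T @ dy               # gradient computation: [K,M] x [M,N] -> [K,N]
```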
Hardware for Accelerating GEMMs
• SIMT Architectures: Nvidia GTX GPUs
• SIMD Architectures: Microsoft Brainwave, ARM Trillium, Tesla FSDC
• Systolic Architectures: Google TPU, Xilinx xDNN, Nvidia Tensor Cores
Recently, systolic-array-based architectures have become popular for accelerating GEMMs.
Target comparison: Google TPU
TPU figure from https://cloud.google.com/tpu/docs/system-architecture
Our target comparison is with the Google TPU, which uses 128 x 128 systolic arrays.
Outline
• Motivation
  ○ GEMMs in Deep Learning
  ○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
  ○ Interconnect Implementations
  ○ Full System Design
• Evaluation
• Conclusion
TPU (Systolic Array)
Systolic Array Architectures
[Figure: 128 x 128 grid of MAC PEs, rows and columns indexed 0 to 127]
MAC PE microarchitecture:
• Weight buffer, loaded from the top input
• Streaming input from the left; streaming output to the right
• Multiplier (✕) and adder (+) with a Load (0) / Accumulate (1) select
• Partial sums flow to the bottom output
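The PE's behavior can be summarized with a small Python sketch; this is an illustrative toy model of a weight-stationary MAC PE, not the TPU's actual RTL:

```python
class MacPE:
    """Toy weight-stationary MAC PE (a sketch, not the TPU's design)."""
    def __init__(self):
        self.weight = 0.0

    def load(self, w_in):
        # Load phase: latch the stationary weight arriving from the top input.
        self.weight = w_in

    def step(self, act_in, psum_in):
        # Accumulate phase: multiply the streaming activation by the held
        # weight, add the partial sum from the PE above, and pass both on.
        psum_out = psum_in + act_in * self.weight
        act_out = act_in  # activation is forwarded to the right neighbor
        return act_out, psum_out
```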
TPU (Systolic Array)
Mapping GEMMs onto TPUs

GEMMs used for evaluation:

Workload     Application              Example Dimensions (M, N, K)
GNMT         Machine Translation      128, 2048, 4096 | 320, 3072, 4096 | 1632, 36548, 1024 | 2048, 4096, 32
DeepBench    General Workload         1024, 16, 500000 | 35, 8457, 2560
Transformer  Language Understanding   31999, 1024, 84 | 84, 1024, 4096
NCF          Collaborative Filtering  2048, 1, 128 | 256, 256, 2048

Let's map this GEMM: M = 256, N = 256, K = 2048 (the NCF GEMM highlighted above)!
TPU (Systolic Array)
Mapping GEMMs onto TPUs
** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).
• A 128 x 128 tile of the stationary matrix (K = 2048, N = 256) is loaded into the array.
• The streaming matrix (M = 256, K = 2048) is then streamed in, 128 columns at a time.
• The streaming elements get multiplied with the stationary elements, and the partial sums get accumulated down each column.
• The partial sum is reduced down each column; the process repeats for each 128 x 128 tile of the stationary matrix.
Systolic Arrays are popular because they enable efficient data reuse and are very simple to implement.
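To make the tiling concrete, here is a rough Python estimate of how many 128 x 128 folds a weight-stationary mapping needs. The per-fold cycle model (load + stream + drain) is our simplifying assumption, not the talk's exact timing:

```python
import math

def systolic_folds(M, N, K, dim=128):
    # Each fold maps one dim x dim tile of the stationary K x N matrix.
    return math.ceil(K / dim) * math.ceil(N / dim)

def systolic_cycles(M, N, K, dim=128):
    # Per fold: ~dim cycles to load weights, ~M cycles to stream the
    # MK matrix through, ~dim cycles for the last column to drain.
    return systolic_folds(M, N, K, dim) * (dim + M + dim)

# NCF GEMM from the evaluation table: M=256, N=256, K=2048
print(systolic_folds(256, 256, 2048))   # 32 folds
print(systolic_cycles(256, 256, 2048))  # ~16K cycles under this toy model
```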
Mapping GEMMs onto TPUs - Irregularity
Let's map another GEMM from the evaluation table: the GNMT GEMM with M = 2048, N = 4096, K = 32!
** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).
• The stationary matrix now has K = 32 and N = 4096: only 32 of the array's 128 rows hold useful weights, and a 128-column tile covers only a sliver of N.
• The streaming matrix (M = 2048, K = 32) streams in, and partial sums are reduced down each column.
• 75% of the PEs are not utilized for this GEMM.
The rigid structure of Systolic Arrays causes PE underutilization. How can we enable the remaining PEs?
Can we map another section of the stationary matrix onto the TPU (duplicating the streaming data) to increase utilization?
No. The systolic reduction will accumulate the dark blue and light purple partial product clusters together, which is functionally incorrect. This is due to the rigid aspect ratio of a systolic array.
Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.
Observation 2: Sparse weights cause underutilization of the PEs.
Observation 3: Large systolic arrays have large load and reduction latency.
Takeaway: Systolic Arrays have limitations on emerging GEMM workloads that are both sparse and irregular.
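A quick sanity check of the 75% figure, as a hedged sketch: utilization is modeled purely as the fraction of the 128 x 128 array covered by the stationary tile.

```python
def weight_stationary_utilization(K, N, dim=128):
    # Fraction of PEs holding useful stationary weights in one fold.
    mapped = min(K, dim) * min(N, dim)
    return mapped / (dim * dim)

# GNMT GEMM with K = 32, N = 4096: only 32 of 128 rows are occupied.
u = weight_stationary_utilization(K=32, N=4096)
print(f"{u:.0%} utilized, {1 - u:.0%} idle")  # 25% utilized, 75% idle
```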
Sparsity in DNN Models
[Figure: GNMT Pruning - Temporal Sparsity (https://www.intel.ai/compressing-gnmt-models)]
[Figure: Transformer Sparsity - Impact on BLEU (The State of Sparsity in Deep Neural Networks, Gale et al., arXiv)]
Weight sparsity ranges from 40% to 90%. Activation sparsity is approximately 30% to 70%, from ReLU, dropout, etc.
Mapping GEMMs onto TPUs - Sparsity
Usually these GEMMs are sparse!
** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).
What happens if half of the stationary matrix is zeros?
• The zeros are still mapped onto PEs, and multiplication with an operand that is zero is considered underutilized.
• Combined with the irregular shape (K = 32, N = 4096), 87.5% of the PEs are not utilized for this GEMM.
Weight stationary systolic arrays are underutilized for sparse GEMMs because they have to map zeros. How can we map only nonzeros stationary?
Can we map other nonzero elements where the idle PEs used to be?
No. The dark blue and light purple clusters accumulate together, which is incorrect. Systolic reduction only allows fixed-size dot products, i.e., the size of a column.
Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.
Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.
Observation 3: Large systolic arrays have large load and reduction latency.
Takeaway: Systolic Arrays have limitations on emerging GEMM workloads that are both sparse and irregular.
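Extending the earlier utilization sketch with a density factor reproduces the 87.5% figure. The uniform-sparsity assumption is ours, for illustration:

```python
def sparse_utilization(K, N, density, dim=128):
    # Fraction of PEs doing useful (nonzero) multiplies in one fold.
    mapped = min(K, dim) * min(N, dim)
    return (mapped / (dim * dim)) * density

# K = 32, N = 4096, half of the stationary weights are zero.
u = sparse_utilization(K=32, N=4096, density=0.5)
print(f"{u:.1%} utilized, {1 - u:.1%} idle")  # 12.5% utilized, 87.5% idle
```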
TPU (Systolic Array)
Mapping GEMMs onto TPUs - Scalability
** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).
Cycles 1, 2, 3, ..., 128: loading the 128 x 128 stationary tile takes 128 cycles, which scales at O(sqrt(N)) in the number of PEs N.
Reduction will take 128 cycles for the last accumulate to finish before loading a new portion of the stationary matrix.
Observation 1: GEMMs are irregular and may not align to the aspect ratio of the systolic array.
Observation 2: Sparse weights cause underutilization of the PEs and require variable sized dot product accumulation.
Observation 3: Large systolic arrays have significant load and reduction latency.
Takeaway: Systolic Arrays are underutilized on emerging GEMM workloads that are both sparse and irregular.
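For reference, a small sketch of how these latencies scale with PE count; the formulas follow the O(...) claims above, with illustrative constants:

```python
import math

def systolic_latencies(num_pes):
    side = int(math.isqrt(num_pes))
    return {"load": side, "reduce": side}             # O(sqrt(N)) each

def sigma_latencies(num_pes):
    return {"load": 1,                                # O(1) distribution
            "reduce": math.ceil(math.log2(num_pes))}  # O(log N) reduction

print(systolic_latencies(128 * 128))  # {'load': 128, 'reduce': 128}
print(sigma_latencies(128 * 128))     # {'load': 1, 'reduce': 14}
```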
Outline
• Motivation
  ○ GEMMs in Deep Learning
  ○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
  ○ Interconnect Implementations
  ○ Full System Design
• Evaluation
• Conclusion
Key Accelerator Requirements

Requirement       Systolic Array Limitation                      SIGMA Desired Traits
Flexibility       rigid aspect ratio                             ability to mimic any 2D aspect ratio
Sparsity Support  data forwarding every cycle requires the       map only nonzeros stationary; ability to create
                  systolic array to map zeros                    simultaneous variable sized dot products
Scalability       O(sqrt(N)) distribution, O(sqrt(N)) reduction  O(1) distribution, O(log N) reduction
With flexible and scalable interconnects between all PEs, SIGMA can meet all three requirements.
Systolic vs SIGMA High Level Interconnects
** Microarchitecture details on the networks will be discussed later.

4 x 4 Systolic Array (PEs A-P):
• rigid aspect ratio
• fixed size dot product
• O(sqrt(N)) distribution and reduction

16 PE SIGMA (PEs A-P, connected by a Distribution Network and a Reduction Network):
• Distribution network allows SIGMA to mimic any aspect ratio to address irregular GEMMs
• Ability to send any streaming element to any PE
• O(1) distribution
• Reduction network allows SIGMA to reduce variable sized dot products
• Addresses sparsity and irregularity
• O(log N) reduction
Irregular GEMMs on SIGMA
** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).
• Set up the stationary values. The 4 x 4 systolic array can hold only half of the stationary elements at a time (A-D and I-L), leaving rows idle, while the 16 PE SIGMA maps all sixteen stationary elements (A-P) at once through its distribution network.
• Next cycle: multicast the first row of the MK matrix (a, b) to the corresponding stationary elements.
• Next cycle: multicast the second row (c, d); then the third (e, f); then the fourth (g, h).
• After accumulation, SIGMA is done. The systolic array, however, has to map the other half of the stationary matrix (E-H and M-P) and stream in the MK matrix again (referred to as folding).
• SIGMA reduces the number of folds, which in turn reduces the number of memory references on the streaming matrix.
Final cycle count — Systolic Array total runtime: 24 cycles. SIGMA total runtime: 13 cycles.
SIGMA maximizes PE utilization with its flexible interconnects for irregular GEMMs.
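As a rough illustration of why fewer folds means fewer cycles, here is a toy Python cost model. The per-fold constants (load, stream, drain) are our simplifying assumptions; the model reproduces the trend rather than the exact 24-vs-13 cycle counts above:

```python
import math

def systolic_runtime(M, K, N, rows=4, cols=4):
    # The rigid rows x cols grid tiles the stationary K x N matrix,
    # so the fold count depends on the aspect ratio.
    folds = math.ceil(K / rows) * math.ceil(N / cols)
    return folds * (rows + M + rows)  # load + stream + drain per fold

def sigma_runtime(M, K, N, num_pes=16):
    # SIGMA can place stationary elements on any PE, so the fold count
    # depends only on the element count, not on the aspect ratio.
    folds = math.ceil((K * N) / num_pes)
    # O(1) distribution + stream + O(log N) reduction per fold
    return folds * (1 + M + math.ceil(math.log2(num_pes)))

# The 2 x 8 stationary / 4 x 2 streaming example above (M=4, K=2, N=8):
print(systolic_runtime(4, 2, 8))  # 24 cycles (2 folds)
print(sigma_runtime(4, 2, 8))     # 9 cycles (1 fold) under this toy model
```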
Sparse Irregular GEMMs on SIGMA
** Assuming the MK matrix is streaming and the KN matrix is stationary (aka weight stationary).
• Set up the stationary values. The stationary matrix is now sparse: SIGMA packs only the nonzero stationary elements onto its 16 PEs, while the systolic array must map the zeros too and can still only fit part of the matrix per fold.
• Each cycle, multicast the next row of nonzero streaming elements (a-g) to the corresponding stationary elements, and reduce.
• After accumulation, SIGMA is done. The systolic array has to map the remaining parts of the stationary matrix and stream in the MK matrix again, fold after fold.
Final cycle count — Systolic Array total runtime: 34 cycles. SIGMA total runtime: 13 cycles.
SIGMA maps only nonzeros stationary, thereby reducing the number of folds needed.
Outline
• Motivation
  ○ GEMMs in Deep Learning
  ○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
  ○ Interconnect Implementations
  ○ Full System Design
• Evaluation
• Conclusion
O(1) Distribution Topology
• Unicast (loading the stationary matrix): source to destination through either a Crossbar or a Benes network.
• Multicast (sending the streaming matrix): a single source drives multiple destinations through either topology.
SIGMA's distribution network can be either a Crossbar or a Benes network. We chose Benes because its switch count scales as O(N log N), versus O(N^2) for a crossbar.
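A quick check of the O(N^2) vs O(N log N) claim, using the standard counts of crossbar crosspoints and Benes 2x2 switches:

```python
import math

def crossbar_switches(n):
    # A full crossbar needs one crosspoint per (input, output) pair.
    return n * n

def benes_switches(n):
    # A Benes network has 2*log2(n) - 1 stages of n/2 2x2 switches.
    return (2 * int(math.log2(n)) - 1) * (n // 2)

for n in (16, 128, 1024):
    print(n, crossbar_switches(n), benes_switches(n))
# 16 256 56 | 128 16384 832 | 1024 1048576 9728
```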
O(log N) Reduction (Limitation of Adder Tree)
[Figure: a 32-input binary adder tree over multiplier outputs 0-31; the inputs hold different clusters of partial sums: a a a | b b b b b b | c c c c | d d d d d d d d d d d | e e e e e e e e. Switch types: 2-input BF16 Mult Switch, 2-input BF32 Adder Switch.]
• A binary adder tree achieves Log2(N) reduction time.
• However, a regular adder tree will accumulate a + b across a cluster boundary, which is incorrect functionality for variable sized dot products.

Forwarding Adder Network (FAN)
• FAN augments the adder tree with N-to-2 muxes and forwarding FFs, so a partial sum can bypass an adder and be forwarded to the next level.
• FAN is optimized for floating point reductions, commonly used during DNN training, and is pipelined at 1 cycle per adder level.
• Cycle 1: bypass conflicting partial sums (forward them past the adder to the next level).
• Cycle 2: the red partial sum completes and is sent to the output buffer.
• Cycle 3: the green and orange partial sums complete.
• Cycle 4: the blue partial sum completes.
• Cycle 5: the purple partial sum completes.
• Note: the output buffer has FFs to maintain correct timing, since different clusters may complete at different times.

Our novel FAN topology is both lightweight and flexible. It can replace regular adder trees in other hardware accelerators. More details, such as the routing algorithm and overhead analysis, can be found in the paper.
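To make the behavior concrete, here is a hedged Python sketch of a segmented tree reduction in the spirit of FAN: values whose neighbor belongs to a different cluster are forwarded instead of added, so each variable sized dot product completes in about log2(N) levels. The greedy adjacent pairing approximates the tree levels; this is not the actual FAN routing algorithm or RTL:

```python
def fan_like_reduce(values, cluster_ids):
    """Level-by-level segmented reduction: at each tree level, adjacent
    items from the same cluster are added (adder switch); items whose
    neighbor belongs to a different cluster are forwarded unchanged
    (forwarding FF). Returns one sum per cluster, left to right."""
    items = list(zip(values, cluster_ids))
    while True:
        nxt, i, merged = [], 0, False
        while i < len(items):
            if i + 1 < len(items) and items[i][1] == items[i + 1][1]:
                nxt.append((items[i][0] + items[i + 1][0], items[i][1]))
                i += 2
                merged = True
            else:
                nxt.append(items[i])  # bypass this adder level
                i += 1
        items = nxt
        if not merged:
            break
    return [v for v, _ in items]

# 8 products in three clusters of sizes 3, 2, 3:
print(fan_like_reduce([1, 1, 1, 2, 2, 5, 5, 5], list("aaabbccc")))
# [3, 4, 15]
```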
Outline
• Motivation
  ○ GEMMs in Deep Learning
  ○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
  ○ Interconnect Implementations
  ○ Full System Design
• Evaluation
• Conclusion
SIGMA High Level Diagram
Note: the SIGMA Engine contains multiple SIGMA units called Flex-DPEs.
• Data and Bitmap SRAM Banks: contain the bitmap compression format of the GEMM matrices.
• Global Controller: logic comparisons on bitmaps to determine which nonzero stationary elements are required.
• Sparsity Filter && Input Data Arbiter: reorganizes data for loading stationary elements and sending streaming elements.
• Accumulation SRAM: buffer for partial sum accumulations.
• SIGMA Engine: the compute engine. Each Flex-DPE consists of a Benes distribution network (distribution buses and switches), buffers, multipliers, and a FAN reduction network.
Outline
• Motivation
  ○ GEMMs in Deep Learning
  ○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
  ○ Interconnect Implementations
  ○ Full System Design
• Evaluation
• Conclusion
Methodology
• Hardware components are written in Verilog.
• Post-layout area and power numbers are on a 28nm process.
• An analytical model for cycle counts assumes uniform random sparsity.

GEMMs used for evaluation:

Workload     Application              Example Dimensions (M, N, K)
GNMT         Machine Translation      128, 2048, 4096 | 320, 3072, 4096 | 1632, 36548, 1024 | 2048, 4096, 32
DeepBench    General Workload         1024, 16, 500000 | 35, 8457, 2560
Transformer  Language Understanding   31999, 1024, 84 | 84, 1024, 4096
NCF          Collaborative Filtering  2048, 1, 128 | 256, 256, 2048
SIGMA vs TPU - Dense GEMMs
• For GEMMs with dimensions that fit perfectly on the TPU, SIGMA offers a slight speedup from its O(1) distribution and O(log N) reduction.
• For the DeepBench GEMM with N = 16 and K = 500000, SIGMA offers 8x speedup: the N dimension of 16 cannot fill the TPU, and the scalable O(log N) FAN reduction is more effective since K = 500000.
SIGMA performs on average 1.8x better than systolic array architectures for irregular GEMMs.
SIGMA vs TPU - Sparse GEMMs
SIGMA enables speedup by mapping only nonzero elements stationary.
SIGMA performs on average 5.7x better than systolic array architectures for sparse and irregular GEMMs.
SIGMA vs Sparse Accelerators - Qualitative Analysis
SIGMA performs on average 3x better than state-of-the-art sparse accelerators. In-depth analysis can be found in the paper.
Systolic Array vs SIGMA Comparison
** Effective TFLOPS is calculated by multiplying the base dense TFLOPS with the average efficiency computed across GEMMs.
• SIGMA consumes 38% more area and 82% more power than the Systolic Array.
• SIGMA achieves 5.7x higher effective TFLOPS, for 3.2x higher effective TFLOPS/W.
SIGMA consumes more resources but achieves higher effective TFLOPS/W.
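A worked instance of the footnote's formula; the dense-TFLOPS and efficiency values below are made-up placeholders for illustration, not the paper's measurements:

```python
def effective_tflops(dense_tflops, avg_efficiency):
    # Footnote formula: effective TFLOPS = dense TFLOPS x average efficiency.
    return dense_tflops * avg_efficiency

# Hypothetical numbers purely for illustration:
systolic = effective_tflops(dense_tflops=100.0, avg_efficiency=0.10)
sigma    = effective_tflops(dense_tflops=100.0, avg_efficiency=0.57)
print(sigma / systolic)  # ~5.7x higher effective TFLOPS
```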
Outline
• Motivation
  ○ GEMMs in Deep Learning
  ○ Utilization on TPU (Systolic Array)
• Accelerator Requirements
• SIGMA
  ○ Interconnect Implementations
  ○ Full System Design
• Evaluation
• Conclusion
Conclusion
● GEMM is a key component of deep learning workloads, but GEMMs are often irregular and sparse.
● Achieving high utilization on systolic arrays is challenging because of their rigid structure.
● SIGMA enables high compute utilization on sparse, irregular GEMMs.
● SIGMA performs 5.7x better than systolic arrays and 3x better than other state-of-the-art sparse accelerators, at the cost of extra hardware for the O(1) distribution and the novel FAN reduction interconnects.
● SIGMA achieves 3.2x higher effective TFLOPS/W than systolic arrays.
● Future work: optimizations such as power gating and software stack design.
Thank you! 😊
156
Backup Material
157
Walkthrough
Example GEMM (compressed bitmap equivalent)

Dim. Registers:
M-dim: 3
N-dim: 4
K-dim: 4

Stationary Bitmap (M x K):
0 1 1 1
1 1 1 0
1 1 1 0
Stationary nonzero data: A B C D E F G H I

Streaming Bitmap (K x N):
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Streaming nonzero data: a b c d e f g
158
Walkthrough steps:
i) Get bitmaps from memory
ii) Row-wise OR on streaming matrix
iii) Element-wise AND on stationary matrix
iv) Generate stationary' bitmap
v) Unicast stationary values
vi) Get source-destination pairs
vii) Multicast streaming values and reduce
Walkthrough (steps ii-iv)

Streaming Bitmap:
1 1 0 1
0 1 1 0
1 0 0 1
0 0 0 0
Row-wise OR of the streaming bitmap (one bit per k-row): 1 1 1 0

Stationary Bitmap:
0 1 1 1
1 1 1 0
1 1 1 0
Stationary nonzero data: A B C D E F G H I

Stationary' Bitmap (stationary AND the OR vector):
0 1 1 0
1 1 1 0
1 1 1 0
165
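A minimal Python/NumPy sketch of preprocessing steps ii-iv, using the example bitmaps above (illustrative only; the hardware computes this with bitwise logic, not NumPy):

```python
import numpy as np

stationary = np.array([[0, 1, 1, 1],
                       [1, 1, 1, 0],
                       [1, 1, 1, 0]])      # M x K bitmap
streaming  = np.array([[1, 1, 0, 1],
                       [0, 1, 1, 0],
                       [1, 0, 0, 1],
                       [0, 0, 0, 0]])      # K x N bitmap

# Step ii: row-wise OR marks which k-rows of the streaming matrix
# contain at least one nonzero.
k_active = streaming.any(axis=1).astype(int)     # -> [1, 1, 1, 0]

# Steps iii-iv: AND each stationary row with that vector; stationary
# values whose streaming k-row is all zero never need to be mapped.
stationary_prime = stationary & k_active         # broadcasts across M
print(stationary_prime)
# [[0 1 1 0]
#  [1 1 1 0]
#  [1 1 1 0]]  -> element C (row 0, k = 3) is dropped
```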
Walkthrough
The main idea is to find the necessary nonzero elements to map stationary.
[Figure: two Flex-DPEs (Flex-DPE 0 and Flex-DPE 1), each with a Benes distribution network of distribution buses and switches, buffers, four multipliers, and a FAN reduction network; the filtered stationary values A, B, D, E and F, G, H, I are ready to be loaded.]
166
Walkthrough
Cycle 1: unicast stationary elements to Flex-DPE 0 (A, B, D, E land in its multipliers).
167
Walkthrough
Cycle 2: unicast stationary elements to Flex-DPE 1 (F, G, H, I land in its multipliers).
168
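A small sketch of the unicast step (cycles 1 and 2): the stationary' nonzeros fill consecutive multiplier slots, one Flex-DPE per cycle. The 4-wide Flex-DPE matches the example figure; the loop is illustrative, not the controller's RTL.

```python
# Row-major nonzeros of the stationary' bitmap (C was filtered out).
stationary_prime_values = list("ABDEFGHI")
FLEX_DPE_WIDTH = 4   # multipliers per Flex-DPE in this example

for cycle, start in enumerate(
        range(0, len(stationary_prime_values), FLEX_DPE_WIDTH), start=1):
    chunk = stationary_prime_values[start:start + FLEX_DPE_WIDTH]
    print(f"cycle {cycle}: unicast {chunk} into Flex-DPE {cycle - 1}")
# cycle 1: unicast ['A', 'B', 'D', 'E'] into Flex-DPE 0
# cycle 2: unicast ['F', 'G', 'H', 'I'] into Flex-DPE 1
```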
Walkthrough (step vi)
Using the stationary' and streaming bitmaps, an output bitmap is built: each stationary' row ID (0, 1, 2) is matched against the streaming column IDs to generate source-destination pairs for distribution and reduction. (Parts of the output-bitmap table are elided in the source.)
170
The main idea is to find the matching source and destination indices.
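A hypothetical sketch of step vi: matching k-indices between streaming columns and stationary' rows yields the multicast destinations. It reproduces the example's pairs but is not the paper's exact controller logic.

```python
import numpy as np

stationary_prime = np.array([[0, 1, 1, 0],
                             [1, 1, 1, 0],
                             [1, 1, 1, 0]])   # M x K
streaming        = np.array([[1, 1, 0, 1],
                             [0, 1, 1, 0],
                             [1, 0, 0, 1],
                             [0, 0, 0, 0]])   # K x N

# Multiplier slots are filled row-major with stationary' nonzeros.
slot_of = {(m, k): slot for slot, (m, k) in
           enumerate(zip(*np.nonzero(stationary_prime)))}

# Each nonzero (k, n) of the streaming matrix is multicast to every
# multiplier slot holding a stationary value with the same k-index.
for n in range(streaming.shape[1]):
    dests = sorted(slot_of[(m, k)]
                   for k in np.nonzero(streaming[:, n])[0]
                   for m in np.nonzero(stationary_prime[:, k])[0])
    print(f"streaming column {n} -> multiplier slots {dests}")
# column 0 -> slots [1, 2, 4, 5, 7]  (a reaches D, G; f reaches B, F, I)
```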
Walkthrough
Cycle 3: multicast the 1st column of the streaming matrix and reduce. Values a and f are multicast to the multipliers whose stationary element shares the same k-index (a reaches D and G; f reaches B, F, and I), producing partial products Da, Ga, Bf, Ff, If.
172
Walkthrough
Cycle 4: multicast the 2nd column of the streaming matrix and reduce. Values b and d are multicast, producing Db, Gb, Ad, Ed, Hd, while cycle 3's products (Bf; Da and Ff; Ga and If) enter the FAN reduction tree.
174
Walkthrough
Cycle 5: multicast the 3rd column of the streaming matrix and reduce. Value e (the only nonzero in this column) is multicast, producing Ae, Ee, He; the partial sums Da + Ff and Ga + If form in the FAN tree.
175
Walkthrough
Cycle 6: multicast the last column of the streaming matrix and reduce. Values c and g are multicast, producing Dc, Gc, Bg, Fg, Ig; the partial sums Db + Ed and Gb + Hd form in the FAN tree.
176
Walkthrough
Cycle 7: FAN reduction. The remaining partial products (Ae, Bg, Dc, Ee, Fg, Gc, He, Ig) drain through the reduction tree, while Da + Ff, Db + Ed, and Gb + Hd are already complete.
177
Walkthrough
Cycle 8: FAN reduction continues; Bg, Dc, Fg, Ee, He, Gc, and Ig move down the tree.
178
Walkthrough
Cycle 9: FAN reduction. Gc + Ig is produced; Ee passes through; Dc and Fg advance toward their adder.
179
Walkthrough
Cycle 10: FAN reduction. Dc + Fg is produced, completing the outputs of the example GEMM.
180
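Finally, a minimal sketch of why tree-style reduction such as FAN finishes in O(log N) cycles, assuming a balanced binary adder tree with one level per cycle (the real FAN network adds forwarding links so that differently sized output groups can reduce without conflicts):

```python
# Reduce a list pairwise, one "cycle" per tree level.
def tree_reduce(values):
    cycle = 0
    while len(values) > 1:
        cycle += 1
        values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
        print(f"cycle {cycle}: {values}")
    return values[0]

# Reducing N = 8 partial products takes log2(8) = 3 cycles, versus
# N - 1 = 7 cycles on a linear (systolic-style) accumulation chain.
tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])   # -> 36 after 3 cycles
```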