
Page 1: Tensorlib: A Spatial Accelerator Generation Framework for

Tensorlib: A Spatial Accelerator Generation Framework for Tensor Algebra

Liancheng Jia1, Zizhang Luo1, Liqiang Lu1, and Yun Liang1,2

1Center for Energy-Efficient Computing and Applications, School of EECS, Peking University

2Pengcheng laboratory, China{jlc, semiwaker, liqianglu, ericlyun}@pku.edu.cn

1

Page 2: Tensorlib: A Spatial Accelerator Generation Framework for

Overview

• HASCO: HW/SW Co-design for Tensor Computation

• TENET: Modeling Tensor Dataflow Based on Relation-centric Notation

• Tensorlib: Auto-Generation of Spatial Hardware Accelerators

2

Page 3: Tensorlib: A Spatial Accelerator Generation Framework for

Spatial Accelerator Architecture

[Figure: a 2D PE array (PE0,0 … PEM,N) connected to a banked on-chip buffer (Bank 1, Bank 2, …) and a controller, fed from main memory under host-CPU control]

Typical Architecture for Spatial Accelerators: PE array, on-chip buffer, controller

3

Page 4: Tensorlib: A Spatial Accelerator Generation Framework for

Spatial Accelerators for Tensor Algebra

Examples: Google TPU, MIT Eyeriss, ShiDianNao

[Figure: block diagrams of the three accelerators — HBM/DRAM/SRAM, MXUs and vector units, input/output/filter buffers, NoC, controllers, and PE arrays]

Common traits: multiple memory banks, 2D PE array, unified controller, multi-level storage

4

Page 5: Tensorlib: A Spatial Accelerator Generation Framework for

Various Hardware Dataflows

• Output-stationary systolic

• Weight-stationary systolic (applied by TPU)

• Multicast + systolic (applied by ShiDianNao)

• Row stationary (applied by Eyeriss)

• Reduction tree (applied by MAERI)

• Multicast

Jouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit." ISCA 2017.
Chen, Yu-Hsin, et al. "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks." JSSC 2016.
Du, Zidong, et al. "ShiDianNao: Shifting vision processing closer to the sensor." ISCA 2015.
Kwon, H., et al. "MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects." ASPLOS 2018.

5

Page 6: Tensorlib: A Spatial Accelerator Generation Framework for

Architecture Similarity and Difference

Similarities:

• Multi-level storage hierarchy; on-chip buffer is divided into banks

• PEs form a spatial array

• Controlled by a unified controller

Differences:

• PE array dataflow

• PE size and data type

• Kernel implementation

• NoC architecture

6

Page 7: Tensorlib: A Spatial Accelerator Generation Framework for

Current Solution of Spatial Accelerator Development

                      Expression                 Performance  Productivity  Dataflow Support
Manual HDL            Verilog                    ★★★          ★             ★
High-Level Synthesis  C/C++                      ★★           ★★            ★
HLS + DSL             Halide / TVM DSL           ★            ★★★           ★★
Tensorlib (Ours)      Space-Time Transformation  ★★★          ★★★           ★★★

Dilemma between productivity and performance

7

Page 8: Tensorlib: A Spatial Accelerator Generation Framework for

Overview of Tensorlib

Input: Compute Definition + Dataflow Definition
(e.g., Tensor A: Systolic ↑; Tensor B: Stationary; Tensor C: Systolic →)

Flow: Dataflow Generation -> Space-Time Dataflow Analysis -> PE internal modules + PE Interconnection Codegen -> Chisel -> Verilog -> FPGA/ASIC Synthesis

Benefits: simple expression, various dataflows, good productivity, high performance

8

Page 9: Tensorlib: A Spatial Accelerator Generation Framework for

Dataflow Analysis

9

Page 10: Tensorlib: A Spatial Accelerator Generation Framework for

Relations in Spatial Accelerators

Three relations connect the iteration domain, the hardware, and the tensor data:

• Dataflow relation (instance-hardware): maps each instance [i, j, k] in the iteration domain to a PE and to a position in the execution sequence (cycle)

• Data assignment (hardware-data): maps each tensor element (e.g., A[0][2], B[2][0], Y[0][0]) to a PE and to a cycle

• Instance-data relation: maps each instance to the input and output tensor elements it accesses

[Figure: iteration domain, PE array, and input/output tensors linked by the three relations]

L. Lu et al. "TENET: A Framework for Modeling Tensor Dataflow Based on Relation-centric Notation." ISCA 2021

10

Page 11: Tensorlib: A Spatial Accelerator Generation Framework for

How to represent relations?

loop nest:
for (i = 0; i < 3; i++)
  for (j = 0; j < 3; j++)
    for (k = 0; k < 3; k++)
      instance [i,j,k]: Y[i][j] += A[i][k] * B[k][j];

Instance-data relation: a linear map from the instance [i, j, k] to a tensor index, e.g. for A[i, k]:

[1 0 0]   [i]   [i]
[0 0 1] · [j] = [k]
          [k]

Instance-hardware relation: maps the instance [i, j, k] to PE[x, y] and Cycle[t] — the Space-Time Transformation.

Hardware-data relation: ? (obtained by composing the two)

11

Page 12: Tensorlib: A Spatial Accelerator Generation Framework for

Space-Time Transformation

Space-Time Transformation (STT): transforms the loop iteration domain into the hardware space-time domain. It can be expressed as a matrix multiplication:

STT · [i, j, k]^T = [x, y, t]^T

loop nest:
for (i = 0; i < 3; i++)
  for (j = 0; j < 3; j++)
    for (k = 0; k < 3; k++)
      instance [i,j,k]: Y[i][j] += A[i][k] * B[k][j];

Example:

      [1 0 0]         [1]   [1]
STT = [0 1 0]   STT · [2] = [2]
      [1 1 1]         [2]   [5]

Instance [i, j, k] = [1, 2, 2] executes on PE(1, 2) at cycle 5.

[Figure: 3×3 PE array (PE0,0 … PE2,2) and cycle axis 0–6]
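As a sanity check, the space-time transformation is just a 3×3 integer matrix-vector product. A minimal Python sketch of the slide's example (the helper name apply_stt is illustrative; Tensorlib itself is written in Chisel/Scala):

```python
# The space-time transformation is a plain 3x3 integer matrix-vector
# product. STT and the example instance come from the slide; the helper
# name apply_stt is illustrative, not Tensorlib's API.

STT = [[1, 0, 0],
       [0, 1, 0],
       [1, 1, 1]]

def apply_stt(stt, instance):
    """Map a loop instance [i, j, k] to space-time coordinates [x, y, t]."""
    return [sum(row[c] * instance[c] for c in range(3)) for row in stt]

# Instance [1, 2, 2] executes on PE(1, 2) at cycle 1 + 2 + 2 = 5:
print(apply_stt(STT, [1, 2, 2]))  # [1, 2, 5]
```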

12

Page 13: Tensorlib: A Spatial Accelerator Generation Framework for

Formulating Hardware-Data Relation

[Figure: 2×2 PE array snapshots at t = 1, 2, 3, showing which element of A — and which instance [i, j, k] — each PE holds at each cycle]

Step 1: space-time domain -> instance domain, via the inverse Space-Time Transformation:

STT^-1 · [x, y, t]^T = [i, j, k]^T

Step 2: instance domain -> tensor index domain, via the tensor access formula (access matrix T):

T · [i, j, k]^T = [m, n]^T

Combine the two steps:

T · STT^-1 · [x, y, t]^T = [m, n]^T
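The two-step composition can be sketched numerically. The matrices below are hand-derived for the GEMM example (STT = [[1,0,0],[0,1,0],[1,1,1]], tensor A accessed as A[i][k]); all names are illustrative, not Tensorlib's API:

```python
# Composing the two steps: STT^-1 recovers the instance executing at
# (x, y, t), then A's access matrix T_A selects the indices [i, k] that
# A[i][k] reads. Both matrices are hand-derived for the GEMM example.

STT_INV = [[1, 0, 0],    # inverse of STT = [[1,0,0],[0,1,0],[1,1,1]]
           [0, 1, 0],
           [-1, -1, 1]]

T_A = [[1, 0, 0],        # A[i, k]: keep i and k, drop j
       [0, 0, 1]]

def matvec(m, v):
    return [sum(row[c] * v[c] for c in range(len(v))) for row in m]

def a_index_at(x, y, t):
    """Which element of A does PE (x, y) hold at cycle t?"""
    return matvec(T_A, matvec(STT_INV, [x, y, t]))

# Instance [1, 2, 2] runs at (x, y, t) = (1, 2, 5) and reads A[1][2]:
print(a_index_at(1, 2, 5))  # [1, 2]
```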

13

Page 14: Tensorlib: A Spatial Accelerator Generation Framework for

x

y t

x

y t

Reuse Space and PE-Memory Interconnection

The reuse subspace in the (x, y, t) domain determines the PE-memory interconnection:

• 0-D: no data reuse; each PE reads data once from a memory bank

• 1-D: reuse subspace forms a line in space-time

• 2-D: reuse subspace forms a plane

[Figure: 0-D, 1-D, and 2-D reuse subspaces in the (x, y, t) domain and the corresponding PE-memory interconnections]
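Reuse directions can be found mechanically: they are the nonzero vectors r in the (x, y, t) domain with (T · STT^-1) · r = 0, i.e. moving along r does not change the accessed tensor index. A brute-force Python sketch for tensor A of the GEMM example (the matrix M is hand-derived; names are illustrative, not Tensorlib's API):

```python
# A reuse vector r in the (x, y, t) domain satisfies (T · STT^-1) · r = 0.
# M below is T_A · STT^-1 for the GEMM example with
# STT = [[1,0,0],[0,1,0],[1,1,1]] and A accessed as A[i][k].

M = [[1, 0, 0],    # A's first index  m = x
     [-1, -1, 1]]  # A's second index n = t - x - y

def matvec(m, v):
    return [sum(row[c] * v[c] for c in range(3)) for row in m]

def reuse_vectors(m, bound=1):
    """All nonzero integer vectors in [-bound, bound]^3 annihilated by m."""
    vecs = []
    for x in range(-bound, bound + 1):
        for y in range(-bound, bound + 1):
            for t in range(-bound, bound + 1):
                v = [x, y, t]
                if any(v) and matvec(m, v) == [0, 0]:
                    vecs.append(v)
    return vecs

# A's 1-D reuse subspace is spanned by (0, 1, 1): move one PE in y per
# cycle, i.e. a systolic dataflow with a one-cycle delay.
print(reuse_vectors(M))  # [[0, -1, -1], [0, 1, 1]]
```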

14

Page 15: Tensorlib: A Spatial Accelerator Generation Framework for

1-D Reuse Type Identification

A 1-D reuse type forms a vector in the space-time domain, written with a space part p and a time part t: reuse direction (x, y, t) = (p, t).

Reuse vector   p = 0, t ≠ 0          p ≠ 0, t ≠ 0            p ≠ 0, t = 0
PE behavior    Data stays in the     Data is delayed         Data arrives at
               same PE during the    before being sent       the PEs at the
               execution stage       to the next PE          same time
Dataflow type  Stationary            Systolic                Multicast
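The table above is a three-way case split on the reuse vector's space part p and time part t, which a few lines of Python make explicit (the function name is illustrative, not Tensorlib's API):

```python
# Classify a 1-D reuse vector (x, y, t) by its space part p = (x, y) and
# time part t, following the table on this slide. Name is illustrative.

def classify_reuse(x, y, t):
    p_nonzero = (x, y) != (0, 0)
    if not p_nonzero and t != 0:
        return "stationary"   # data stays in the same PE across cycles
    if p_nonzero and t != 0:
        return "systolic"     # data hops to the next PE after a delay
    if p_nonzero and t == 0:
        return "multicast"    # data reaches a line of PEs the same cycle
    return "invalid"          # zero vector: no reuse direction

print(classify_reuse(0, 0, 1))  # stationary
print(classify_reuse(0, 1, 1))  # systolic (tensor A in the GEMM example)
print(classify_reuse(1, 0, 0))  # multicast
```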

15

Page 16: Tensorlib: A Spatial Accelerator Generation Framework for

Hardware Generation

16

Page 17: Tensorlib: A Spatial Accelerator Generation Framework for

Systolic Dataflow

Reuse vector (p, t) with p ≠ 0, t ≠ 0

• Transfer data from PE x to PE x + p after delaying t cycles

• Use a shift register to delay input data in each PE

[Figure: compute cell with a shift register on the input path (PE dataflow for input tensor) and on the output path (PE dataflow for output tensor)]

17

Page 18: Tensorlib: A Spatial Accelerator Generation Framework for

Multicast Dataflow

Reuse vector (p, t) with p ≠ 0, t = 0

• Transfer data to the PEs on the same line simultaneously

• Send data directly to/from the compute cell

• Use a reduction tree for output communication

[Figure: compute cell wired directly to the input (PE dataflow for input tensor) and feeding a reduction tree (PE dataflow for output tensor)]

18

Page 19: Tensorlib: A Spatial Accelerator Generation Framework for

Stationary Dataflow

Reuse vector (p, t) with p = 0, t ≠ 0

• Data stays stationary inside the PE during execution

• Requires a systolic dataflow to move data into/out of the PE array

• Uses double buffering to interleave the two dataflows

[Figure: compute cell with a double-buffered register pair on the input path (PE dataflow for input tensor) and on the output path (PE dataflow for output tensor)]

19

Page 20: Tensorlib: A Spatial Accelerator Generation Framework for

Building PE Inner Architecture

Six PE dataflow modules — an input and an output variant for each reuse type: Systolic (a, b), Stationary (c, d), Multicast (e, f).

[Figure: register/compute-cell structure of the six dataflow modules]

PE basic structure: a compute cell with three tensor ports (in_A/out_A, in_B/out_B, in_C/out_C), each marked "?". To obtain the PE dataflow structure, substitute each "?" module with one of the six dataflow modules, based on the dataflow of the corresponding tensor.
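The substitution step amounts to a lookup from (reuse type, port direction) to a dataflow module, one per tensor port. The module names below paraphrase the slides and are not Tensorlib's actual identifiers:

```python
# Sketch of the substitution step: each of the three tensor ports in the
# PE template picks one of the six dataflow modules. The table paraphrases
# the slides; all names are illustrative, not Tensorlib's API.

MODULES = {
    ("systolic", "in"): "shift_reg_in",        # delay, then forward
    ("systolic", "out"): "shift_reg_out",
    ("multicast", "in"): "direct_wire",        # straight to the compute cell
    ("multicast", "out"): "reduction_tree",
    ("stationary", "in"): "double_buffer_in",  # interleave load and compute
    ("stationary", "out"): "double_buffer_out",
}

def build_pe(dataflows):
    """dataflows: {tensor: (reuse_type, direction)} -> chosen modules."""
    return {tensor: MODULES[df] for tensor, df in dataflows.items()}

# Output-stationary GEMM: A and B flow systolically, C stays in the PE.
pe = build_pe({"A": ("systolic", "in"),
               "B": ("systolic", "in"),
               "C": ("stationary", "out")})
print(pe["C"])  # double_buffer_out
```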

20

Page 21: Tensorlib: A Spatial Accelerator Generation Framework for

Building PE Architecture - Example

Output Stationary Architecture — A: Systolic, B: Systolic, C: Stationary

[Figure: PE with shift registers on the A and B paths (in_A -> out_A, in_B -> out_B) and a stationary double-buffered register pair on the C path around the compute cell]

21

Page 22: Tensorlib: A Spatial Accelerator Generation Framework for

Building PE Architecture - Example

Weight Stationary Architecture — A: Systolic, B: Stationary, C: Systolic

[Figure: PE with shift registers on the A and C paths (in_A -> out_A, in_C -> out_C) and a stationary double-buffered register pair on the B path around the compute cell]

22

Page 23: Tensorlib: A Spatial Accelerator Generation Framework for

Building PE Array Interconnection - Example

[Figure: PE array interconnections — Tensor A: Systolic [0, 1] and Multicast [0, 1]; Tensor B: Systolic [1, 0] and Multicast [1, 0]; output: Reduction Tree [0, 1]]

23

Page 24: Tensorlib: A Spatial Accelerator Generation Framework for

Implementation

Hierarchical bottom-up generation:

• PE Basic Module + PE Dataflow Templates + PE Config -> PE Architecture

• PE Architecture + PE Network Templates + Network Config -> PE Array Architecture

• PE Array Architecture + Memory Templates + Memory Config -> Complete Accelerator

class PEArray(params) {
  val PEs = Array.fill(size)(new PE(pe_config))
  val pe_net = new PENetwork(net_config)
  connect_PE(PEs, pe_net)
  val mem = Array.fill(banks)(new MemBank(mem_config))
  connect_mem(pe_net, mem)
}

24

Page 25: Tensorlib: A Spatial Accelerator Generation Framework for

Experiments and Evaluation

25

Page 26: Tensorlib: A Spatial Accelerator Generation Framework for

Dataflow Performance Evaluation

• Evaluated 6 tensor applications

• Included systolic, multicast, stationary, unicast, and broadcast dataflows

• Performance varies across applications and dataflows because of dataflow mapping and bandwidth

[Figure: performance variance across applications and across dataflows]

26

Page 27: Tensorlib: A Spatial Accelerator Generation Framework for

FPGA Performance Results

                    Susy [ICCAD'20]   PolySA [ICCAD'18]   Tensorlib
                    Arria-10          Xilinx VU9P         Xilinx VU9P
Application         GEMM    Conv      GEMM    Conv        GEMM    Conv
LUT (%)             40      35        49      49          68      75
DSP (%)             93      84        89      89          75      75
BRAM (%)            32      30        89      71          51      73
Freq (MHz)          202     220       229     229         263     245
Throughput (Gop/s)  547     551       555     548         673     626

21% frequency improvement, 15% throughput improvement

27

Page 28: Tensorlib: A Spatial Accelerator Generation Framework for

Summary

• Tensorlib: A Spatial Accelerator Generation Framework for Tensor Algebra

• Dataflow Representation: analysis based on space-time transformation

• Hardware Generation: hierarchical bottom-up generation

• Open source: https://github.com/pku-liang/tensorlib

28

Page 29: Tensorlib: A Spatial Accelerator Generation Framework for

Framework Tutorial: Tensorlib API

Tensorlib provides an API to define the computation:

Tensorlib API          Description
genIterators(x)        Define x iterators and return the iterator list
genTensors(x)          Define x tensors and return the tensor list
setExpr(expr)          Define the compute expression with tensors and iterators
iterator.setRange(x)   Set the range of the iterator to x
Tensor.setWidth(x)     Set the tensor data bitwidth to x
setLatency(x)          Define the latency of the computation pipeline
useCustomKernel        Whether to use a customized IP for the computation cell

29

Page 30: Tensorlib: A Spatial Accelerator Generation Framework for

Framework Tutorial: GEMM

Example: GEMM

Dataflow: output-stationary systolic array

for (i = 0; i < 3; i++)
  for (j = 0; j < 3; j++)
    for (k = 0; k < 3; k++)
      C(i)(j) += A(i)(k) * B(k)(j)

30

Page 31: Tensorlib: A Spatial Accelerator Generation Framework for

Framework Tutorial: Example

Step 1: Workload Definition

val opSpec_Gemm = new OperatorSpec {
  val i :: j :: k :: Nil = genIterators(3)
  val A :: B :: C :: Nil = genTensors(3)
  setExpr(C(i)(j) += A(i)(k) * B(k)(j))
  i.setRange(8)
  k.setRange(20)
  j.setRange(8)
  setLatency(1)
  A.setWidth(16)
  B.setWidth(16)
  C.setWidth(16)
}

Step 2: Space-Time matrix definition

val stt = DenseMatrix(
  (1, 0, 0),
  (0, 1, 0),
  (1, 1, 1))

Step 3: Generate dataflow and Verilog code

val config = gen_dataflow(opSpec_Gemm, stt)
chisel3.Driver.execute(args, () => new PEArray(config))

31

Page 32: Tensorlib: A Spatial Accelerator Generation Framework for

Framework Tutorial: Using Pre-defined Configs

Running Tensorlib with Pre-defined dataflow configs

Config File Example (Can be generated by HASCO):

sbt "runMain tensorlib.ParseJson examples/gemm.json output/output.v"

{
  "benchmark": "GEMM",
  "bitwidth": 16,
  "length": [8, 8, 8],
  "STT": [[1, 0, 0], [0, 0, 1], [1, 1, 1]]
}

32
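A quick Python sketch of loading and sanity-checking a config of this shape (the validation logic is illustrative, not part of Tensorlib):

```python
# Load a pre-defined dataflow config of the shape shown on this slide.
# Only the four fields from the slide are assumed; the checks below are
# illustrative, not Tensorlib's own validation.
import json

text = '''
{"benchmark": "GEMM",
 "bitwidth": 16,
 "length": [8, 8, 8],
 "STT": [[1, 0, 0], [0, 0, 1], [1, 1, 1]]}
'''

config = json.loads(text)
assert config["benchmark"] == "GEMM"
# The STT must be a 3x3 matrix to map [i, j, k] onto [x, y, t]:
assert len(config["STT"]) == 3 and all(len(row) == 3 for row in config["STT"])
print(config["length"])  # [8, 8, 8]
```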

Page 33: Tensorlib: A Spatial Accelerator Generation Framework for

Framework Tutorial: Customized Kernels

Default Kernel: Integer Multiply-Add

Can be substituted with user-provided IPs

33

Step 1: Enable the customized kernel in the computation definition

val opSpec = new OperatorSpec {
  ...
  useCustomKernel(true)
  ...
}

Step 2: Provide the customized computation kernel & reduction kernel

module ComputeCell_Customized(
  input clock,
  input reset,
  input [15:0] io_data_2_in,
  input [15:0] io_data_1_in,
  input [15:0] io_data_0_in,
  output [15:0] io_data_0_out
);

module Reduction_Customized(
  input [15:0] io_in_a,
  input [15:0] io_in_b,
  output [15:0] io_out
);

Page 34: Tensorlib: A Spatial Accelerator Generation Framework for

Result Validation

Chisel Tester Validation

Runs a validation program for the Tensorlib-generated GEMM accelerator, based on the Chisel cycle-level simulator

sbt "runMain tensorlib.Test_Gemm"

34