59
1 Systolic Architecture Design Shao-Yi Chien Some materials are referred to S. Y. Kung, VLSI Array Processors, Prentice-Hall, 1988.

Systolic Architecture Design - 國立臺灣大學

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Systolic Architecture Design - 國立臺灣大學

1

Systolic Architecture

Design

Shao-Yi Chien

Some materials are referred to

S. Y. Kung, VLSI Array Processors, Prentice-Hall, 1988.

Page 2: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 2

Introduction (1/3)

Systolic architecture (systolic array)

A network of processing elements (PEs) that rhythmically compute and pass data through the system

Modularity and regularity

All the PEs in the systolic array are uniform and fully pipelined

Contains only local interconnection

Page 3: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 3

Introduction (2/3)

Typical systolic array

Page 4: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 4

Introduction (3/3)

Some relaxations

Not only local but also neighbor

interconnections

Use of data broadcast operations

Use of different PEs in the system, especially

at the boundaries

Also called as “semi-systolic array”

Page 5: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 5

Systematic

Design

Methodology

Step 1 DG design

Step 2 Mapping to

DFG

Step 3 VLSI array

design

Page 6: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 6

Generalized Mapping

Page 7: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 7

Canonic Mapping Methodology

Applying linear mapping on regular dependence graph (DG)

The DG corresponds to a space representation where no time instance is assigned to any computation

Maps N-dimensional DG to a lower level dimensional systolic architecture

We first introduce N-dimension to (N-1)-dimension mapping

Page 8: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 8

Step 1: DG Design

Purpose: use graph to represent an

algorithm with parallel expression

Avoid unnecessary ordering information

introduced by sequential code

The important first step of systolic array

design

Page 9: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 9

DG Design

Write recursive form of an algorithm

Transform it as Single Assignment Code

Draw DG

Localized DG

Page 10: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 10

Example: Matrix-Vector

Multiplication (1/7)

Write recursive form of an algorithm

m

j

jiji bAc

Abc

1

for(i=1;i<=4;i++)

{

c(i)=0;

for(j=1;j<=4;j++)

{

c(i)=c(i)+A(i,j)*b(j);

}

}

Page 11: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 11

Example: Matrix-Vector

Multiplication (2/7)

Transform it as Single Assignment Code

A Single Assignment Code is a form where

every variable is assigned one value only

during the execution of the algorithm for(i=1;i<=4;i++)

{

c(i,1)=0;

for(j=1;j<=4;j++)

{

c(i,j+1)=c(i,j)+A(i,j)*b(j);

}

}

Page 12: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 12

Example: Matrix-Vector

Multiplication (3/7)

A recursive algorithm is inherently given in a

single assignment code

)(

),(

0

)(

)(

)1(

)()()()1(

jbb

jiAa

c

bacc

Abc

j

i

j

i

i

j

i

j

i

j

i

j

i

Broadcast Signal

Page 13: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 13

Example: Matrix-Vector

Multiplication (4/7)

Draw DG

A DG can be considered as the graphical

representation of a single assignment

algorithm

DG specifies all the dependencies between all

variables in the index space

An algorithm is computable if and only if its

complete DG contains no loops or cycles

Page 14: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 14

Example: Matrix-Vector

Multiplication (5/7)

for(i=1;i<=4;i++)

{

c(i,1)=0;

for(j=1;j<=4;j++)

{

c(i,j+1)=c(i,j)+A(i,j)*b(j);

}

}

)(

),(

0

)(

)(

)1(

)()()()1(

jbb

jiAa

c

bacc

Abc

j

i

j

i

i

j

j

j

i

j

i

j

i

Page 15: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 15

Example: Matrix-Vector

Multiplication (6/7)

Localized DG

Use transmittent data to replace broadcast

data

A locally recursive algorithm is an algorithm

whose corresponding DG has only local

dependencies

The length of each dependency arc is independent

of the problem size

Page 16: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 16

Example: Matrix-Vector

Multiplication (7/7)

b(1,1)=B(1);

b(1,2)=B(2);

b(1,3)=B(3);

b(1,4)=B(4);

for(i=1;i<=4;i++)

{

c(i,1)=0;

for(j=1;j<=4;j++)

{

b(i+1, j)=b(i,j);

c(i,j+1)=c(i,j)+A(i,j)*b(i, j);

}

}

Page 17: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 17

Example: Sorting

for(i=1;i<=N;i++)

for(j=1;j<=i;j++)

{

m(i+1, j)=max(x(i,j), m(i,j));

x(i, j+1 )=min(x(i,j), m(i,j));

}

Initialization

x(i,1)=x(i), the original sequence

m(i,i)=

Output of the sorter m(j)=m(N,j)

Input

Ou

tpu

t

Page 18: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 18

Example: Convolution

kjk

k

j

k

j

k

kjkj

wuyy

wuy

1

3

0

Localized DG

Page 19: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 19

Example: Autoregressive Filter

(AR Filter)

)3()2()1()0()4()4(

)()()(

1234

1

yayayayauy

jukjyajyN

k

k

Page 20: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 20

Example: Autoregressive Filter

(AR Filter)

Localized DG

Page 21: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 21

Example: FIR Filter

y(n)=w0x(n)+w1x(n-1)+w2x(n-2)

Input: edges in [i j]T=[0 1]T

Coefficient: edges in [1 0]T

Output: edges in [1 –1]T

Input X

Input X

Coefficient w Coefficient w

Output y'

Output y

y=y'+wX

Page 22: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 22

Step 2: Mapping to DFG (1/4)

Projection vector (iteration vector)

Two nodes that are displaced by d or

multiples of d are executed by the same

processor

Processor space vector

Any node with index IT=(i,j) would be executed

by processor Spatial domain

mapping vector

Page 23: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 23

Step 2: Mapping to DFG (2/4)

Scheduling vector

Any node with index I would be executed at

time sTI

Hardware utilization efficiency

Two tasks executed by the same processor are

spaced |sTd| time units apart

Edge mapping

For an edge e in the DG, an edge pTe is

introduced in the systolic array with sTe delays

Temporal domain

mapping vector

Page 24: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 24

Step 2: Mapping to DFG (3/4)

Constraints

Processor space vector p and the project

vector d must be orthogonal to each other

If A and B are mapped to the same processor,

then they cannot be executed at the same

time

sTIB>sTIA sTd>0

Page 25: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 25

Step 2: Mapping to DFG (4/4)

Space-time transformation

=0

=0

Page 26: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 26

FIR Systolic Arrays – B1 (1/5)

Design B1 (broadcast input, move result,

weight stay)

d

p

s

Page 27: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 27

FIR Systolic Arrays – B1 (2/5)

Node IT(i,j) is mapped to processor

Node IT(I,j) is executed at time

HUE

Page 28: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 28

FIR Systolic Arrays – B1 (3/5)

Edge mapping

d

p

s

D

D

w

x y

D

D

D

D

D

0

Page 29: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 29

FIR Systolic Arrays – B1 (4/5)

Page 30: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 30

FIR Systolic Arrays – B1 (5/5)

Constraints

00

110

dpT

010

101

dsT

d

p

s

Page 31: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 31

FIR Systolic Arrays – B2 (1/4)

Design B2 (broadcast input, move weight,

result stay)

d

p

s

Page 32: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 32

FIR Systolic Arrays – B2 (2/4)

Space-time mapping

Page 33: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 33

FIR Systolic Arrays – B2 (3/4)

d

p

s

D

D

D

D

D

D

D

D

......

wx

y

D

D

D

D

D

D

D

Page 34: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 34

FIR Systolic Arrays – B2 (4/4)

Page 35: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 35

FIR Systolic Arrays – F (1/3)

Design F (fan-in results, move inputs,

weight stay)

d

p

s

Page 36: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 36

FIR Systolic Arrays – F (2/3)

Space-time mapping

Page 37: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 37

FIR Systolic Arrays – F (3/3)

Page 38: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 38

Relationship to Other

Transformations (1/2)

Systolic array architectures with same

project vector and processor space vector,

but different scheduling vectors can be

derived from other transformations

Edge reversal, associativity, slow-down,

retiming, and pipelining

Page 39: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 39

Relationship to Other

Transformations (2/2)

B1

F

Cutset retiming

Page 40: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 40

Example: Matrix-Vector

Multiplication

dT=[1 0]T, pT=[0 1]T, sT=[1 1]T

Page 41: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 41

Example: Sorting

Insertion Sorting

Selection Sorting Bubble Sorting

Default scheduling

d||s

Page 42: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 42

Matrix-Matrix Multiplication (1/3)

How about more-than-two dimensional DG?

See this example: matrix-matrix

multiplication

C=AB, C, A, B are nxn matrices

For n=2

Page 43: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 43

Matrix-Matrix Multiplication (2/3)

Solution 1:

Solution 2:

Edge mapping:

0 1 0

0 0 1 ,

1

0

0

,)1,1,1( TT pds

1 1 0

1 0 1 ,

1

1

1

,)1,1,1( TT pds

Page 44: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 44

Matrix-Matrix Multiplication (3/3)

Solution 1 Solution 2

Page 45: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 45

Selection of Schedule Vector

Choose the schedule vector s first, then d and p can be selected according to

Procedure to get sT

Capture all the fundamental edges in reduced dependence graph (RDG) constructed by regular iteration algorithm (RIA) description

Construct the scheduling inequalities

Solve feasible sT (sTd=1 or sTd=iteration bound is optimal)

0 ,0 dsdp TT

Page 46: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 46

Scheduling Inequalities (1/2)

For the edge XY

The scheduling inequality is

Sx: scheduling time for note X

Sy: scheduling time for node Y

Tx: computation time of node X

Page 47: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 47

Scheduling Inequalities (2/2)

Linear scheduling

Affine scheduling

So the scheduling inequalities:

Page 48: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 48

RDG from RIA (1/2)

For the FIR filter

Standard output RIA form

Page 49: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 49

RDG from RIA (2/2)

RDG

Page 50: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 50

Scheduling Vector Selection

Assume

Set

One solution:

Page 51: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 51

Systolic Architecture Mapping

Then also choose d=(1,-1), pT=(1, 1)

Page 52: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 52

Stage 3: VLSI Array Design (1/2)

Lots of choices

Single Instruction Multiple Data (SIMD)

stream array

Systolic array

Wavefront array

SFG array

Page 53: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 53

Stage 3: VLSI Array Design (2/2)

SIMD Array Systolic Array

DFG Array

Page 54: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 54

Multiprojection

Systolic Design for Space Representations

with Delays

Can be used for multilevel systolic mapping

Define N’: number of nodes mapped to a

processor

Iteration period: N’|sTd|

l-th iteration

(l+w)-th iteration

Page 55: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 55

Scheduling Inequality and

Systolic Transformation

Scheduling inequality

All the mapping equations are the same

except for the edge delay mapping

||' dswNeses TTT

Page 56: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 56

Example of DG with Delays (1/3)

Page 57: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 57

Example of DG with Delays (2/3)

Assume N’=4

sT=(1 1), pT=(1 0)

Page 58: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 58

Example of DG with Delays (3/3)

1-D systolic array

Page 59: Systolic Architecture Design - 國立臺灣大學

DSP in VLSI Design Shao-Yi Chien 59

Remark

The performance of the resulting array is

affected by

The choice of a particular DG for an algorithm

The direction of the projection and the

schedule vectors