Chapter 7 Systolic Arrays Systolic Architecture - York University

1

YORK UNIVERSITY CSE4210

Chapter 7Systolic Arrays

Mokhtar AboelazeCSE4210 Winter 2012


Systolic Architecture

• A number of usually similar processing elements connected together to implement a specific algorithm.

• Data move between PE’s in a rhythmic fashion.

PE PE PE

Processor

. . .

2


Systolic Architecture• Typically, fully pipelined (all communication

between PE’s contain delay element (why?). Also communication between neighboring PE’s only.

• Some relaxation techniques can get rid of the delay. Also, there may be communication between close bet not neighboring PE’s

• Some processors (especially boundary ones may be different than the rest.

• Could be used as a coprocessor


Design Methodology• Using linear mapping techniques from the

dependence space to the space-time• Usually, algorithm is described by a dependence

graph.• Dependence graph is regular if the presence of

any edge connected to a node, means the existence of a similar edge in every node.

• There is no concept of time in the dependence graph.

3


FIR Filter

x0 x1 x2 x3

ω2

ω1

ω0

• Y(n)=ω0 x(n)+ ω1 x(n-1) + ω2 x(n-2)

• Data is moving in three directions

• X in [0 1]T

• ω in [1 0]T

• Y in [1 -1]T


Design Methodology• We map the N-dimensional DG to a lower

dimension systolic architecture• Three vectors are introduced• Projection vector d=[d1 d2]T• Processor space vector PT = [p1 p2]• Scheduling vector ST=[S1 S2]

• Hardware Utilization Efficiency =1/|STd|

4


Design Methodology• Projection vector

– Two nodes are displaced by d or multiple of it, are mapped to the same processor

• Processor space vector– Any node in the DG I (IT=(i,j)) is mapped to processor

PTI• Scheduling vector

– Any node in the DG I would be executed at time STI• Subject to some constraints


Design Methodology• Steps

– Represent algorithm as a DG– Apply mapping (projection and scheduling)– Edge mapping

• If an edge e exists in the DG, then an edge PTe is introduced in the systolic array with STe delay

– Construct the systolic array

5


Design Methodology• Constraints

– Processor space vector and the projection vector must be orthogonal PTD=0. if IA-IB = multiple of d, they are executed by the same processor

– If A and B are mapped to the same processor, they should not be executed at the same time STIA ≠ STIB i.e. STd≠0


Example -- IIR• Y(n)=ω0 x(n)+ ω1 x(n-1) + ω2 x(n-2)

y0 y1 y2 y3

x0 x1 x2 x3

ω2

ω1

ω0

Single assignment format with broadcasting data:

Do n=1,2, . . .y1(n,-1)=0Do k=0,Ky1(n,k)=y1(n,k-1)

+w(k)*x(n-k)enddoy(n)=y1(n,K)

Enddo

6


Design I[ ] [ ]0110

01

==⎥⎦

⎤⎢⎣

⎡= TT SPd

1-1Y(1 -1)

01X(0 1)

10ω(1 0)

STePTeEdge eω ω ω

yx

D D

If an edge e, then an edge PTeis introduced in the array with delay STe

Weights stay, broadcast input, move results


Design Ix0 x1 x2 x3

ω2

ω1

ω0

time0 1 2 3

1

2

3pr

oces

sor

Point I is executed in PE PTI at time STI

Point (i,j) is executed at PE j at time i

D ⊗⊕

ω

D ⊗⊕

ω

D ⊗⊕

ω

DD

7


⊗ ⊗ ⊗

⊕ ⊕ ⊕DD

DDDω2ω1ω0

X3

⊗ ⊗ ⊗

⊕ ⊕ ⊕DD

DDDω2ω1ω0

X4

⊗ ⊗ ⊗

⊕ ⊕ ⊕DD

DDDω2ω1ω0

X5

?

?

?


Design II[ ] [ ]0111

11

==⎥⎦

⎤⎢⎣

⎡−

= TT SPd

10Y(1 -1)

01X(0 1)

11W(1 0)

STePTeEdge e

ω D D

X[n]

Y[n]

Point I is executed in PE PTI at time STI

Point (i,j) is executed at PE i+j at time i

Broadcast input, move weights, result stay

8


Design II


Design II• How can you square the previous design

with.• Point (i,j) is executed at PE i+j at time i

x0 x1 x2 x3

ω2

ω1

ω00,0

x,y means at time x in processor y

1,0

2,0

0,1

2,1

3,1

2,2

3,2

4,2 5,3

4,3

3,3

9


Design III[ ] [ ]1110

01

==⎥⎦

⎤⎢⎣

⎡= TT SPd

0-1Y(1 -1)

11X(0 1)

10W(1 0)

STePTeEdge e

Weights stay, move input, fan in output


Design III

10


Design IV[ ] [ ]1111

11

−==⎥⎦

⎤⎢⎣

⎡−

= TT SPd

20Y(1 -1)

1-1X(0 1)

11W(1 0)

STePTeEdge e


Design IV

D

11


Design V[ ] [ ]1211

11

==⎥⎦

⎤⎢⎣

⎡−

= TT SPd

10Y(1 -1)

11X(0 1)

21W(1 0)

STePTeEdge e


Design V

D

D

D D

D

⊗ ⊕D

D

D

12


Dual

[ ] [ ]21111

1==⎥

⎦

⎤⎢⎣

⎡−

= TT SPd

10Y(1 -1)

21X(0 1)

11W(1 0)

STePTeEdge e

Dual of the previous design.

X and w are exchanged


Design VI[ ] [ ]1110

01

−==⎥⎦

⎤⎢⎣

⎡= TT SPd

2-1Y(1 -1)

1-1X(0 1)

10W(1 0)

STePTeEdge e

PE0 PE1 PE0x D D

2D 2D

13


Dual


Transformation

14


Scheduling Vector• Consider the dependence X Y• Y can start after X has started and

completed.• We also have to take into consideration the

time it will take the data to travel from X to Y• Constraints on the scheduling vector.


yy

yy

Ty

xx

xx

Tx

y

yy

Ty

x

xx

Tx

xxy

y

yY

x

xx

Ji

ssISS

Ji

ssISS

Ji

ssISS

Ji

ssISS

TSS

Ji

IYJi

IX

γ

γ

+⎟⎟⎠

⎞⎜⎜⎝

⎛==

+⎟⎟⎠

⎞⎜⎜⎝

⎛==

⎟⎟⎠

⎞⎜⎜⎝

⎛==

⎟⎟⎠

⎞⎜⎜⎝

⎛==

+≥

⎟⎟⎠

⎞⎜⎜⎝

⎛=→⎟⎟

⎠

⎞⎜⎜⎝

⎛=

)(

)(

)(

)(

::

21

21

21

21Linear

scheduling

Affine scheduling

15


xxyyxT

xxxT

yyT

TeS

TISIS

≥−+

++≥+

→ γγ

γγ

edgean for inequality scheduling The

,schedulingaffine Using

Assume that ex y=Iy – Ix


Scheduling Vector• Capture all the fundamental edges

(Reduced Dependence Graph RDG).• Use the Regular Iterative Algorithm (RIA)

to describe the problem.• Construct the scheduling inequalities and

solve them for a possible ST

16


RIA Description• The regular iterative algorithm has two

standard forms• Standard Input if the index of the inputs

are all the same for all equations• Standard Output if the index of the output

are all the same for all equations


RIA Description• W(i+1,j) = W(i,j)• X(i,j+1) = X(i,j)• Y(i+1,j-1) = Y(i,j)+W(i+1,j-1)X(i+1,j-1)x0 x1 x2 x3

ω2

ω1

ω0

Not RIA Indices are

not the same

17


RIA Description• W(i,j) = W(i-1,j)• X(i,j) = X(i,j-1)• Y(i,j) = Y(i-1,j+1)+W(i,j)X(i,j)x0 x1 x2 x3

ω2

ω1

ω0

RIA


RIA Description

),(),()1,1(),()1,(),(),1(),(

jiXjiWjiYjiYjiXjiX

jiWjiW

++−=−=

−=

Reduced RIA graph for the FIR filter

18


[ ]

[ ]

[ ]

[ ]

[ ] 125,1

1:

0,00

:

1,01

:

1,10

:

0,00

:

2121

21

121

221

21

++≥−+−⎟⎟⎠

⎞⎜⎜⎝

⎛−

=→

≥−⎟⎟⎠

⎞⎜⎜⎝

⎛=→

≥−+⎟⎟⎠

⎞⎜⎜⎝

⎛=→

≥−+⎟⎟⎠

⎞⎜⎜⎝

⎛=→

≥−⎟⎟⎠

⎞⎜⎜⎝

⎛=→

yy

xy

ww

xx

wy

sssseYY

sseYX

ssseWW

ssseXX

sseYW

γγ

γγ

γγ

γγ

γγ

xxyyxT Tes ≥−+→ γγ Tmul=5,Tad=2

Tcomm=1


FIR• Solving the set of equation assuming all γ’s to be zero.

• A possible solution is s=[9 1]• A possible selection for d=[1,-1] and p =

[1 1]• sTd=8,HUE =1/ 8 eT PTe STe

W(1,0) 1 9

X(0,1) 1 1

Y(1,-1) 0 8

19


FIRDD

9D 9D

X

W

8D8D8D


Matrix Multiplication

2222122122

2122112121

2212121112

2112111111

2221

1211

2221

1211

2221

1211

babacbabacbabacbabac

bbbb

aaaa

cccc

+=+=+=+=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=⎟⎟

⎠

⎞⎜⎜⎝

⎛

20





21



Taccu = 1 T com = 0


Matrix Multiplication[ ] [ ]

⎥⎦

⎤⎢⎣

⎡=

==

010001

100111

p

dS TT

22


Solution 2

HUE = 1

[ ] [ ] ⎥⎦

⎤⎢⎣

⎡=−==

110101

,111111 pdS TT


2,2 3,2 4,2

2,3 3,3 4,3

2,4 3,4 4,4

0,0 1,0

0,1 1,1 2,1 3,1 4,1

2,0 3,0 4,0

0,2 1,2

0,3 1,3

0,4 1,4

a20 a21 a22a10 a11 a12

a00 a01 a02

b00

b01 b10

b02 b11 b20

b12 b21

b22

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

DD

D

D DD

DD

D DD

D D

D D D

D

D

D

D

D

D

DD

D

D

D

D

D D

DD

23


Solution 3

( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛ −=

⎟⎟⎟

⎠

⎞

⎜⎜⎜

⎝

⎛==

010101

,101

,111 TT PdS


Solution 4

( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛−−

=⎟⎟⎟

⎠

⎞

⎜⎜⎜

⎝

⎛==

110101

,111

,111 TT PdS

24


Solution 5

( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

⎟⎟⎟

⎠

⎞

⎜⎜⎜

⎝

⎛

−==

001110

,1

10

,121 TT PdS

Vector PTe sTe

a(0,1,0) 1,0 2

b(1,0,0) 0,1 1

c(0,0,1) 1,0 1


0,0 1,0 2,0 3,0 4,0

0,1 1,1 2,1 3,1 4,1

0,2 1,2 2,2 3,2 4,2

a00b00

a01

b10

a02

b22

b01

b11b21b02

b12

a10 a11

b20

a12

a20 a21 a22

Assuming one delay element in every PE

25


Solution 6

( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛−−−

=⎟⎟⎟

⎠

⎞

⎜⎜⎜

⎝

⎛==

110111

,112

,111 TT PdS


Solution 7

( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛−

=⎟⎟⎟

⎠

⎞

⎜⎜⎜

⎝

⎛−−

==011011

,211

,121 TT PdS

26


Solution 4

Documents

Chapter 7 Systolic Arrays Systolic Architecture - York University