
Chapter 7 Systolic Arrays Systolic Architecture - York University


YORK UNIVERSITY CSE4210

Chapter 7: Systolic Arrays

Mokhtar Aboelaze, CSE4210, Winter 2012

Systolic Architecture

• A number of usually similar processing elements connected together to implement a specific algorithm.

• Data move between PEs in a rhythmic fashion.

[Figure: a processor connected to a chain of processing elements: PE, PE, PE, ...]


Systolic Architecture

• Typically fully pipelined: every communication link between PEs contains a delay element (why?). Communication is also restricted to neighboring PEs.

• Some relaxation techniques can remove the delay. There may also be communication between PEs that are close but not adjacent.

• Some processors (especially the boundary ones) may differ from the rest.

• Could be used as a coprocessor.

Design Methodology

• Uses linear mapping techniques from the dependence space to the space-time domain.

• The algorithm is usually described by a dependence graph (DG).

• A dependence graph is regular if the presence of an edge at any one node implies the existence of a similar edge at every node.

• There is no concept of time in the dependence graph.


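FIR Filter

[Figure: dependence graph of the FIR filter, with inputs x0, x1, x2, x3 along one axis and weights ω0, ω1, ω2 along the other.]

• $y(n) = \omega_0 x(n) + \omega_1 x(n-1) + \omega_2 x(n-2)$

• Data is moving in three directions:

• X in $[0\ 1]^T$

• ω in $[1\ 0]^T$

• Y in $[1\ -1]^T$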
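As a rough illustration of the dependence graph above, the following Python sketch visits every DG node (i, j), forms the product ω_j·x_i there, and accumulates it along the Y direction [1 -1]^T, so that node (i, j) contributes to y(i + j). The function names `fir_from_dg` and `fir_direct`, and the assumption that x(n) = 0 for n < 0, are choices made for this example and do not come from the slides.

```python
def fir_from_dg(w, x):
    """Evaluate y(n) = sum_k w[k] * x[n-k] by visiting DG nodes (i, j)."""
    y = [0] * len(x)
    for i, xi in enumerate(x):        # x_i travels along [0 1]^T (the j axis)
        for j, wj in enumerate(w):    # w_j travels along [1 0]^T (the i axis)
            n = i + j                 # partial sums travel along [1 -1]^T
            if n < len(y):            # ignore outputs past the input length
                y[n] += wj * xi
    return y


def fir_direct(w, x):
    """Reference: direct form, assuming x(n) = 0 for n < 0."""
    return [sum(w[k] * x[n - k] for k in range(len(w)) if n - k >= 0)
            for n in range(len(x))]


w, x = [1, 2, 3], [1, 0, 0, 2]
print(fir_from_dg(w, x))   # same list as fir_direct(w, x)
```

Both functions should agree for any input; the DG version only makes the index pair (i, j) explicit.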

Design Methodology

• We map the N-dimensional DG to a lower-dimensional systolic architecture.

• Three vectors are introduced:

  – Projection vector $d = [d_1\ d_2]^T$

  – Processor space vector $P^T = [p_1\ p_2]$

  – Scheduling vector $S^T = [s_1\ s_2]$

• Hardware utilization efficiency: $HUE = 1/|S^T d|$


Design Methodology

• Projection vector

  – Two nodes displaced by d, or by a multiple of d, are mapped to the same processor.

• Processor space vector

  – Any node I in the DG, with $I^T = (i, j)$, is mapped to processor $P^T I$.

• Scheduling vector

  – Any node I in the DG is executed at time $S^T I$.

• Subject to some constraints.

Design Methodology

• Steps

  – Represent the algorithm as a DG.

  – Apply the mapping (projection and scheduling).

  – Edge mapping: if an edge e exists in the DG, then an edge $P^T e$ is introduced in the systolic array with $S^T e$ delays.

  – Construct the systolic array.
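The edge-mapping step can be sketched in a few lines of Python. This is only an illustration: the helper names `dot` and `map_edges` are made up here, and the example vectors P^T = [0 1], S^T = [1 0] are one possible choice for the FIR dependence graph (edges ω (1 0), X (0 1), Y (1 -1)).

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def map_edges(edges, P, S):
    """Return {name: (P^T e, S^T e)}: array displacement and delay of each DG edge."""
    return {name: (dot(P, e), dot(S, e)) for name, e in edges.items()}


# FIR dependence-graph edges.
fir_edges = {"w": (1, 0), "x": (0, 1), "y": (1, -1)}

# Example mapping with P^T = [0 1], S^T = [1 0].
for name, (move, delay) in map_edges(fir_edges, P=(0, 1), S=(1, 0)).items():
    print(f"{name}: moves {move} PE(s), {delay} delay(s)")
```

A displacement of 0 means the data stays in one PE; a delay of 0 means the data is broadcast.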


Design Methodology

• Constraints

  – The processor space vector and the projection vector must be orthogonal: $P^T d = 0$. If $I_A - I_B$ is a multiple of d, nodes A and B are executed by the same processor.

  – If A and B are mapped to the same processor, they must not be executed at the same time: $S^T I_A \neq S^T I_B$, i.e. $S^T d \neq 0$.
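A minimal sketch of how the two constraints might be checked in code, together with the hardware utilization efficiency 1/|S^T d|. The function name `check_mapping` and the error messages are assumptions made for this example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def check_mapping(d, P, S):
    """Return the HUE, 1/|S^T d|, or raise ValueError if a constraint fails."""
    if dot(P, d) != 0:
        raise ValueError("P^T d != 0: nodes along d would land on different PEs")
    if dot(S, d) == 0:
        raise ValueError("S^T d == 0: nodes on one PE would collide in time")
    return 1 / abs(dot(S, d))


print(check_mapping(d=(1, 0), P=(0, 1), S=(1, 0)))    # 1.0
print(check_mapping(d=(1, -1), P=(1, 1), S=(9, 1)))   # 0.125
```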

Example -- FIR

• $y(n) = \omega_0 x(n) + \omega_1 x(n-1) + \omega_2 x(n-2)$

[Figure: dependence graph with inputs x0 to x3, outputs y0 to y3, and weights ω0 to ω2.]

Single assignment format with broadcasting data:

    Do n = 1, 2, ...
      y1(n,-1) = 0
      Do k = 0, K
        y1(n,k) = y1(n,k-1) + w(k)*x(n-k)
      enddo
      y(n) = y1(n,K)
    Enddo
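The loop above translates almost line for line into Python. In this sketch every entry `y1[(n, k)]` is written exactly once, which is the single-assignment property. Two details are assumptions of the example rather than of the slide: n starts at 0 instead of 1, and samples before the first input are taken to be zero.

```python
def fir_single_assignment(w, x):
    """y1[(n, k)] is written exactly once, mirroring y1(n,k) in the pseudocode."""
    K = len(w) - 1
    y1, y = {}, []
    for n in range(len(x)):
        y1[(n, -1)] = 0
        for k in range(K + 1):
            xnk = x[n - k] if n - k >= 0 else 0   # assumed zero before the first sample
            y1[(n, k)] = y1[(n, k - 1)] + w[k] * xnk
        y.append(y1[(n, K)])
    return y


print(fir_single_assignment([1, 2, 3], [1, 0, 0, 2]))   # [1, 2, 3, 2]
```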


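Design I

$$d = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad P^T = [0\ 1], \quad S^T = [1\ 0]$$

| Edge e | $P^T e$ | $S^T e$ |
|---|---|---|
| ω (1 0) | 0 | 1 |
| X (0 1) | 1 | 0 |
| Y (1 -1) | -1 | 1 |

[Figure: three PEs, each holding a weight ω; x is broadcast and y passes between PEs through delay elements D.]

If an edge e exists in the DG, then an edge $P^T e$ is introduced in the array with delay $S^T e$.

Weights stay, broadcast input, move results.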
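One way to get a feel for Design I is a small cycle-by-cycle model. The sketch below assumes that PE j stores ω_j, that x(t) reaches every PE in cycle t, and that each partial result moves from PE j+1 to PE j through one delay element. The function name `design1` and the register layout are assumptions for illustration, not a description of the slide's hardware.

```python
def design1(w, x):
    """Cycle-by-cycle model: PE j keeps w[j]; results move from PE j+1 to PE j."""
    K = len(w)
    regs = [0] * K            # regs[j]: PE j's output from the previous cycle
    y = []
    for xt in x:              # x(t) is broadcast to every PE in cycle t
        outs = []
        for j in range(K):
            incoming = regs[j + 1] if j + 1 < K else 0   # last PE starts from 0
            outs.append(incoming + w[j] * xt)
        regs = outs           # clock edge: all delay elements update together
        y.append(outs[0])     # y(t) leaves the array at PE 0
    return y


print(design1([1, 2, 3], [1, 0, 0, 2]))   # [1, 2, 3, 2]
```

Unrolling the registers gives y(t) = ω0·x(t) + ω1·x(t-1) + ω2·x(t-2), so the model reproduces the FIR equation.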

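Design I

[Figure: space-time diagram of the DG (inputs x0 to x3, weights ω0 to ω2), with axes labeled time (0 to 3) and processor (1 to 3).]

Point I is executed in PE $P^T I$ at time $S^T I$.

Point (i,j) is executed at PE j at time i.

[Figure: three PEs, each with a stored weight ω, a multiplier ⊗, an adder ⊕, and a delay element D.]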

[Figure: three snapshots of the Design I array (multipliers ⊗, adders ⊕, delay elements D, weights ω0, ω1, ω2) as the inputs x3, x4, and x5 arrive; the values marked "?" are left for the reader to work out.]

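Design II

$$d = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad P^T = [1\ 1], \quad S^T = [1\ 0]$$

| Edge e | $P^T e$ | $S^T e$ |
|---|---|---|
| W (1 0) | 1 | 1 |
| X (0 1) | 1 | 0 |
| Y (1 -1) | 0 | 1 |

[Figure: array with weights ω moving through delay elements D, broadcast input X[n], and output Y[n].]

Point I is executed in PE $P^T I$ at time $S^T I$.

Point (i,j) is executed at PE i+j at time i.

Broadcast input, move weights, results stay.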
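The rule "point (i,j) is executed at PE i+j at time i" can be tabulated with a short sketch. It assumes P^T = [1 1] and S^T = [1 0]; the DG size (4 inputs, 3 weights) and the helper name `space_time_map` are arbitrary choices for this example.

```python
from collections import defaultdict


def space_time_map(nodes, P, S):
    """Group DG nodes by the (PE, time) slot they are assigned to."""
    slots = defaultdict(list)
    for (i, j) in nodes:
        pe = P[0] * i + P[1] * j
        t = S[0] * i + S[1] * j
        slots[(pe, t)].append((i, j))
    return slots


# DG for 4 inputs and 3 weights (sizes chosen for illustration).
nodes = [(i, j) for i in range(4) for j in range(3)]
slots = space_time_map(nodes, P=(1, 1), S=(1, 0))

assert all(len(v) == 1 for v in slots.values())   # no two nodes share a slot
for (pe, t), v in sorted(slots.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print(f"time {t}: PE {pe} executes node {v[0]}")
```

Because every (PE, time) slot holds at most one node, the mapping is free of conflicts for this DG.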

Design II

• How can you square the previous design with the rule that point (i,j) is executed at PE i+j at time i?

[Figure: the DG (inputs x0 to x3, weights ω0 to ω2) with each node labeled "x,y", meaning processor x at time y. Labels shown: 0,0; 1,0; 2,0; 0,1; 2,1; 3,1; 2,2; 3,2; 4,2; 3,3; 4,3; 5,3.]

Design III

$$d = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad P^T = [0\ 1], \quad S^T = [1\ 1]$$

| Edge e | $P^T e$ | $S^T e$ |
|---|---|---|
| W (1 0) | 0 | 1 |
| X (0 1) | 1 | 1 |
| Y (1 -1) | -1 | 0 |

Weights stay, move input, fan in output.

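Design III

[Figure: Design III array.]

Design IV

$$d = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad P^T = [1\ 1], \quad S^T = [1\ -1]$$

| Edge e | $P^T e$ | $S^T e$ |
|---|---|---|
| W (1 0) | 1 | 1 |
| X (0 1) | -1 | 1 |
| Y (1 -1) | 0 | 2 |

For the X edge, $P^T e = 1$ and $S^T e = -1$; the table lists the reversed edge $-e$, so that the delay is positive.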

Design IV

[Figure: Design IV array, with delay elements D.]

Design V

$$d = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad P^T = [1\ 1], \quad S^T = [2\ 1]$$

| Edge e | $P^T e$ | $S^T e$ |
|---|---|---|
| W (1 0) | 1 | 2 |
| X (0 1) | 1 | 1 |
| Y (1 -1) | 0 | 1 |

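Design V

[Figure: Design V array; each PE contains a multiplier ⊗ and an adder ⊕, with delay elements D on the links.]

Dual

$$d = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad P^T = [1\ 1], \quad S^T = [1\ 2]$$

| Edge e | $P^T e$ | $S^T e$ |
|---|---|---|
| W (1 0) | 1 | 1 |
| X (0 1) | 1 | 2 |
| Y (1 -1) | 0 | 1 |

For the Y edge, $S^T e = -1$; the table lists the reversed edge $-e$, so that the delay is positive.

Dual of the previous design: X and W are exchanged.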

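Design VI

$$d = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad P^T = [0\ 1], \quad S^T = [1\ -1]$$

| Edge e | $P^T e$ | $S^T e$ |
|---|---|---|
| W (1 0) | 0 | 1 |
| X (0 1) | -1 | 1 |
| Y (1 -1) | -1 | 2 |

For the X edge, $P^T e = 1$ and $S^T e = -1$; the table lists the reversed edge $-e$, so that the delay is positive.

[Figure: linear array of PEs (PE0, PE1, ...); the x links carry delay D and the y links carry delay 2D.]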


Dual


Transformation


Scheduling Vector

• Consider the dependence X → Y.

• Y can start only after X has started and completed.

• We also have to take into consideration the time it takes the data to travel from X to Y.

• This places constraints on the scheduling vector.

$$X: I_x = \begin{pmatrix} i_x \\ j_x \end{pmatrix} \rightarrow Y: I_y = \begin{pmatrix} i_y \\ j_y \end{pmatrix}$$

$$S_y \geq S_x + T_x$$

Linear scheduling:

$$S_x = S^T I_x = (s_1\ s_2) \begin{pmatrix} i_x \\ j_x \end{pmatrix}, \qquad S_y = S^T I_y = (s_1\ s_2) \begin{pmatrix} i_y \\ j_y \end{pmatrix}$$

Affine scheduling:

$$S_x = S^T I_x + \gamma_x = (s_1\ s_2) \begin{pmatrix} i_x \\ j_x \end{pmatrix} + \gamma_x, \qquad S_y = S^T I_y + \gamma_y = (s_1\ s_2) \begin{pmatrix} i_y \\ j_y \end{pmatrix} + \gamma_y$$

Using affine scheduling,

$$S^T I_y + \gamma_y \geq S^T I_x + \gamma_x + T_x$$

The scheduling inequality for an edge:

$$S^T e_{x \to y} + \gamma_y - \gamma_x \geq T_x$$

Assume that $e_{x \to y} = I_y - I_x$.
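A minimal sketch of the scheduling inequality as a predicate. The function name `edge_ok` and the sample node positions are hypothetical and only serve to show how the terms fit together.

```python
def edge_ok(s, I_x, I_y, gamma_x, gamma_y, T_x):
    """Check s^T e + gamma_y - gamma_x >= T_x with e = I_y - I_x."""
    e = [b - a for a, b in zip(I_x, I_y)]
    return sum(si * ei for si, ei in zip(s, e)) + gamma_y - gamma_x >= T_x


# Hypothetical numbers: node X at (2, 1) feeds node Y at (3, 0), and T_x = 8.
print(edge_ok(s=(9, 1), I_x=(2, 1), I_y=(3, 0), gamma_x=0, gamma_y=0, T_x=8))  # True
print(edge_ok(s=(1, 0), I_x=(2, 1), I_y=(3, 0), gamma_x=0, gamma_y=0, T_x=8))  # False
```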

Scheduling Vector

• Capture all the fundamental edges (reduced dependence graph, RDG).

• Use the regular iterative algorithm (RIA) to describe the problem.

• Construct the scheduling inequalities and solve them for a feasible $S^T$.

RIA Description

• The regular iterative algorithm has two standard forms:

• Standard input: the indices of the inputs are the same in all equations.

• Standard output: the indices of the outputs are the same in all equations.

RIA Description

• W(i+1,j) = W(i,j)

• X(i,j+1) = X(i,j)

• Y(i+1,j-1) = Y(i,j) + W(i+1,j-1) X(i+1,j-1)

[Figure: DG with inputs x0 to x3 and weights ω0 to ω2.]

Not RIA: the indices are not the same.

RIA Description

• W(i,j) = W(i-1,j)

• X(i,j) = X(i,j-1)

• Y(i,j) = Y(i-1,j+1) + W(i,j) X(i,j)

[Figure: DG with inputs x0 to x3 and weights ω0 to ω2.]

RIA

RIA Description

$$\begin{aligned} W(i,j) &= W(i-1,j) \\ X(i,j) &= X(i,j-1) \\ Y(i,j) &= Y(i-1,j+1) + W(i,j)\,X(i,j) \end{aligned}$$

Reduced RIA graph for the FIR filter
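The three RIA equations W(i,j) = W(i-1,j), X(i,j) = X(i,j-1), and Y(i,j) = Y(i-1,j+1) + W(i,j) X(i,j) can be evaluated directly as recurrences. The sketch below assumes boundary values that the slides do not spell out: weights enter at i = 0, inputs enter at j = 0, and any Y outside the grid is zero. The function name `fir_ria` is also made up for this example.

```python
def fir_ria(w, x):
    """Evaluate the RIA recurrences on a len(x) by len(w) index grid."""
    K, N = len(w), len(x)
    W, X, Y = {}, {}, {}
    for i in range(N):
        for j in range(K):
            W[i, j] = w[j] if i == 0 else W[i - 1, j]   # W(i,j) = W(i-1,j)
            X[i, j] = x[i] if j == 0 else X[i, j - 1]   # X(i,j) = X(i,j-1)
    for i in range(N):                                  # row i-1 is ready before row i
        for j in range(K):
            # Y(i,j) = Y(i-1,j+1) + W(i,j) X(i,j); missing neighbours count as 0
            Y[i, j] = Y.get((i - 1, j + 1), 0) + W[i, j] * X[i, j]
    return [Y[n, 0] for n in range(N)]                  # y(n) is read at j = 0


print(fir_ria([1, 2, 3], [1, 0, 0, 2]))   # [1, 2, 3, 2]
```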

Scheduling inequality for an edge: $s^T e_{x \to y} + \gamma_y - \gamma_x \geq T_x$, with $T_{mul} = 5$, $T_{ad} = 2$, $T_{comm} = 1$.

$$\begin{aligned}
W \to Y &: \quad e = [0\ 0]^T, & \gamma_y - \gamma_w &\geq 0 \\
X \to X &: \quad e = [0\ 1]^T, & s_2 + \gamma_x - \gamma_x &\geq 1 \\
W \to W &: \quad e = [1\ 0]^T, & s_1 + \gamma_w - \gamma_w &\geq 1 \\
X \to Y &: \quad e = [0\ 0]^T, & \gamma_y - \gamma_x &\geq 0 \\
Y \to Y &: \quad e = [1\ -1]^T, & s_1 - s_2 + \gamma_y - \gamma_y &\geq 5 + 2 + 1
\end{aligned}$$
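With all γ's set to zero, the inequalities for the FIR filter reduce to s1 ≥ 1, s2 ≥ 1, and s1 - s2 ≥ 8 (the two edges with e = [0 0]^T are satisfied trivially). The sketch below searches a small box of integer candidates by brute force. The search range and the tie-breaking rule (smallest |s1| + |s2|) are arbitrary choices for illustration, not part of the method on the slides.

```python
from itertools import product

T_MUL, T_ADD, T_COMM = 5, 2, 1

# (edge vector e, required delay) for the edges that constrain s when all gammas are 0.
constraints = [
    ((1, 0), T_COMM),                    # W -> W
    ((0, 1), T_COMM),                    # X -> X
    ((1, -1), T_MUL + T_ADD + T_COMM),   # Y -> Y
]


def feasible(s):
    """True if s^T e >= required delay for every constraining edge."""
    return all(s[0] * e[0] + s[1] * e[1] >= t for e, t in constraints)


# Search a small box of integer candidates (the range is an arbitrary choice)
# and keep the feasible vector with the smallest |s1| + |s2|.
candidates = [s for s in product(range(-12, 13), repeat=2) if feasible(s)]
best = min(candidates, key=lambda s: (abs(s[0]) + abs(s[1]), s))
print(best)   # (9, 1)
```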

FIR

• Solve the set of inequalities, assuming all γ's to be zero.

• A possible solution is $s^T = [9\ 1]$.

• A possible selection is $d = [1\ -1]^T$ and $p^T = [1\ 1]$.

• $s^T d = 8$, so HUE = 1/8.

| $e^T$ | $P^T e$ | $S^T e$ |
|---|---|---|
| W (1,0) | 1 | 9 |
| X (0,1) | 1 | 1 |
| Y (1,-1) | 0 | 8 |

FIR

[Figure: the resulting FIR systolic array, with delay D on the X links, 9D on the W links, and 8D on the Y path of each PE.]

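Matrix Multiplication

$$\begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$$

$$\begin{aligned}
c_{11} &= a_{11} b_{11} + a_{12} b_{21} \\
c_{12} &= a_{11} b_{12} + a_{12} b_{22} \\
c_{21} &= a_{21} b_{11} + a_{22} b_{21} \\
c_{22} &= a_{21} b_{12} + a_{22} b_{22}
\end{aligned}$$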
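Matrix multiplication can be written as a triple loop in which each iteration is one node (i, j, k) of a three-dimensional dependence graph. The sketch below assumes square matrices stored as nested lists, and the function name `matmul_dg` is chosen only for this example.

```python
def matmul_dg(A, B):
    """C[i][j] = sum_k A[i][k] * B[k][j], visiting one DG node (i, j, k) per product."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):       # partial sums of c_ij accumulate along k
                C[i][j] += A[i][k] * B[k][j]
    return C


print(matmul_dg([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```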

Matrix Multiplication

$T_{accu} = 1$, $T_{com} = 0$

Matrix Multiplication

$$S^T = [1\ 1\ 1], \quad d^T = [0\ 0\ 1], \quad p = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$$
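As a quick sanity check, the sketch below takes the vectors of this slide to be S^T = [1 1 1], d^T = [0 0 1], and p with rows [1 0 0] and [0 1 0], and verifies the mapping constraints for a 3×3 problem. The matrix size and the variable names are assumptions made for this example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


S = (1, 1, 1)
d = (0, 0, 1)
P = ((1, 0, 0), (0, 1, 0))   # rows of the 2x3 processor space matrix

assert all(dot(row, d) == 0 for row in P)   # p d = 0
hue = 1 / abs(dot(S, d))                    # 1/|S^T d|

n = 3  # matrix size, chosen only for illustration
nodes = [(i, j, k) for i in range(n) for j in range(n) for k in range(n)]
pes = {tuple(dot(row, I) for row in P) for I in nodes}
slots = {(tuple(dot(row, I) for row in P), dot(S, I)) for I in nodes}

print(hue, len(pes), len(slots) == len(nodes))   # 1.0 9 True
```

Under these assumptions the 27 DG nodes map onto 9 PEs, and no two nodes share a (PE, time) slot.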

Solution 2

HUE = 1

$$S^T = [1\ 1\ 1], \quad d^T = [1\ 1\ -1], \quad p = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}$$

[Figure: 5×5 grid of PEs labeled 0,0 through 4,4, with delay elements D on the links; the inputs a00 to a22 and b00 to b22 enter from the boundaries.]

Solution 3

$$S^T = (1\ 1\ 1), \quad d = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}, \quad P^T = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & 0 \end{pmatrix}$$

Solution 4

$$S^T = (1\ 1\ 1), \quad d = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \quad P^T = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \end{pmatrix}$$

Solution 5

$$S^T = (1\ 2\ 1), \quad d = \begin{pmatrix} 0 \\ 1 \\ -1 \end{pmatrix}, \quad P^T = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \end{pmatrix}$$

| Vector | $P^T e$ | $s^T e$ |
|---|---|---|
| a (0,1,0) | (1,0) | 2 |
| b (1,0,0) | (0,1) | 1 |
| c (0,0,1) | (1,0) | 1 |
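The edge table for this solution can be recomputed from the mapping. The sketch assumes S^T = (1 2 1) and P^T with rows (0 1 1) and (1 0 0), which is the reading of the slide that agrees with the table entries; the helper name `dot` and the dictionary layout are choices made for this example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


S = (1, 2, 1)
P = ((0, 1, 1), (1, 0, 0))   # rows of the assumed 2x3 matrix P^T
edges = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}

# For every edge: displacement in the 2-D array (P^T e) and delay (s^T e).
table = {name: (tuple(dot(row, e) for row in P), dot(S, e))
         for name, e in edges.items()}
for name, (move, delay) in table.items():
    print(name, move, delay)
```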

[Figure: 5×3 grid of PEs labeled 0,0 through 4,2, with inputs a00 to a22 and b00 to b22.]

Assuming one delay element in every PE.

Solution 6

$$S^T = (1\ 1\ 1), \quad d = \begin{pmatrix} 2 \\ 1 \\ 1 \end{pmatrix}, \quad P^T = \begin{pmatrix} 1 & -1 & -1 \\ 0 & 1 & -1 \end{pmatrix}$$

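Solution 7

$$S^T = (1\ 2\ 1)$$

[The vector d and the matrix $P^T$ are garbled in the source: d has entries of magnitude 1, 1, 2 (two of them negative), and $P^T$ is a 2×3 matrix with four entries of magnitude 1 (one negative) and two zeros.]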


Solution 4