18-May-16 Rudolf Mak TU/e Computer Science Systolic 1
VLSI programming
Systolic Design
Book: Parhi, Chp. 7
Rudolf Mak
Agenda
• Systolic arrays (what, where)
• Regular Iterative Algorithms (RIAs)
• Dependence graphs (regular, reduced)
• Systolic design techniques
– Binding (computations to PEs)
– Scheduling (computations to time slots)
• Examples
– FIR filters, matrix multipliers
FSM reminder
[Figure: block diagrams of a Moore machine and a Mealy machine (combinational logic CL plus state register)]

Chaining Mealy machines may lead to overly long critical paths!
Systolic system (Leiserson)
A systolic system is a set of interconnected Moore
machines that operate synchronously and satisfy
certain smallness (boundedness) conditions:
1. # states is bounded
2. # input ports is bounded
3. # output ports is bounded
4. # neighbor machines is bounded
“#” stands for “number of”
Systolic = Uniform Pipelined SDF
• Uniform: each PE (Moore machine) computes the same set of combinatorial functions.
• Regular: all PEs are connected to a small finite number of neighboring PEs via one or more D-elements according to a regular topology. All connections are point-to-point connections.
• Synchronous operation: all PEs operate in lock step (fire concurrently); data is pumped through the system, much like the heart pumps blood through the body (hence the name systolic).
Relaxations
• To obtain better systems, small relaxations of the systolic model are allowed:
1. Not all PEs are identical; small deviations are allowed, especially for PEs at the border of the system.
2. (A limited form of) broadcasting is allowed. This means that PEs have become Mealy machines.
   – These systems are called semi-systolic by Leiserson.
   – Parhi does not make the distinction. Instead he uses the notion fully pipelined for the Moore machine variant.
3. Connections need not be to nearest neighbors, but locality needs to be maintained.
Systolic system
[Figure: a host connected to a systolic array of PEs. The host is a Turing-equivalent machine, such as a PowerPC on an FPGA; the systolic array is a chain of Moore machines, such as a dedicated computing engine on an FPGA.]
Application areas
• Computationally intensive, regular
– Basic linear algebra operations
– Signal processing
– Image processing
– Order statistics, sorting
– Dynamic programming
– High performance computing
• e.g., many-particle simulations (in chemistry, physics or astronomy)
FIR filter (N-tap)
y(n) = (Σ k : 0 <= k < N : h(k)·x(n-k)),  0 <= n

Introduce partial sums (a second index j):
y(n, j) = (Σ k : j <= k < N : h(k)·x(n+j-k)),  0 <= j <= N
y(n, 0) = y(n),  y(n, N) = 0
y(n, j) = h(j)·x(n) + y(n-1, j+1)

RIA (standard output form):
y(i, j) = h(i, j)·x(i, j) + y(i-1, j+1),  y(i, N) = 0
h(i, j) = h(i-1, j),  h(-1, j) = h(j)
x(i, j) = x(i, j-1),  x(i, -1) = x(i)

Spec → RIA
A recursion in the single time index n alone does not work!!!
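The partial-sum recurrence above can be checked mechanically. The following Python sketch (the language, the helper names, and the sample values are ours, not from the slides) evaluates the recurrence and compares it against the direct convolution:

```python
# Partial sums of the N-tap FIR: y(n, j) = h(j)*x(n) + y(n-1, j+1), y(n, N) = 0.
N = 3
h = [2, 3, 5]                  # hypothetical coefficients
x = [1, 4, 1, 5, 9, 2, 6]      # hypothetical input samples

def y_ria(n, j):
    """Partial sum y(n, j); y(n, 0) is the filter output y(n)."""
    if j == N:
        return 0               # boundary: y(n, N) = 0
    if n < 0:
        return 0               # samples before time 0 are taken as zero
    return h[j] * x[n] + y_ria(n - 1, j + 1)

def y_spec(n):
    """Direct specification y(n) = sum_k h(k)*x(n-k)."""
    return sum(h[k] * x[n - k] for k in range(N) if n - k >= 0)

assert all(y_ria(n, 0) == y_spec(n) for n in range(len(x)))
```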
Regular Iterative Algorithm
A RIA is a triple consisting of:
1. An index space
2. A finite set of variables
3. A set of direct dependencies among indexed variables (given as equalities)
   • with associated index displacement vectors
   • also called fundamental edges by Parhi

Canonical forms:
1. Standard input
2. Standard output

For the FIR filter: index space {(i, j) | 0 <= i, 0 <= j < N}, variables h, x, y; x(i, j) is input.
FIR-filter: RIA description

Standard output canonical form:
y(i, j) = h(i, j)·x(i, j) + y(i-1, j+1),  y(i, N) = 0
h(i, j) = h(i-1, j),  h(-1, j) = h(j)
x(i, j) = x(i, j-1),  x(i, -1) = x(i)

Index displacement vectors (LHS = RHS + IDV):
x → (0, 1)
h → (1, 0)
h → (0, 0)
x → (0, 0)
y → (1, -1)
Computational node
[Figure: computational node g with inputs a, b and outputs p, q, r]
p = f1(a, b)
q = f2(a, b)
r = f3(a, b)
F = { f1, f2, f3 }
Computational node from RIA
[Figure: computational node of the FIR RIA at grid point I(g) = (i, j); inputs h(i-1, j), x(i, j-1), y(i-1, j+1); outputs h(i, j), x(i, j), y(i, j)]
y(i, j) = h(i, j)·x(i, j) + y(i-1, j+1)

I(g) is the index vector, i.e., the sequence of coordinates of g in index-space.
Dependence graphs
1. The nodes of a dependence graph represent (small) computations. There is a separate node for each computation.
2. The edges of a dependence graph represent causal dependencies between computations, i.e., an edge from node x to node y indicates that the result of the computation performed by x is used in the computation performed by y.
3. There is no notion of time in a dependence graph. It is an (index-)space representation.
FIR: Dependence graph
[Figure: FIR dependence graph; inputs x(0)..x(4) and outputs y(0)..y(4) along the i-axis, coefficients h(0), h(1), h(2) along the j-axis]
y(n) = h(0)·x(n) + h(1)·x(n-1) + h(2)·x(n-2)
FIR: Dependence graph
[Figure: the same dependence graph with the boundary zeros added: each y-diagonal starts from a 0 input]
y(n) = h(0)·x(n) + h(1)·x(n-1) + h(2)·x(n-2)
Regular dependence graphs
A dependence graph G is regular when:
1. There is an injective mapping I from the nodes of G to a grid of points in the n-dimensional index space.
2. There exists a finite set E of vectors, called fundamental edges, such that every pair (x, y) of neighboring nodes is mapped to a pair of grid locations that differ by a fundamental edge e ∈ E, i.e., I(y) = I(x) + e.
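For the FIR graph, regularity can be checked by collecting the displacement between every dependent pair of nodes. A small Python sketch (the `edges_at` helper is our own encoding of the FIR RIA dependencies):

```python
# Fundamental edges of the FIR RIA: e_h, e_x, e_y.
E = {(1, 0), (0, 1), (1, -1)}

def edges_at(i, j):
    """Edges entering node (i, j): from h(i-1, j), x(i, j-1) and y(i-1, j+1)."""
    return [((i - 1, j), (i, j)),       # h-edge
            ((i, j - 1), (i, j)),       # x-edge
            ((i - 1, j + 1), (i, j))]   # y-edge

# Collect the displacement I(v) - I(u) of every edge in a window of the grid.
displacements = {(v[0] - u[0], v[1] - u[1])
                 for i in range(1, 5) for j in range(1, 3)
                 for (u, v) in edges_at(i, j)}
assert displacements == E   # regular: every edge is a translate of a fundamental edge
```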
FIR: DG in space representation
[Figure: the FIR DG drawn in index space, with fundamental edges ey = (1, -1), eh = (1, 0), ex = (0, 1)]
fundamental edges:
E = (eh | ex | ey) = [ 1  0  1 ]
                     [ 0  1 -1 ]
Systolic array design
The design of a systolic array for a computation
given in the form of a regular dependence graph
involves:
1. Choosing a processor space, i.e., a set of dimensions
and a number of PEs per dimension (the array).
2. Mapping each computational node of the graph to a
PE of the array.
3. For each PE scheduling the computations of the
nodes mapped onto it, i.e., assigning each individual
computation to a distinct time slot.
Similar to folding
Design parameters
An (n-1)-dimensional systolic design for an n-dimensional regular dependence graph is characterized by:
1. An n × (n-1) processor space matrix P:
   PT·I(x) is the processor that executes node x
2. An n-dimensional scheduling vector s:
   sT·I(x) is the time slot at which node x is executed
3. A projection (iteration) vector d:
   I(x) - I(y) = c·d implies PT·I(x) = PT·I(y)
Design constraints
• Computations whose grid locations differ by a multiple of the projection vector execute on the same PE
  – I(x) - I(y) = c·d implies PT·I(x) = PT·I(y)
  – hence PT·d = 0
• Computations that execute on the same PE must be scheduled in different time slots
  – sT·I(x) is the time slot at which node x is executed
  – hence sT·d != 0
Processor allocation:
[Figure: DG with the processor axis vertical; all nodes in row j are allocated to processor j]
dT = (1, 0),  pT = (0, 1)
processor = pT·(i, j)T = j
Scheduling:
[Figure: DG with time slots 0..4 assigned column-wise; the time axis is horizontal]
dT = (1, 0),  sT = (1, 0)
time slot = sT·(i, j)T = i
Hardware Utilization Efficiency (HUE)
Let x and y be computations with index vectors I(x), I(y) that are executed on the same PE.
• Then I(x) - I(y) = c·d.
Let tx be the time at which x is scheduled and ty the time at which y is scheduled.
• Then tx - ty = sT·(I(x) - I(y)) = c·sT·d, so |tx - ty| >= |sT·d|.
Hence, any PE executes at most 1 computation per |sT·d| time slots. So
HUE = 1 / |sT·d|
Question: what do we call sT·d?
From DG to systolic array
Map a DG onto a systolic array as follows:
• Nodes:
  – map x to processing element PT·I(x)
• Edges:
  – map x → y to connection PT·I(x) → PT·I(y)
  – insert sT·e D-elements in this edge, where e = I(y) - I(x) is a fundamental edge
Note that there are only finitely many fundamental grid edges (independent of the size of the DG), and recall that each edge is a translation of a fundamental edge.
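Applied to design B1 (pT = (0, 1), sT = (1, 0)), the rule reproduces the "H-stay, X-broadcast, Y-move" behavior. A small Python sketch (the names are our own):

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

p_T, s_T = (0, 1), (1, 0)                       # design B1
fundamental = {'e_h': (1, 0), 'e_x': (0, 1), 'e_y': (1, -1)}

# For each fundamental edge: (how many PEs it spans, how many D-elements it gets).
mapping = {name: (dot(p_T, e), dot(s_T, e)) for name, e in fundamental.items()}
# e_h: same PE, 1 delay (H-stay); e_x: next PE, 0 delays (X-broadcast);
# e_y: previous PE, 1 delay (Y-move).
assert mapping == {'e_h': (0, 1), 'e_x': (1, 0), 'e_y': (-1, 1)}
```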
B1: H-stay, X-broadcast, Y-move
e             pT·e   sT·e
eh = (1, 0)     0     1
ex = (0, 1)     1     0
ey = (1, -1)   -1     1

dT = (1, 0),  pT = (0, 1),  sT = (1, 0)

[Figure: linear array of 3 PEs; each PE holds its h-coefficient, x is broadcast to all PEs, and y moves through the chain with one delay per stage]
B1: H-stay, X-broadcast, Y-move
HUE = 1 / |sT·d| = 1

[Figure: B1 array with coefficients h0, h1, h2; input x(i) broadcast to all PEs; internal signals y(i), v(i), u(i)]
y(i) = h0·x(i) + v(i-1), v(i) = h1·x(i) + u(i-1), u(i) = h2·x(i) + 0
y(i) = h0·x(i) + h1·x(i-1) + h2·x(i-2)
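The array above is easy to simulate cycle by cycle. A Python sketch (the coefficients and samples are hypothetical values of our own):

```python
h0, h1, h2 = 2, 3, 5           # hypothetical coefficients
xs = [1, 4, 1, 5, 9, 2, 6]     # hypothetical input samples

v = u = 0                      # the two D-elements, initially 0
ys = []
for xi in xs:                  # one iteration = one clock tick; x(i) is broadcast
    y = h0 * xi + v            # PE0 adds the delayed partial sum v(i-1)
    v, u = h1 * xi + u, h2 * xi + 0
    ys.append(y)

# Reference: y(i) = h0*x(i) + h1*x(i-1) + h2*x(i-2) with zero history.
ref = [h0 * xs[i]
       + (h1 * xs[i - 1] if i >= 1 else 0)
       + (h2 * xs[i - 2] if i >= 2 else 0) for i in range(len(xs))]
assert ys == ref
```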
Determining P, s, and d

• Trial-and-error approach
  – Pick a combination and check whether the design constraints are fulfilled.
• Constructive approach
  1. Determine a schedule s.
  2. Determine a projection vector d such that sT·d != 0.
  3. Let Q = (dT·d)·I - d·dT. Then Q is a matrix of rank n-1 such that Q·d = 0. By sweeping, a zero column can be created in Q. Drop this column to obtain an n × (n-1) matrix P.
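The construction of Q can be illustrated for d = (1, 0), the projection vector shared by designs B1/F/W1/W2/DW2 (a Python sketch in our own notation):

```python
d = (1, 0)
n = len(d)
dTd = sum(di * di for di in d)
# Q = (dT·d)·I - d·dT
Q = [[dTd * (1 if r == c else 0) - d[r] * d[c] for c in range(n)] for r in range(n)]
assert Q == [[0, 0], [0, 1]]
# Q·d = 0, so any columns kept from Q satisfy the constraint PT·d = 0.
assert all(sum(Q[r][c] * d[c] for c in range(n)) == 0 for r in range(n))
# Column 0 is already zero; dropping it leaves P = (0, 1)T, i.e. pT = (0, 1),
# the processor allocation used by design B1.
P = [[row[1]] for row in Q]
assert P == [[0], [1]]
```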
FIR-designs (Parhi)
        sT       dT       pT       pT(eh|ex|ey)   sT(eh|ex|ey)
B1      (1, 0)   (1, 0)   (0, 1)   (0, 1, -1)     (1, 0, 1)
F       (1, 1)   (1, 0)   (0, 1)   (0, 1, -1)     (1, 1, 0)
W1      (2, 1)   (1, 0)   (0, 1)   (0, 1, -1)     (2, 1, 1)
W2      (1, 2)   (1, 0)   (0, 1)   (0, 1, -1)     (1, 2, -1)
DW2     (1, -1)  (1, 0)   (0, 1)   (0, 1, -1)     (1, -1, 2)
B2      (1, 0)   (1, -1)  (1, 1)   (1, 1, 0)      (1, 0, 1)
R1      (1, -1)  (1, -1)  (1, 1)   (1, 1, 0)      (1, -1, 2)
R2      (2, 1)   (1, -1)  (1, 1)   (1, 1, 0)      (2, 1, 1)
DR2     (1, 2)   (1, -1)  (1, 1)   (1, 1, 0)      (1, 2, -1)

A negative entry sT·e means that fundamental edge is taken in the reverse direction in the DG (W2, DR2: ey := -ey; DW2, R1: ex := -ex).
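The last two columns of the table follow from sT, pT and the fundamental edges. The following Python sketch (our own encoding of the table) recomputes them and checks the design constraints for all nine designs:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

E = [(1, 0), (0, 1), (1, -1)]      # (eh, ex, ey)
designs = {                        # name: (sT, dT, pT), copied from the table
    'B1': ((1, 0), (1, 0), (0, 1)),
    'F':  ((1, 1), (1, 0), (0, 1)),
    'W1': ((2, 1), (1, 0), (0, 1)),
    'W2': ((1, 2), (1, 0), (0, 1)),
    'DW2': ((1, -1), (1, 0), (0, 1)),
    'B2': ((1, 0), (1, -1), (1, 1)),
    'R1': ((1, -1), (1, -1), (1, 1)),
    'R2': ((2, 1), (1, -1), (1, 1)),
    'DR2': ((1, 2), (1, -1), (1, 1)),
}
for name, (s, d, p) in designs.items():
    assert dot(p, d) == 0 and dot(s, d) != 0   # design constraints hold

# Spot-check the two derived columns against the table.
assert tuple(dot(designs['B1'][2], e) for e in E) == (0, 1, -1)
assert tuple(dot(designs['R1'][0], e) for e in E) == (1, -1, 2)
```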
Design R1: dependence graph
[Figure: the R1 dependence graph; the x-edges are reversed, with drawn edges (1, -1), (1, 0), (0, -1)]
fundamental edges:
E = (eh | -ex | ey) = [ 1  0  1 ]
                      [ 0 -1 -1 ]
Space-time diagram R1
[Figure: space-time diagram; horizontal axis t = 0..12, vertical axis p = 0..10]
dT = (1, -1),  pT = (1, 1),  sT = (1, -1)
p = pT·I(g) = i + j
t = sT·I(g) = i - j
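The space-time coordinates for R1 can be tabulated directly. A Python sketch over a finite window of the DG (before the mod-3 folding of the processor axis):

```python
nodes = [(i, j) for i in range(8) for j in range(3)]    # a finite window of the DG
# (p, t) = (pT·I(g), sT·I(g)) with pT = (1, 1) and sT = (1, -1)
pos = {(i + j, i - j) for (i, j) in nodes}
assert len(pos) == len(nodes)                           # no two nodes collide

# Nodes on the same processor are at least |sT·d| = 2 time slots apart (2-slow).
times = {}
for (i, j) in nodes:
    times.setdefault(i + j, []).append(i - j)
assert all(abs(a - b) >= 2
           for ts in times.values() for a in ts for b in ts if a != b)
```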
Processor allocation R1:
[Figure: DG nodes assigned to the 3 processors]
dT = (1, -1)
p = pT·(i, j)T mod 3 = (i + j) mod 3
Scheduling R1:
[Figure: DG nodes assigned to time slots 0..10]
dT = (1, -1),  sT = (1, -1)
t = sT·(i, j)T = i - j   (shifted by a multiple of 3 for nodes folded by the mod-3 processor allocation)
R1: H-move, X-move, Y-stay
e              pT·e   sT·e
eh = (1, 0)      1     1
-ex = (0, -1)   -1     1
ey = (1, -1)     0     2

dT = (1, -1),  pT = (1, 1),  sT = (1, -1)

[Figure: linear array of 3 PEs; the h- and x-streams move through the array in opposite directions (one delay per edge), while y stays in its PE (two delays in the accumulation loop)]
R1: H-move, X-move, Y-stay
HUE = 1 / |sT·d| = 1 / 2   (2-slow)

[Figure: snapshots of the R1 array at successive time slots; the h-stream (h0, h1, h2) and the x-stream, both interleaved with zeros because the design is 2-slow, move in opposite directions while the outputs y(0), y(1), y(2) accumulate in place]
R1: H-move, X-move, Y-stay
[Figure: one snapshot of the R1 array, with the internal signals of a PE labeled H, X, V, W and the accumulator Y; the h- and x-streams are interleaved with zeros]
[Table: cycle-by-cycle trace (t = 0..13) of the signals in the R1 array; every other time slot carries zeros (2-slow operation), and the outputs y0, y1, y2 leave the array as they complete]
Matrix multiplication (n × n): RIA

c(i, j) = (Σ k : 0 <= k < n : a(i, k)·b(k, j))

Partial sums and propagated copies:
C(i, j, k) = (Σ m : 0 <= m < k : a(i, m)·b(m, j))
A(i, j, k) = a(i, k-1)
B(i, j, k) = b(k-1, j)

RIA:
C(i, j, 0) = 0
C(i, j, k) = C(i, j, k-1) + A(i, j, k)·B(i, j, k)
A(i, j, k) = A(i, j-1, k)
B(i, j, k) = B(i-1, j, k)
for 0 <= i, j < n, 0 < k <= n;  c(i, j) = C(i, j, n)
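A quick consistency check of these recurrences (a Python sketch; the matrices are hypothetical test values, and the closed forms for A and B are used directly):

```python
n = 3
a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

def A(i, j, k):            # closed form: A(i, j, k) = a(i, k-1)
    return a[i][k - 1]
def B(i, j, k):            # closed form: B(i, j, k) = b(k-1, j)
    return b[k - 1][j]
def C(i, j, k):            # C(i, j, 0) = 0; C(i,j,k) = C(i,j,k-1) + A(i,j,k)*B(i,j,k)
    return 0 if k == 0 else C(i, j, k - 1) + A(i, j, k) * B(i, j, k)

ref = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
       for i in range(n)]
assert all(C(i, j, n) == ref[i][j] for i in range(n) for j in range(n))
```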
Dependence graph for n = 3 (finite!)

[Figure: 3-dimensional DG over index space (i, j, k), 0 <= i, j, k < 3; A propagates in the j-direction, B in the i-direction, C accumulates in the k-direction]
Kung-Leiserson design
• Scheduling vector
  – sT = (1, 1, 1)
• Projection vector
  – dT = (1, 1, 1)
• Projection space matrix
  – PT = [ 1  0 -1 ]
         [ 0  1 -1 ]
• HUE = 1/3

e               PT·e      sT·e
eA = (0, 1, 0)  (0, 1)     1
eB = (1, 0, 0)  (1, 0)     1
eC = (0, 0, 1)  (-1, -1)   1
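The Kung-Leiserson mapping can be checked for n = 3 with a short Python sketch (`place` and `time_of` are our names for PT·I and sT·I):

```python
from collections import defaultdict

n = 3
nodes = [(i, j, k) for i in range(n) for j in range(n) for k in range(n)]

def place(i, j, k):        # PT·(i, j, k): the rows (1, 0, -1) and (0, 1, -1)
    return (i - k, j - k)

def time_of(i, j, k):      # sT·(i, j, k) with sT = (1, 1, 1)
    return i + j + k

# Collision-free: every node gets its own (PE, time slot) pair.
assert len({(place(*v), time_of(*v)) for v in nodes}) == len(nodes)

# Nodes sharing a PE (differing by d = (1,1,1)) are multiples of
# |sT·d| = 3 time slots apart -> HUE = 1/3.
by_pe = defaultdict(list)
for v in nodes:
    by_pe[place(*v)].append(time_of(*v))
assert all(abs(a - b) >= 3
           for ts in by_pe.values() for a in ts for b in ts if a != b)
```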
Kung-Leiserson (3x3)-matrix multiplication systolic array

[Figure: hexagonal systolic array with processor coordinates x = i - k, y = j - k; delay-elements not drawn: one on each edge!]
KL-array processor allocation (binding)

[Figure: DG nodes bound to PEs; the workload is unbalanced]
Dependence graph for n = 3

[Figure: the same 3-dimensional DG, now annotated with the projection direction d = (1, 1, 1)]
KL-array: 3-slow schedule, HUE = 1/3
KL-array details

In addition to the previous slides the following issues must be addressed:
• For both A and B there are 5 input streams
  – How are the matrix values distributed over them?
• For C there are 5 output streams
  – How are the resulting values distributed over them?
• How are results that become available at an internal PE propagated to the border?
• How to operate this array for multiple multiplications?
  – Flushing old values can be combined with getting internal results out.
Summary
1. Systolic architectures are attractive for implementation media like VLSI circuits and FPGAs.
2. Starting point for systolic design is a RIA (or a dependence graph).
3. RIAs can be mapped to systolic arrays in a systematic fashion.
4. Mapping uses simple linear algebra techniques.
5. A large variety of designs for a single problem can be obtained.
Exercise (systolic design)

1. An OCL system is a system that counts (#), for each window of size w on its input stream, the number of times the last received value occurs in that window, i.e., for w - 1 <= n:
   y(n) = #{ k : 0 <= k < w : x(n-k) = x(n) },
   where x is the input stream and y the output stream.
a) Derive a RIA (in standard output form) for this system that satisfies the equations
   X(n, j) = x(n), for 0 <= j < w
   Y(n, j) = #{ k : n-j <= k <= n : x(k) = x(n) }, for 0 <= j < w
   W(n, j) = x(n-j), for 0 <= j < w
   Note that W(n, j) = X(n-j, j)‼
b) Draw the dependence graph of this RIA for w = 4 (you need to draw only the part with 0 <= n <= 6).
2. Consider the scheduling, projection and processor vector
a) Construct the systolic array that corresponds to these vectors. You
may assume the existence of a comparator operator that takes
two input streams and produces an output stream of ones and
zeros, for equal and unequal input pairs, respectively.
b) Determine the slowness of your design.
3. Assume that the times to perform a comparison and an addition are given by Tcmp = 1ns and Tadd = 3ns, respectively. Give the maximum throughput and the latency of your design (taking slowness into account). Give the latency both in number of delays and in real time (ns).

Exercise (systolic design)

sT = (2, 1),  dT = (1, 0),  pT = (0, 1)   (the vectors for item 2)
4. Next, replace the scheduling vector by sT = (1, 0). Compare the
throughput and latency of the resulting systolic array with that of the one with sT = (2, 1).
5. Consider the design of exercise 4.
a) Eliminate redundant operators, and optimize the throughput by
pipelining. Give the resulting throughput and latency.
b) Next retime the result of a), keeping throughput and latency fixed, to
obtain the minimum number of delays.
Exercise (systolic design)