ELEC692 VLSI Signal Processing Architecture Lecture 6 Array Processor Structure


Introduction

• Regular and recursive algorithms are common in DSP applications.
  – They repeatedly apply an identical series of pre-determined operations
  – E.g. matrix operations
  – Regular architectures based on identical processing elements
    • Utilizing massive parallel processing and intensive pipelining
  – Systems with programmable processors – multiprocessor systems
  – Systems with application-specific processors – array processors
    • Global clock synchronized array – systolic arrays
    • Self-timed asynchronous data transfer – wavefront arrays
• Issues
  – How is the array processor design dependent on the algorithm?
  – How is the algorithm best implemented in the array processors?

What is a systolic array

• Computational networks with distributed data storage and a distributed arrangement of processing elements (PEs), so that a high number of operations are carried out simultaneously.
• Multiple PEs to maximize processing per memory access

[Figure: Conventional processor, one PE attached to a memory with 10 ns access time: 100 MOPS (Million Operations Per Second). Array processor, four chained PEs attached to the same 10 ns memory: 400 MOPS.]

Characteristics of Array processors

• Parallelism
  – Both data operations and data transfers
• Locality
  – Connections exist only to directly neighbouring PEs
• Regularity and modularity
  – Both computation and communication structures
• Processing elements can be simple (e.g. a single addition/multiplication) or complex
• Why is it called a systolic array?
  – Analogy to the circulatory system with its two phases, systole and diastole
  – Systems that are entirely pipelined and have periodic computation and data transfer cycles
  – Synchronous clocking and control signals

Array structure examples

[Figure: PEs connected as a 1D array (a row of 4 PEs), a 2D array (3×3 grid of PEs) and a 3D array (3×3×3 lattice of PEs).]

Drawback of Array processing

• Not all algorithms can be mapped to and implemented on a systolic array architecture
  – Only algorithms whose operations and operands are fixed prior to run-time can be mapped
  – Adaptive algorithms, in which the particular operations and operands depend on the data to be processed, cannot be used.
• Cost in hardware and area is high
• Cost in latency

Data dependency

• Express the algorithm in terms of its inherent dependencies
  – Single-assignment code and local data dependency
  – E.g. matrix multiplication

\[ C = AB, \quad \text{where } c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj}, \quad 1 \le i, j \le n \]

Sequential code:

    FOR i:=1 TO n DO
      FOR j:=1 TO n DO
      BEGIN
        c(i,j) := 0;
        FOR k:=1 TO n DO
          c(i,j) := c(i,j) + a(i,k)*b(k,j);
      END;

Single-assignment code with local data dependencies:

    FOR i:=1 TO n DO
      FOR j:=1 TO n DO
      BEGIN
        c(i,j,0) := 0;
        FOR k:=1 TO n DO
        BEGIN
          IF i=1 THEN b(1,j,k) := b_in(k,j)
                 ELSE b(i,j,k) := b(i-1,j,k);
          IF j=1 THEN a(i,1,k) := a_in(i,k)
                 ELSE a(i,j,k) := a(i,j-1,k);
          c(i,j,k) := c(i,j,k-1) + a(i,j,k)*b(i,j,k);
        END;
        c_out(i,j) := c(i,j,n);
      END;
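The single-assignment form can be checked against the plain triple loop with a short Python sketch. The function names and the dictionary-based storage are illustrative assumptions, not part of the lecture; indices run from 1 to n as in the pseudocode.

```python
def matmul_direct(a_in, b_in):
    """Plain triple loop: c(i,j) is overwritten n times."""
    n = len(a_in)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a_in[i][k] * b_in[k][j]
    return c


def matmul_single_assignment(a_in, b_in):
    """Single-assignment form: every a(i,j,k), b(i,j,k), c(i,j,k) is written once
    and only read from a directly neighbouring index point."""
    n = len(a_in)
    a, b, c = {}, {}, {}
    c_out = [[0] * n for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            c[i, j, 0] = 0
            for k in range(1, n + 1):
                # b enters at the i=1 boundary and is passed along the i direction
                b[i, j, k] = b_in[k - 1][j - 1] if i == 1 else b[i - 1, j, k]
                # a enters at the j=1 boundary and is passed along the j direction
                a[i, j, k] = a_in[i - 1][k - 1] if j == 1 else a[i, j - 1, k]
                # partial sums accumulate along the k direction
                c[i, j, k] = c[i, j, k - 1] + a[i, j, k] * b[i, j, k]
            c_out[i - 1][j - 1] = c[i, j, n]
    return c_out


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_single_assignment(A, B) == matmul_direct(A, B))
```

Each variable instance is keyed by its full index (i, j, k), so no value is ever overwritten; this is what makes the dependencies explicit enough to draw as a graph.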

Data dependency graph (DG)

• A graph that specifies the data dependencies in an algorithm
• E.g. DG for the matrix multiplication

DG of matrix-vector multiplication

\[ c = Ab, \quad \text{where } c_i = \sum_{k=1}^{n} a_{ik}\, b_k, \quad 1 \le i \le n \]

Systolic Array Fundamentals

• Systolic architectures are designed by using linear mapping techniques on a regular dependency graph (DG).
• Regular dependency graph: the presence of an edge in a certain direction at any node in the DG represents the presence of an edge in the same direction at all nodes in the DG.
• The DG corresponds to a space representation: no time instance is assigned to any computation.
• Systolic architectures have a space-time representation, where each node is mapped to a certain PE and is scheduled at a particular time instance.
• The systolic design methodology maps an N-dimensional DG to a lower-dimensional systolic architecture.

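Example of DG

• Space representation for a FIR filter
  – y(n) = w0 x(n) + w1 x(n-1) + w2 x(n-2)

[Figure: DG of the 3-tap FIR filter with inputs x(0) … x(5), weights w0, w1, w2 and outputs y(0) … y(5). Edge directions are (0,1)^T, (1,0)^T and (1,-1)^T. The axes are labelled as the time axis (ticks 0 to 5) and the processor axis j' = j.]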

Linear Mapping Methods

• Design an application-specific array processor architecture for a given algorithm
  – Satisfy the demands regarding computational performance and throughput
  – Minimize hardware cost
  – Regular structure for VLSI implementation

[Figure: design flow from algorithm to dependency graph to signal flow graph to architecture. The step from dependency graph to signal flow graph is the assignment of the operations to processors and points in time (assignment and scheduling).]

Estimation of # of PE

• The # of PEs, $n_{PE}$, is generally smaller than the # of nodes in the dependency graph, $n_{DG}$.
• It is important to know both, since they specify how many nodes of the DG must be mapped to a PE.
• The processing time of a PE is $T_{PE}$ (including the transfer time of the register).
• Within $T_{PE}$, each PE carries out a total of $n_{OP/PE}$ operations.
• The computational rate of the processor array is then given by

\[ R_{C,ARRAY} = \frac{n_{PE}\, n_{OP/PE}}{T_{PE}} \]

• With the desired throughput $R_{T,target}$ and the # of operations per sample $n_{OP/SAMPLE}$, we have

\[ n_{PE} = \frac{n_{OP/SAMPLE}}{n_{OP/PE}}\, R_{T,target}\, T_{PE} \]

Estimation of # of PE

• The # of PEs can be further reduced through pipelining within the PEs. Given $n_P$ as the # of pipeline stages along the datapath of the PE, the new $T'_{PE}$ is

\[ T'_{PE} = \frac{T_{PE} - T_{REG}}{n_P} + T_{REG} \]

• Example
  – Samples of a colour TV signal (27 MHz sampling rate) are to be reformatted into 8×8 blocks, and each of the blocks is to be multiplied by an 8×8 matrix
  – Samples and matrix coefficients are 8 bits wide
  – The accumulated product is 19 bits wide
  – A PE contains 1 MUL, 1 ADD and a register at its output

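Example of Estimation of # of PE

\[
\begin{aligned}
n_{OP/PE} &= 2 \quad (\text{MUL}, \text{ADD}) \\
n_{OP/SAMPLE} &= 2 \cdot 8 = 16 \quad (2 \cdot 8^3 \text{ operations for } 8^2 \text{ samples}) \\
T_{PE} &= T_{MA,ARRAY} + T_{REG} \\
T_{MA,ARRAY} &\approx 27 \cdot 16\, T_L \\
T_{REG} &\approx 16\, T_L \\
n_{PE} &= 4.8 \quad (T_{PE} = 22.4\ \text{ns})
\end{aligned}
\]

With intensive pipelining

\[
\begin{aligned}
T'_{PE} &= T_{MA,CELL} + T_{REG} \\
T_{MA,CELL} &\approx 42\, T_L \\
n'_{PE} &= 0.6 \quad (T'_{PE} = 2.9\ \text{ns})
\end{aligned}
\]

(Assume $T_L$ = 50 ps)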
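The PE estimate can be reproduced numerically. The sketch below assumes the two processing times quoted in the example (22.4 ns without and 2.9 ns with intensive pipelining); the function names are illustrative, and the values passed to `pipelined_t_pe` are made-up numbers that only exercise the formula.

```python
import math


def required_pes(ops_per_sample, sample_rate_hz, t_pe_s, ops_per_pe):
    """n_PE = (n_OP/SAMPLE / n_OP/PE) * R_T,target * T_PE."""
    return ops_per_sample * sample_rate_hz * t_pe_s / ops_per_pe


def pipelined_t_pe(t_pe_s, t_reg_s, n_stages):
    """T'_PE = (T_PE - T_REG) / n_P + T_REG."""
    return (t_pe_s - t_reg_s) / n_stages + t_reg_s


# 8x8 block times 8x8 matrix: 16 operations per sample, 2 operations per PE cycle
n_pe = required_pes(16, 27e6, 22.4e-9, 2)
n_pe_pipelined = required_pes(16, 27e6, 2.9e-9, 2)
print(round(n_pe, 1), math.ceil(n_pe))
print(round(n_pe_pipelined, 1), math.ceil(n_pe_pipelined))

# Hypothetical PE: 10 ns total, 2 ns register, split into 4 pipeline stages
print(pipelined_t_pe(10e-9, 2e-9, 4))
```

Rounding up gives the number of physical PEs that would have to be built, since a fractional PE cannot be implemented.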

Some definitions

• Projection vector (also called iteration vector), $d = (d_1, d_2)^T$
  – Two nodes that are displaced by d or multiples of d are executed by the same processor.
• Scheduling vector, $s^T = (s_1, s_2)$
  – Any node with index I is executed at time $s^T I$.
• Processor space vector, $p^T = (p_1, p_2)$
  – Any node with index $I^T = (i, j)$ is executed by processor

\[ p^T I = (p_1 \;\; p_2) \begin{pmatrix} i \\ j \end{pmatrix} \]

• Hardware Utilization Efficiency, HUE = $1/|s^T d|$
  – This is because two tasks executed by the same processor are spaced $|s^T d|$ time units apart.

Systolic Array Design Methodology

• Systolic architectures are designed by selecting different projection, processor space and scheduling vectors.
• Feasibility constraints
  – The processor space vector and the projection vector must be orthogonal to each other.
    • If points A and B differ by the projection vector, i.e. $I_A - I_B$ is the same as d, then they must be executed by the same processor, i.e. $p^T I_A = p^T I_B$ and

\[ p^T (I_A - I_B) = 0 \;\Rightarrow\; p^T d = 0 \]

  – If A and B are mapped to the same processor, then they cannot be executed at the same time, i.e.

\[ s^T I_A \ne s^T I_B, \quad \text{i.e. } s^T d \ne 0 \]

• Edge mapping: if an edge e exists in the space representation or DG, then an edge $p^T e$ is introduced in the systolic array with $s^T e$ delays.
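The two feasibility constraints, the HUE and the edge mapping are all inner products, so they can be checked mechanically. The following is a minimal sketch; the helper names are assumptions, the three edges are the weight, input and result edges (1,0), (0,1) and (1,-1) of the FIR dependency graph, and the vectors tried are d = (1,0), p = (0,1), s = (1,0).

```python
from fractions import Fraction


def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


def hue(d, p, s):
    """Check p^T d = 0 and s^T d != 0, then return HUE = 1/|s^T d|."""
    if dot(p, d) != 0:
        raise ValueError("processor space vector must be orthogonal to d")
    if dot(s, d) == 0:
        raise ValueError("nodes on the same processor would share a time slot")
    return Fraction(1, abs(dot(s, d)))


def map_edges(edges, p, s):
    """Map each DG edge e to (p^T e, s^T e): array edge and number of delays."""
    return {name: (dot(p, e), dot(s, e)) for name, e in edges.items()}


FIR_EDGES = {"weight": (1, 0), "input": (0, 1), "result": (1, -1)}

d, p, s = (1, 0), (0, 1), (1, 0)
print(hue(d, p, s))
print(map_edges(FIR_EDGES, p, s))
```

An array edge of 0 means the value stays inside its PE, and 0 delays means the value is broadcast within one time step.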

Array Architecture Design

• Step 1: mapping the algorithm to a DG
  – Based on the space-time indices in the recursive algorithm
  – Shift-invariance (homogeneity) of the DG
  – Localization of the DG: broadcast vs. transmitted data
• Step 2: mapping the DG to an SFG
  – Processor assignment: a projection method may be applied (projection vector d)
  – Scheduling: a permissible linear schedule may be applied (schedule vector s)
    • Preserve the inter-dependence
    • Nodes on an equitemporal hyperplane should not be projected to the same PE
• Step 3: mapping the SFG onto an array processor

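Example: FIR Filter

[Figure: DG of a 5-tap FIR filter with inputs x(0) … x(6) along the n axis, coefficients h(0) … h(4) along the k axis and outputs y(0) … y(6), showing the projection vector d, the schedule vector s and the equitemporal hyperplanes. Below it, the resulting SFG with delays D and 2D, input stream x(0), x(1), x(2), … and output stream y(0), y(1), y(2), …]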

Space time transformation

• The space representation or DG can be transformed to a space-time representation by interpreting one of the spatial dimensions as a temporal dimension.
• For a two-dimensional (2D) DG, the general transformation is described by $i' = t = 0$, $j' = p^T I$ and $t' = s^T I$, or equivalently

\[
\begin{pmatrix} i' \\ j' \\ t' \end{pmatrix}
= T \begin{pmatrix} i \\ j \\ t \end{pmatrix}
= \begin{pmatrix} 0 \;\; 0 & 1 \\ p^T & 0 \\ s^T & 0 \end{pmatrix}
\begin{pmatrix} i \\ j \\ t \end{pmatrix}
\]

• In the space-time representation, the j' axis represents the processor axis and t' represents the scheduling time instance.
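As a small illustration of the transformation, the sketch below maps every node (i, j) of a DG to its (processor, time) slot and lists slots that are claimed by more than one node. The function names, the 5×3 node grid and the two vector choices are assumptions made for the example.

```python
from collections import Counter


def space_time_map(nodes, p, s):
    """Map each node (i, j) to (j', t') = (p^T I, s^T I)."""
    return {(i, j): (p[0] * i + p[1] * j, s[0] * i + s[1] * j) for (i, j) in nodes}


def conflicts(mapping):
    """Return the (processor, time) slots that more than one node is mapped to."""
    counts = Counter(mapping.values())
    return sorted(slot for slot, n in counts.items() if n > 1)


# Hypothetical DG: 5 index points along i, 3 along j
nodes = [(i, j) for i in range(5) for j in range(3)]

good = space_time_map(nodes, p=(0, 1), s=(1, 0))  # j' = j, t' = i
bad = space_time_map(nodes, p=(0, 1), s=(0, 1))   # j' = j, t' = j

print(good[(3, 2)])     # node (3, 2) runs on processor 2 at time 3
print(conflicts(good))  # no slot is used twice
print(conflicts(bad))   # every processor gets all its nodes in one time step
```

The second choice fails because its schedule vector is orthogonal to the projection direction (1, 0), so nodes projected onto the same processor all receive the same time.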

FIR Systolic Array (Design B1)

• The B1 design is derived by selecting the projection vector, processor vector and scheduling vector as follows:

\[ d = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad p^T = (0 \;\; 1), \quad s^T = (1 \;\; 0) \]

• Any node with index $I^T = (i, j)$ is mapped to processor $p^T I = (0 \;\; 1)(i \;\; j)^T = j$
• Therefore all nodes on a horizontal line are mapped to the same processor
• Any node with index $I^T = (i, j)$ is executed at time $s^T I = (1 \;\; 0)(i \;\; j)^T = i$
• Since $s^T d = (1 \;\; 0)(1 \;\; 0)^T = 1$, then HUE = $1/|s^T d|$ = 1
• Edge mapping

    e^T                  p^T e   s^T e
    Weight (wt (1 0))      0       1
    Input (i/p (0 1))      1       0
    Result (1 -1)         -1       1
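The edge mapping says that in B1 the weights stay in their PE (p^T e = 0), the input is broadcast to all PEs in the same time step (s^T e = 0), and each partial result moves one PE towards lower index with one delay. A cycle-level sketch of that behaviour follows; the function name and the list-of-registers model are illustrative assumptions, not the lecture's implementation.

```python
def fir_b1(x, w):
    """Simulate a B1-style array: PE j holds weight w[j]; regs[j] is the
    partial result that PE j produced in the previous cycle."""
    taps = len(w)
    regs = [0] * (taps + 1)  # regs[taps] is the constant 0 fed into the last PE
    y = []
    for sample in x:
        # Every PE sees the same broadcast sample and adds its product to the
        # one-cycle-old partial result of its higher-index neighbour.
        regs = [regs[j + 1] + w[j] * sample for j in range(taps)] + [0]
        y.append(regs[0])  # PE 0 delivers the finished output
    return y


def fir_direct(x, w):
    """Reference: y(n) = sum_j w[j] * x(n - j), with x(n) = 0 for n < 0."""
    return [sum(w[j] * x[n - j] for j in range(len(w)) if n - j >= 0)
            for n in range(len(x))]


x = [1, 2, 3, 4]
w = [1, 10, 100]
print(fir_b1(x, w))
print(fir_b1(x, w) == fir_direct(x, w))
```

The list comprehension reads the old register contents before `regs` is rebound, which models all registers being clocked simultaneously.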

B1 design

[Figure: block diagram and low-level implementation of the B1 design along the processor axis (processors 0, 1 and 2, input x(n), result path through delays D), and the space-time representation of the B1 design with inputs x(0) … x(4), outputs y(0) … y(4), the time axis (ticks 0 to 4) and the processor axis j' = j (ticks 0 to 2).]

FIR Systolic Array (Design B2)

• The B2 design is derived by selecting the projection vector, processor vector and scheduling vector as follows:

\[ d = \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \quad p^T = (1 \;\; 1), \quad s^T = (1 \;\; 0) \]

• Any node with index $I^T = (i, j)$ is mapped to processor $p^T I = (1 \;\; 1)(i \;\; j)^T = i + j$
• Any node with index $I^T = (i, j)$ is executed at time $s^T I = (1 \;\; 0)(i \;\; j)^T = i$
• Since $s^T d = (1 \;\; 0)(1 \;\; {-1})^T = 1$, then HUE = $1/|s^T d|$ = 1
• Edge mapping

    e^T                  p^T e   s^T e
    Weight (wt (1 0))      1       1
    Input (i/p (0 1))      1       0
    Result (1 -1)          0       1

• Weights move instead of the results as in B1. Inputs are broadcast.

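B2 design

[Figure: block diagram and low-level implementation of the B2 design along the processor axis (processors 0, 1 and 2, input stream …, x(3), x(2), x(1), x(0), weights moving through delays D, results accumulated in place), and the space-time representation of the B2 design with inputs x(0) … x(4), outputs y(0) … y(3), time axis t' = i and processor axis j' = i + j.]

Applying space-time transformation

\[ j' = p^T \begin{pmatrix} i \\ j \end{pmatrix} = i + j, \qquad t' = s^T \begin{pmatrix} i \\ j \end{pmatrix} = i \]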

FIR Systolic Array (Design F)

• The F design is derived by selecting the projection vector, processor vector and scheduling vector as follows:

\[ d = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad p^T = (0 \;\; 1), \quad s^T = (1 \;\; 1) \]

• Since $s^T d = (1 \;\; 1)(1 \;\; 0)^T = 1$, then HUE = $1/|s^T d|$ = 1
• Edge mapping

    e^T                  p^T e   s^T e
    Weight (wt (1 0))      0       1
    Input (i/p (0 1))      1       1
    Result (1 -1)         -1       0

• Weights are fixed in space; the input vector moves from left to right with 1 delay; the output moves from right to left with no delay elements.

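F design

[Figure: block diagram and low-level implementation of the F design along the processor axis (weights held in processors 0, 1 and 2, input x(n) moving through delays D, result path without delays), and the space-time representation of the F design with inputs x(0), x(1), x(2), outputs y(0) … y(3), time axis t' = i + j and processor axis j' = j.]

Applying space-time transformation

\[ j' = p^T \begin{pmatrix} i \\ j \end{pmatrix} = j, \qquad t' = s^T \begin{pmatrix} i \\ j \end{pmatrix} = i + j \]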

Selection of Scheduling Vector

• Finding feasible scheduling vectors using scheduling inequalities.
• Based on the selected scheduling vector $s^T$, the projection vector d and the processor space vector $p^T$ can be found from

\[ s^T d \ne 0 \quad \text{and} \quad p^T d = 0 \]

• Scheduling inequalities:
  – Consider the dependency relation X → Y, where $I_x = (i_x, j_x)^T$ and $I_y = (i_y, j_y)^T$ are the indices of nodes X and Y.
  – We have

\[ S_y \ge S_x + T_x \]

    where $T_x$ is the time to compute node X and $S_x$, $S_y$ are the scheduling times for nodes X and Y.

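Scheduling Equations

• Linear scheduling

\[ S_x = s^T I_x = (s_1 \;\; s_2) \begin{pmatrix} i_x \\ j_x \end{pmatrix}, \qquad S_y = s^T I_y = (s_1 \;\; s_2) \begin{pmatrix} i_y \\ j_y \end{pmatrix} \]

• Affine scheduling (a transformation followed by a translation)

\[ S_x = s^T I_x + \gamma_x = (s_1 \;\; s_2) \begin{pmatrix} i_x \\ j_x \end{pmatrix} + \gamma_x, \qquad S_y = s^T I_y + \gamma_y = (s_1 \;\; s_2) \begin{pmatrix} i_y \\ j_y \end{pmatrix} + \gamma_y \]

  and we have

\[ s^T I_y + \gamma_y \ge s^T I_x + \gamma_x + T_x \]

• Defining the edge from X → Y as $e_{x-y} = I_y - I_x$, the scheduling inequality for an edge is:

\[ s^T e_{x-y} + \gamma_y - \gamma_x \ge T_x \]

• The scheduling vector $s^T$ can be obtained by solving these inequalities.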

Scheduling Inequalities

• Two steps for selecting the scheduling vectors:
  – Capture all the fundamental edges. The reduced dependence graph (RDG) is used to capture the fundamental edges, and the regular iterative algorithm (RIA) description of the corresponding problem is used to construct the RDG.
  – Construct the scheduling inequalities and solve them for a feasible $s^T$.

Regular Iterative Algorithm (RIA)

• Standard input RIA form
  – If the indices of the inputs are the same for all equations.
• Standard output RIA form
  – If the output indices, i.e. the indices on the left side, are the same for all equations.
• E.g. for the FIR filter, the RIA description is
  – W(i+1,j) = W(i,j)
  – X(i,j+1) = X(i,j)
  – Y(i+1,j-1) = Y(i,j) + W(i+1,j-1)*X(i+1,j-1)
• It cannot be expressed in standard input RIA form, but it can be expressed in standard output RIA form
  – W(i,j) = W(i-1,j)
  – X(i,j) = X(i,j-1)
  – Y(i,j) = Y(i-1,j+1) + W(i,j)*X(i,j)

[Figure: reduced DG with nodes W, X and Y. Self-loops W → W with edge (1,0), X → X with edge (0,1) and Y → Y with edge (1,-1); edges W → Y and X → Y are both (0,0).]

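Example

• $T_{mult}$ = 5, $T_{add}$ = 2, $T_{com}$ = 1
• The scheduling inequality for each edge is $s^T e_{x-y} + \gamma_y - \gamma_x \ge T_x$, where $s = (s_1, s_2)^T$
• We have 5 edges in the RDG and so

\[
\begin{aligned}
W \to Y &: \; e = (0, 0)^T, & \gamma_y - \gamma_w &\ge 0 \\
X \to X &: \; e = (0, 1)^T, & s_2 + \gamma_x - \gamma_x &\ge 1 \\
W \to W &: \; e = (1, 0)^T, & s_1 + \gamma_w - \gamma_w &\ge 1 \\
X \to Y &: \; e = (0, 0)^T, & \gamma_y - \gamma_x &\ge 0 \\
Y \to Y &: \; e = (1, -1)^T, & s_1 - s_2 + \gamma_y - \gamma_y &\ge 5 + 2 + 1
\end{aligned}
\]

• For linear scheduling, $\gamma_x = \gamma_y = \gamma_w = 0$
• We have $s_1 \ge 1$, $s_2 \ge 1$, $s_1 - s_2 \ge 8$. One solution is $s_2 = 1$, $s_1 = 8 + 1 = 9$, so $s^T = (9, 1)$
• Select $d = (1, -1)^T$ such that $s^T d = 8 \ne 0$, and select $p^T = (1, 1)$ so that $p^T d = 0$. HUE = 1/8
• Edge mapping

    e^T                  p^T e   s^T e
    Weight (wt (1 0))      1       9
    Input (i/p (0 1))      1       1
    Result (1 -1)          0       8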
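For small problems the scheduling inequalities can simply be enumerated. The sketch below is a brute-force search, not a method from the lecture: it assumes linear scheduling (all translation terms zero), searches only non-negative integer coefficients up to an arbitrary bound, and uses the edge requirements of this example (1 for the input and weight edges, 5 + 2 + 1 = 8 for the result edge).

```python
from itertools import product


def feasible_schedules(edge_delays, max_coeff):
    """Return all s = (s1, s2) with 0 <= s1, s2 <= max_coeff such that
    s^T e >= T holds for every (e, T) pair."""
    found = []
    for s in product(range(max_coeff + 1), repeat=2):
        if all(s[0] * e[0] + s[1] * e[1] >= t for e, t in edge_delays):
            found.append(s)
    return found


# (edge vector, required delay) for the FIR RDG with T_mult=5, T_add=2, T_com=1
FIR_EDGE_DELAYS = [
    ((0, 0), 0),   # W -> Y
    ((0, 1), 1),   # X -> X
    ((1, 0), 1),   # W -> W
    ((0, 0), 0),   # X -> Y
    ((1, -1), 8),  # Y -> Y
]

schedules = feasible_schedules(FIR_EDGE_DELAYS, max_coeff=10)
best = min(schedules, key=lambda s: (s[0] + s[1], s))
print(schedules)
print(best)
```

Ranking by the sum of the coefficients is only one possible tie-break; a real design would rank candidates by latency or by the resulting HUE.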

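[Figure: resulting systolic array with three PEs. Results accumulate in place through 8D delays, inputs X move between neighbouring PEs through single delays D, and weights W move between neighbouring PEs through 9D delays.]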

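Matrix-Matrix Multiplication

• The DG is a three-dimensional (3D) space representation
• Linear projection is used to design 2D systolic arrays
• Given 2 matrices A and B (e.g. of dimension 2×2) and C = AB,

\[
\begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}
= \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}
\]

\[
\begin{aligned}
c_{11} &= a_{11} b_{11} + a_{12} b_{21} \\
c_{12} &= a_{11} b_{12} + a_{12} b_{22} \\
c_{21} &= a_{21} b_{11} + a_{22} b_{21} \\
c_{22} &= a_{21} b_{12} + a_{22} b_{22}
\end{aligned}
\]

• RIA description

\[
\begin{aligned}
a(i, j+1, k) &= a(i, j, k) \\
b(i+1, j, k) &= b(i, j, k) \\
c(i, j, k+1) &= c(i, j, k) + a(i, j, k+1)\, b(i, j, k+1)
\end{aligned}
\]

• Standard output RIA form

\[
\begin{aligned}
a(i, j, k) &= a(i, j-1, k) \\
b(i, j, k) &= b(i-1, j, k) \\
c(i, j, k) &= c(i, j, k-1) + a(i, j, k)\, b(i, j, k)
\end{aligned}
\]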

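Matrix-Matrix Multiplication Example

\[
\begin{aligned}
a(i, j, k) &= a(i, j-1, k) \\
b(i, j, k) &= b(i-1, j, k) \\
c(i, j, k) &= c(i, j, k-1) + a(i, j, k)\, b(i, j, k)
\end{aligned}
\]

[Figure: RDG of the matrix multiplication example with nodes a, b and c. Self-loops a → a with edge (0,1,0), b → b with edge (1,0,0) and c → c with edge (0,0,1); edges a → c and b → c are both (0,0,0).]

• Let $T_{mult\text{-}add}$ = 1 and $T_{com}$ = 0; we have

\[
\begin{aligned}
a \to a &: \; e = (0, 1, 0)^T, & s_2 + \gamma_a - \gamma_a &\ge 0 \\
b \to b &: \; e = (1, 0, 0)^T, & s_1 + \gamma_b - \gamma_b &\ge 0 \\
c \to c &: \; e = (0, 0, 1)^T, & s_3 + \gamma_c - \gamma_c &\ge 1 \\
a \to c &: \; e = (0, 0, 0)^T, & \gamma_c - \gamma_a &\ge 0 \\
b \to c &: \; e = (0, 0, 0)^T, & \gamma_c - \gamma_b &\ge 0
\end{aligned}
\]

• For linear scheduling, $\gamma_a = \gamma_b = \gamma_c = 0$
• Solution 1:

\[
s^T = (1 \;\; 1 \;\; 1), \quad d = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}, \quad
p^T = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}
\]

\[
p^T d = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad
s^T d = (1 \;\; 1 \;\; 1) \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = 1
\]

• HUE = 1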

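Solution 1

Edge mapping

    e^T           p^T e    s^T e
    a (0,1,0)     (0,1)      1
    b (1,0,0)     (1,0)      1
    c (0,0,1)     (0,0)      1

[Figure: 2-dimensional systolic array by S. Y. Kung. A 3×3 grid of PEs along the i and j axes; each PE accumulates its own c through a delay D, while a and b are passed between neighbouring PEs through delays D.]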

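Solution 2

\[
s^T = (1 \;\; 1 \;\; 1), \quad d = \begin{pmatrix} 1 \\ 1 \\ -1 \end{pmatrix}, \quad
p^T = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}
\]

\[
p^T d = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ -1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad
s^T d = (1 \;\; 1 \;\; 1) \begin{pmatrix} 1 \\ 1 \\ -1 \end{pmatrix} = 1, \qquad \text{HUE} = 1
\]

Edge mapping

    e^T           p^T e    s^T e
    a (0,1,0)     (0,1)      1
    b (1,0,0)     (1,0)      1
    c (0,0,1)     (1,1)      1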
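Both solutions can be checked the same way as in the 2D case, with the processor space vector replaced by a 2×3 matrix. The sketch below is illustrative: the function names are assumptions, index points run over 1..n in each dimension, and n = 3 is an arbitrary size chosen for the check.

```python
from itertools import product


def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(P, v):
    """Apply the 2x3 processor space matrix P to a 3D index or edge vector."""
    return tuple(dot(row, v) for row in P)


EDGES = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}


def analyse(P, s, d, n):
    """Check feasibility, map the RDG edges, and map all n^3 DG nodes."""
    if project(P, d) != (0, 0) or dot(s, d) == 0:
        raise ValueError("infeasible choice of P, s and d")
    nodes = list(product(range(1, n + 1), repeat=3))
    slots = {(project(P, I), dot(s, I)) for I in nodes}
    return {
        "spacing": abs(dot(s, d)),  # HUE = 1 / spacing
        "edges": {name: (project(P, e), dot(s, e)) for name, e in EDGES.items()},
        "conflict_free": len(slots) == len(nodes),
        "num_pes": len({project(P, I) for I in nodes}),
    }


sol1 = analyse(P=((1, 0, 0), (0, 1, 0)), s=(1, 1, 1), d=(0, 0, 1), n=3)
sol2 = analyse(P=((1, 0, 1), (0, 1, 1)), s=(1, 1, 1), d=(1, 1, -1), n=3)
print(sol1)
print(sol2)
```

Both choices give a spacing of 1, but for the same n the second projection spreads the nodes over more PEs, because its projection direction is not aligned with an axis of the index space.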

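[Figure: 2-dimensional systolic array for Solution 2 along the i and j axes. a, b and c all move between PEs through delays D, with c travelling along the diagonal.]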
