ELEC692 VLSI Signal Processing Architecture, Lecture 6: Array Processor Structure
Introduction
- Regular and recursive algorithms are common in DSP applications.
  - They repeatedly apply an identical series of pre-determined operations, e.g. matrix operations.
- Regular architectures are based on identical processing elements (PEs).
  - They utilize massive parallel processing and intensive pipelining.
- Systems with programmable processors: multiprocessor systems.
- Systems with application-specific processors: array processors.
  - Globally clocked, synchronized arrays: systolic arrays.
  - Self-timed arrays with asynchronous data transfer: wavefront arrays.
- Issues
  - How does the array processor design depend on the algorithm?
  - How is the algorithm best implemented on the array processor?
What is a systolic array?
- A computational network with distributed data storage and a distributed arrangement of processing elements, so that a large number of operations are carried out simultaneously.
- Multiple PEs maximize the processing done per memory access.
- [Figure: a memory with 10 ns access time feeding a single PE (conventional: 100 MOPS, million operations per second) versus the same memory feeding several PEs (array processor: 400 MOPS).]
Characteristics of array processors
- Parallelism: in both data operations and data transfers.
- Locality: connections exist only between directly neighbouring PEs.
- Regularity and modularity: in both the computation and the communication structures.
- Processing elements can be simple (e.g. a single addition/multiplication) or complex.
Why call it a systolic array?
- Analogy to the circulatory system with its two phases, systole and diastole.
- Systems that are entirely pipelined and have periodic computation and data-transfer cycles.
- Synchronous clocking and control signals.
Array structure examples
[Figure: PE arrangements as a 1D array, a 2D array and a 3D array.]
Drawbacks of array processing
- Not all algorithms can be mapped onto and implemented with a systolic array architecture.
  - The operations and the operands to be processed must be fixed prior to run time.
  - Adaptive algorithms, in which the particular operations and operands depend on the data being processed, cannot be used.
- The cost in hardware and area is high.
- There is a cost in latency.
Data dependency
- Express the algorithm in terms of its inherent dependencies.
- Use single-assignment code and local data dependencies.
- E.g. matrix multiplication in its usual form:
    FOR i := 1 TO n DO
      FOR j := 1 TO n DO
      BEGIN
        c(i,j) := 0;
        FOR k := 1 TO n DO
          c(i,j) := c(i,j) + a(i,k) * b(k,j);
      END;
- The same algorithm in single-assignment form with local data dependencies:
    FOR i := 1 TO n DO
      FOR j := 1 TO n DO
      BEGIN
        c(i,j,0) := 0;
        FOR k := 1 TO n DO
        BEGIN
          IF i = 1 THEN b(1,j,k) := b_in(k,j) ELSE b(i,j,k) := b(i-1,j,k);
          IF j = 1 THEN a(i,1,k) := a_in(i,k) ELSE a(i,j,k) := a(i,j-1,k);
          c(i,j,k) := c(i,j,k-1) + a(i,j,k) * b(i,j,k);
        END;
        c_out(i,j) := c(i,j,n);
      END;
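The single-assignment recurrence can be checked with a short script. The following is a minimal Python sketch, not part of the original slides: the function name `matmul_single_assignment` and the use of nested lists are assumptions made for illustration. Every element of `a`, `b` and `c` is written exactly once, and each value is taken only from a neighbouring index point, which is what makes the dependencies local.

```python
def matmul_single_assignment(a_in, b_in):
    """Multiply two n x n matrices using the single-assignment recurrence.

    Hypothetical helper for illustration; indices i, j, k run from 1 to n
    as in the pseudocode, so index 0 of each array dimension is padding
    (except c[i][j][0], which holds the initial value 0).
    """
    n = len(a_in)
    size = n + 1
    a = [[[None] * size for _ in range(size)] for _ in range(size)]
    b = [[[None] * size for _ in range(size)] for _ in range(size)]
    c = [[[None] * size for _ in range(size)] for _ in range(size)]
    c_out = [[0] * n for _ in range(n)]

    for i in range(1, n + 1):
        for j in range(1, n + 1):
            c[i][j][0] = 0
            for k in range(1, n + 1):
                # b values enter at row i = 1 and are passed on along i.
                b[i][j][k] = b_in[k - 1][j - 1] if i == 1 else b[i - 1][j][k]
                # a values enter at column j = 1 and are passed on along j.
                a[i][j][k] = a_in[i - 1][k - 1] if j == 1 else a[i][j - 1][k]
                # Partial sums are accumulated along k.
                c[i][j][k] = c[i][j][k - 1] + a[i][j][k] * b[i][j][k]
            c_out[i - 1][j - 1] = c[i][j][n]
    return c_out
```

Calling `matmul_single_assignment([[1, 2], [3, 4]], [[5, 6], [7, 8]])` gives the same product as the conventional triple loop.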
Data dependency graph (DG)
- A graph that specifies the data dependencies in an algorithm.
- E.g. the DG for the matrix multiplication.
DG of matrix-vector multiplication
Systolic Array Fundamentals
- Systolic architectures are designed by applying linear mapping techniques to a regular dependency graph (DG).
- Regular dependency graph: the presence of an edge in a certain direction at any node of the DG implies the presence of an edge in the same direction at all nodes of the DG.
- The DG corresponds to a space representation: no time instance is assigned to any computation.
- Systolic architectures have a space-time representation, in which each node is mapped to a certain PE and is scheduled at a particular time instance.
- The systolic design methodology maps an N-dimensional DG to a lower-dimensional systolic architecture.
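Example of DG
- Space representation for the FIR filter $y(n) = w_0 x(n) + w_1 x(n-1) + w_2 x(n-2)$.
- [Figure: DG with inputs x(0)...x(5) and outputs y(0)...y(5) along the horizontal axis (0 to 5) and weights w_0, w_1, w_2 along the vertical axis. Edge directions: $(0,1)^T$ (input), $(1,0)^T$ (weight) and $(1,-1)^T$ (result). Axis labels: processor axis (j' = j) and time axis.]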
Linear Mapping Methods
- Design an application-specific array processor architecture for a given algorithm.
  - Satisfy the demands regarding computational performance and throughput.
  - Minimize hardware cost.
  - Regular structure for VLSI implementation.
- Design flow: algorithm → dependency graph → signal flow graph → architecture.
- Assignment of the operations to processors and to points in time (assignment and scheduling).
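Estimation of # of PE
- The # of PEs, $n_{PE}$, is generally smaller than the # of nodes in the dependency graph, $n_{DG}$.
- It is important to know both, since together they specify how many nodes of the DG must be mapped to one PE.
- The processing time of a PE is $T_{PE}$ (including the transfer time of the register). Within $T_{PE}$, each PE carries out a total of $n_{OP/PE}$ operations.
- The computational rate of the processor array is then given by $R_C = n_{PE} \cdot n_{OP/PE} / T_{PE}$.
- With the desired throughput $R_{T,target}$ and the # of operations per sample $n_{OP/SAMPLE}$, the array must deliver $R_{T,target} \cdot n_{OP/SAMPLE}$ operations per second, so we have $n_{PE} = R_{T,target} \cdot n_{OP/SAMPLE} \cdot T_{PE} / n_{OP/PE}$.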
Estimation of # of PE
- The # of PEs can be further reduced through pipelining within the PEs.
- Given $n_p$ pipeline stages along the datapath of the PE, the new $T_{PE}$ is correspondingly shorter.
- Example
  - Samples of a colour TV signal (27 MHz sampling rate) are to be formatted into 8×8 blocks, and each block is to be multiplied by an 8×8 matrix.
  - Samples and matrix coefficients are 8 bits wide.
  - The accumulated product is 19 bits wide.
  - A PE contains 1 MUL and 1 ADD, and a register at its output.
Example of Estimation of # of PE
- $n_{OP/SAMPLE} = 2 \cdot 8^3 / 8^2 = 16$ ($2 \cdot 8^3$ operations for $8^2$ samples).
- With intensive pipelining (assume L = 50 ps).
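The arithmetic of this example can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that the number of PEs follows from balancing the required operation rate (sample rate times operations per sample) against the rate one PE delivers ($n_{OP/PE}$ operations every $T_{PE}$). The function names are hypothetical, and the 10 ns value used for $T_{PE}$ below is a placeholder, not a figure from the lecture.

```python
import math


def ops_per_sample(block_size):
    """Operations per sample for a block_size x block_size matrix product.

    The product needs block_size**3 multiplications and as many additions,
    spread over block_size**2 samples.
    """
    total_ops = 2 * block_size ** 3
    samples = block_size ** 2
    return total_ops // samples


def required_pe_count(sample_rate_hz, n_op_per_sample, t_pe_s, n_op_per_pe):
    """Smallest integer number of PEs that sustains the target throughput."""
    required_ops_per_second = sample_rate_hz * n_op_per_sample
    ops_per_second_per_pe = n_op_per_pe / t_pe_s
    return math.ceil(required_ops_per_second / ops_per_second_per_pe)


n_ops = ops_per_sample(8)            # 2 * 8**3 / 8**2
required_rate = 27_000_000 * n_ops   # operations per second at 27 MHz

# One MUL and one ADD per PE gives n_op_per_pe = 2.
# T_PE = 10 ns is an assumed placeholder value.
n_pe = required_pe_count(27_000_000, n_ops, 10e-9, 2)
```

With 16 operations per sample the array has to sustain 432 million operations per second. A shorter $T_{PE}$, for example through pipelining inside the PE, lowers the PE count returned by `required_pe_count`.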
Some definitions
- Projection vector (also called iteration vector) $d$: two nodes that are displaced by $d$ or by multiples of $d$ are executed by the same processor.
- Scheduling vector $s^T = (s_1, s_2)$: any node with index $I$ is executed at time $s^T I$.
- Processor space vector $p^T = (p_1, p_2)$: any node with index $I^T = (i, j)$ is executed by processor $p^T I = p_1 i + p_2 j$.
- Hardware utilization efficiency: $HUE = 1/|s^T d|$. This is because two tasks executed by the same processor are spaced $|s^T d|$ time units apart.
Systolic Array Design Methodology
- Systolic architectures are designed by selecting different projection, processor space and scheduling vectors.
- Feasibility constraints
  - The processor space vector and the projection vector must be orthogonal to each other. If points A and B differ by the projection vector, i.e. $I_A - I_B = d$, then they must be executed by the same processor, i.e. $p^T I_A = p^T I_B$, and hence $p^T d = 0$.
  - If A and B are mapped to the same processor, then they cannot be executed at the same time, i.e. $s^T I_A \neq s^T I_B$, and hence $s^T d \neq 0$.
- Edge mapping: if an edge $e$ exists in the space representation or DG, then an edge $p^T e$ is introduced in the systolic array with $s^T e$ delays.
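The feasibility constraints and the edge mapping reduce to a handful of dot products, so they are easy to check mechanically. The sketch below is an illustration under stated assumptions: the helper names `dot` and `check_design` are hypothetical, and the example vectors $d^T = (1, 0)$, $p^T = (0, 1)$, $s^T = (1, 0)$ are chosen here only because they are consistent with the FIR edge directions (weight $(1, 0)$, input $(0, 1)$, result $(1, -1)$).

```python
def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(x * y for x, y in zip(u, v))


def check_design(d, p, s, edges):
    """Validate (d, p, s) and map every DG edge to the array.

    Returns a dict {edge name: (p.e, s.e)} and the hardware utilization
    efficiency 1/|s.d|. Raises ValueError if a constraint is violated.
    """
    if dot(p, d) != 0:
        raise ValueError("p and d must be orthogonal (p.d = 0)")
    if dot(s, d) == 0:
        raise ValueError("nodes on one PE would collide in time (s.d = 0)")
    mapping = {name: (dot(p, e), dot(s, e)) for name, e in edges.items()}
    hue = 1 / abs(dot(s, d))
    return mapping, hue


# FIR dependency-graph edges as direction vectors.
fir_edges = {"weight": (1, 0), "input": (0, 1), "result": (1, -1)}

# Example choice of vectors (an assumption for this sketch).
mapping, hue = check_design(d=(1, 0), p=(0, 1), s=(1, 0), edges=fir_edges)
```

Each entry of `mapping` reads as (displacement between PEs, number of delays). A displacement of 0 means the quantity stays inside its PE, and 0 delays means it is broadcast.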
Array Architecture Design
- Step 1: mapping the algorithm to a DG
  - Based on the space-time indices in the recursive algorithm
  - Shift-invariance (homogeneity) of the DG
  - Localization of the DG: broadcast vs. transmitted data
- Step 2: mapping the DG to an SFG
  - Processor assignment: a projection method may be applied (projection vector d)
  - Scheduling: a permissible linear schedule may be applied (schedule vector s)
    - Preserve the inter-dependence
    - Nodes on an equitemporal hyperplane should not be projected to the same PE
- Step 3: mapping the SFG onto an array processor
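Example: FIR Filter
[Figure: DG of the FIR filter with inputs x(0)...x(6), coefficients h(0)...h(4) and outputs y(0)...y(6) on axes n and k, showing the projection vector d, the schedule vector s and the equitemporal hyperplanes; and the resulting SFG with delay elements D, inputs x(0), x(1), x(2) and outputs y(0), y(1), y(2).]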
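Space-time transformation
- The space representation or DG can be transformed to a space-time representation by interpreting one of the spatial dimensions as the temporal dimension.
- For a two-dimensional (2D) DG, the general transformation is described by $i' = 0$, $j' = p^T I$ and $t' = s^T I$, or equivalently
  $$\begin{bmatrix} i' \\ j' \\ t' \end{bmatrix} = T \begin{bmatrix} i \\ j \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ p_1 & p_2 \\ s_1 & s_2 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$$
- In the space-time representation, the $j'$ axis represents the processor axis and $t'$ represents the scheduling time instance.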
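As a rough illustration of the transformation, the sketch below places every node of a small FIR dependency graph at a (processor, time) slot and checks that no two nodes land in the same slot. The helper names and the 6 × 3 index range are assumptions made for this example. The vectors $p^T = (1, 1)$, $s^T = (1, 0)$ are used because they give $j' = i + j$ and $t' = i$.

```python
def space_time_transform(nodes, p, s):
    """Map each DG node (i, j) to (processor, time) = (p.I, s.I)."""
    return {
        (i, j): (p[0] * i + p[1] * j, s[0] * i + s[1] * j)
        for (i, j) in nodes
    }


def has_conflict(placement):
    """True if two different nodes share the same (processor, time) slot."""
    slots = list(placement.values())
    return len(slots) != len(set(slots))


# Assumed index range: i = 0..5 (input samples), j = 0..2 (three weights).
nodes = [(i, j) for i in range(6) for j in range(3)]

placement = space_time_transform(nodes, p=(1, 1), s=(1, 0))
conflict_free = not has_conflict(placement)

# A deliberately bad choice: processor and time both equal j, so all nodes
# with the same j collide.
bad_placement = space_time_transform(nodes, p=(0, 1), s=(0, 1))
```

The conflict check is the node-level counterpart of the condition that two nodes mapped to the same processor must not be scheduled at the same time.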
FIR Systolic Array (Design B1)
- The B1 design is derived by selecting the projection vector, processor space vector and scheduling vector as follows: $d^T = (1, 0)$, $p^T = (0, 1)$, $s^T = (1, 0)$.
- Any node with index $I^T = (i, j)$ is mapped to processor $p^T I = j$. Therefore all nodes on a horizontal line are mapped to the same processor.
- Any node with index $I^T = (i, j)$ is executed at time $s^T I = i$.
- Since $s^T d = 1$, $HUE = 1/|s^T d| = 1$.
- Edge mapping:
  e^T             | p^T e | s^T e
  Weight (1, 0)   |   0   |   1
  Input (0, 1)    |   1   |   0
  Result (1, -1)  |  -1   |   1
B1 design
[Figure: block diagram of the B1 array along the processor axis with delay elements D, input and result paths; low-level implementation with weights w_0, w_1, w_2, input x(n) and result output; space-time representation of the B1 design with time axis 0 to 4, processor axis j' = j, inputs x(0)...x(4), outputs y(0)...y(4) and weights w_0, w_1, w_2.]
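The behaviour of the B1 array can be imitated cycle by cycle. The following Python sketch assumes the mapping processor = j, time = i, with weights staying in their PEs, the input broadcast to all PEs without delay, and partial results moving from PE j to PE j-1 through one register. The function names and the exact register placement are assumptions of this sketch, not a transcription of the slide's block diagram.

```python
def fir_b1(weights, samples):
    """Cycle-level sketch of a B1-style FIR array.

    PE j keeps weight w_j. In every cycle the current input is broadcast,
    and each PE adds w_j * x to the value stored in the register of PE j+1.
    """
    n_pe = len(weights)
    regs = [0] * n_pe          # regs[j]: value PE j produced in the last cycle
    outputs = []
    for x in samples:          # one loop iteration corresponds to one time step
        new_regs = [0] * n_pe
        for j in range(n_pe):
            incoming = regs[j + 1] if j + 1 < n_pe else 0
            new_regs[j] = incoming + weights[j] * x
        outputs.append(new_regs[0])   # PE 0 delivers the finished y(t)
        regs = new_regs
    return outputs


def fir_direct(weights, samples):
    """Reference: y(n) = sum over j of w_j * x(n - j), with x(n) = 0 for n < 0."""
    return [
        sum(w * samples[n - j] for j, w in enumerate(weights) if n - j >= 0)
        for n in range(len(samples))
    ]
```

Feeding the same weights and samples to `fir_b1` and `fir_direct` should give identical output sequences, which is a quick sanity check on the assumed data movement.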
FIR Systolic Array (Design B2)
- The B2 design is derived by selecting the projection vector, processor space vector and scheduling vector as follows: $d^T = (1, -1)$, $p^T = (1, 1)$, $s^T = (1, 0)$.
- Any node with index $I^T = (i, j)$ is mapped to processor $p^T I = i + j$.
- Any node with index $I^T = (i, j)$ is executed at time $s^T I = i$.
- Since $s^T d = 1$, $HUE = 1/|s^T d| = 1$.
- Edge mapping:
  e^T             | p^T e | s^T e
  Weight (1, 0)   |   1   |   1
  Input (0, 1)    |   1   |   0
  Result (1, -1)  |   0   |   1
- Unlike B1, where the results move, here the weights move and the results stay. Inputs are broadcast.
B2 design
[Figure: block diagram of the B2 array along the processor axis with delay elements D, input and result paths; low-level implementation with weights w_0, w_1, w_2 and input x(n) fed in the order x(0), x(1), x(2), x(3); space-time representation of the B2 design with time axis t' = i (0 to 4), processor axis j' = i + j, inputs x(0)...x(4), outputs y(0)...y(3) and weights w_0, w_1, w_2, obtained by applying the space-time transformation.]
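A schedule-level sketch of the B2 design is given below. It assumes the mapping processor = i + j and time = i, so that at time t the broadcast input x(t) meets weight w_j in PE t + j, and each PE p accumulates its own output y(p) in place. The function name `fir_b2` is hypothetical, and the weight movement is modelled implicitly through the index t + j rather than with explicit registers.

```python
def fir_b2(weights, samples):
    """Schedule-level sketch of a B2-style FIR array.

    acc[p] is the accumulator inside PE p and ends up holding y(p).
    """
    n_out = len(samples)
    acc = [0] * n_out
    for t, x in enumerate(samples):        # time step t: x(t) is broadcast
        for j, w in enumerate(weights):
            p = t + j                      # weight w_j sits in PE t + j at time t
            if p < n_out:                  # ignore PEs beyond the output range
                acc[p] += w * x
    return acc
```

Because every output stays in one PE, the array needs as many accumulating PEs as there are outputs in flight, whereas in the B1 design the number of PEs equals the number of weights.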
FIR Systolic Array (Design F)
- The F design is derived by selecting the projection vector, processor space vector and scheduling vector as follows:
- Edge mapping:
  e^T             | p^T e | s^T e
  Weight