
DESIGN & IMPLEMENTATION OF SYSTOLIC ARRAY ARCHITECTURE

SWETA SINGH1 & N B SINGH2

1Worked on M.Tech (VLSI Design), CSIR-CEERI, Banasthali Vidyapeeth University, Vanasthali, Rajasthan, India

2Chief Scientist, MEMS MS & RF ICs Design, CSIR-Central Electronics Engineering Research Institute (CEERI), Pilani, Rajasthan, India

ABSTRACT

The paper describes the implementation of a 2-D systolic array matrix multiplier architecture in RTL using a one-dimensional array, targeting the design to appropriate FPGA/PROM/CPLD devices. It also discusses the digital realisation of a binary multiplier. The system development started with a top-down planning approach, and the blocks were designed using bottom-up implementation. The programs were written, simulated and synthesized using the Mentor Graphics tools ModelSim and Leonardo Spectrum. Results are presented in the paper. The design presented in the paper is an integral part of a higher-level efficient systolic architecture.

KEYWORDS: Systolic Array, DSP, Verilog and HDL

INTRODUCTION

A parallel algorithm [5] is an algorithm which can be executed a piece at a time on many different processing devices, with the pieces put back together at the end to get the correct result. Parallel algorithms [8] are valuable because of substantial improvements in multiprocessing systems and the rise of multi-core processors. In general, it is easier to construct a computer with a single fast processor than one with many slow processors with the same throughput. But processor speed is increased primarily by shrinking the circuitry, and modern processors are pushing physical size and heat limits. These twin barriers have flipped the equation, making multiprocessing practical even for small systems.

Modelling parallel algorithms is more complicated than modelling sequential algorithms because, in practice, parallel computers tend to vary more in organization than sequential computers do. As a consequence, a large portion of the research on parallel algorithms has gone into the question of modelling. Although there has been no consensus on the right model, this research has yielded a better understanding of the relationships between the models. Any discussion of parallel algorithms requires some understanding of the various models and the relationships among them.

In many situations, hardware description languages (HDLs) such as VHDL, Verilog or SystemC are used to develop the functionality of a digital system, while timing and control signal generation are neglected. A methodology was used in which the hardware structure of the digital system under consideration was first laid out conceptually. The system development started with a top-down planning approach, and the blocks were designed using bottom-up implementation. The programs were written, simulated and synthesized using Electronic Design Automation (EDA) tools such as ModelSim and Leonardo Spectrum. Instructions such as transfer, arithmetic, logic, input, output and control instructions were implemented.

Flynn's taxonomy is a classification of computer architectures proposed by Michael J. Flynn in 1966. According to Flynn's taxonomy, parallelism can be established by using one of these models:

SISD has no parallelism. It is used in sequential computing systems.

International Journal of Electrical and

Electronics Engineering Research (IJEEER)

ISSN 2250-155X

Vol. 3, Issue 4, Oct 2013, 117-128

© TJPRC Pvt. Ltd.


SIMD establishes data parallelism.

MISD establishes instruction parallelism.

MIMD establishes both data and instruction parallelism.

Flynn’s taxonomy distinguishes multi-processor computer architectures according to how they can be classified

along the two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible

states: Single or Multiple. The matrix below defines the 4 possible classifications according to Flynn:

Figure 1: Flynn’s Taxonomy Classifications
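The four classes described above can be written down as a small lookup table. The following Python sketch is only an illustration of the two-dimensional classification; the names `FLYNN_CLASSES` and `classify` are assumptions introduced here and do not come from the paper.

```python
# Flynn's taxonomy as a lookup keyed by
# (instruction-stream multiplicity, data-stream multiplicity).
# The dictionary layout and function name are illustrative assumptions.
FLYNN_CLASSES = {
    ("single", "single"): "SISD",      # no parallelism, sequential machine
    ("single", "multiple"): "SIMD",    # data parallelism
    ("multiple", "single"): "MISD",    # instruction parallelism
    ("multiple", "multiple"): "MIMD",  # both data and instruction parallelism
}


def classify(instruction_streams: str, data_streams: str) -> str:
    """Return the Flynn class for the given stream multiplicities."""
    return FLYNN_CLASSES[(instruction_streams, data_streams)]


if __name__ == "__main__":
    for (instr, data), name in FLYNN_CLASSES.items():
        print(f"{instr:>8} instruction, {data:>8} data -> {name}")
```

Each dimension takes one of two states, so the table has exactly four entries.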

Processors range from scalar to superscalar; the simplest are scalar processors. Each instruction executed by a scalar processor typically manipulates one or two data items at a time. By contrast, each instruction executed by a vector processor operates simultaneously on many data items. An analogy is the difference between scalar and vector arithmetic. A superscalar processor is a mixture of the two. Each instruction processes one data item, but there are multiple redundant functional units within each CPU, so multiple instructions can process separate data items concurrently.

In a superscalar architecture, a processor with multiple functional units can dispatch multiple instructions in a single control step to its redundant functional units; when it is also integrated with pipeline features, it retains this enhanced capability. A superscalar processor thus has redundant resources. Existing binary executable programs have varying degrees of intrinsic parallelism. During the project, the models classified above will be used to establish the architecture for the parallel processor. The applications may be the parallel ALU and array processors, i.e. systolic array architecture.

Most of today's algorithms are sequential; they specify a sequence of steps in which each step consists of a single operation. These algorithms are well suited to today's computers, which basically perform operations in a sequential fashion. Although the speed at which sequential computers operate has been improving at an exponential rate for many years, the improvement is now coming at greater and greater cost. As a consequence, researchers have sought more cost-effective improvements by building parallel computers that perform multiple operations in a single step.

In order to solve a problem efficiently on a parallel machine, it is usually necessary to design an algorithm that

specifies multiple operations on each step, i.e., a parallel algorithm. As an example, consider the problem of computing the

sum of a sequence A of n numbers.

The standard algorithm computes the sum by making a single pass through the sequence, keeping a running sum of the numbers seen so far. It is not difficult, however, to devise an algorithm for computing the sum that performs many operations in parallel. For example, suppose that, in parallel, each element of A with an even index is paired and summed with the next element of A, which has an odd index, i.e., A[0] is paired with A[1], A[2] with A[3], and so on. The result is a new sequence of ⌈n/2⌉ numbers that sum to the same value as the sum that we wish to compute. This pairing and summing step can be repeated until, after ⌈log2 n⌉ steps, a sequence consisting of a single value is produced, and this value is equal to the final sum.
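The pairing scheme can be sketched in a few lines of Python. This is a sequential emulation under stated assumptions: the function name `pairwise_sum` is hypothetical, and the additions inside one round are merely independent of each other, so they could run in parallel on suitable hardware but are executed one after another here.

```python
def pairwise_sum(values):
    """Sum a sequence by repeated pairing.

    Returns (total, rounds), where rounds is the number of pairing steps.
    """
    level = list(values)
    if not level:
        return 0, 0
    rounds = 0
    while len(level) > 1:
        # The pairs (level[0], level[1]), (level[2], level[3]), ... do not
        # depend on each other, so a parallel machine could add them all
        # in a single step.
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            paired.append(level[-1])  # odd element carries over unchanged
        level = paired
        rounds += 1
    return level[0], rounds


if __name__ == "__main__":
    total, rounds = pairwise_sum(range(8))
    print(total, rounds)
```

For a sequence of 8 numbers the loop runs 3 rounds, and for 5 numbers it also runs 3 rounds, in line with the ⌈log2 n⌉ bound.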

The parallelism in an algorithm can yield improved performance on many different kinds of computers. For example, on a parallel computer, the operations in a parallel algorithm can be performed simultaneously by different processors. Furthermore, even on a single-processor computer the parallelism in an algorithm can be exploited by using multiple functional units, pipelined functional units, or pipelined memory systems.

Thus, it is important to make a distinction between the parallelism in an algorithm and the ability of any particular computer to perform multiple operations in parallel. Of course, in order for a parallel algorithm to run efficiently on any type of computer, the algorithm must contain at least as much parallelism as the computer, for otherwise resources would be left idle. Unfortunately, the converse does not always hold: some parallel computers cannot efficiently execute all algorithms, even if the algorithms contain a great deal of parallelism. Experience has shown that it is more difficult to build a general-purpose parallel machine than a general-purpose sequential machine.

OBJECTIVE(S) AND SCOPE

A systolic array has been modelled in the Verilog Hardware Description Language. It is a small integral part of the full search block matching algorithm (FSBMA) for motion estimation and compensation [10-11], which underlies video sequence compression and is realized using systolic array architectures. The objective of this project is to implement a 2D systolic array on which 3*3 matrix multiplications can be performed, for use in FSBMA. A large number of systolic array designs have been developed and used to perform a broad range of computations. In fact, recent advances in theory and software have allowed some of these systolic arrays to be derived automatically. The following is a representative list of computations for which systolic designs exist.

Signal and Image Processing: Digital filters, convolution and correlation, discrete Fourier transform, fast Fourier transform (FFT), encoding/decoding for compression.

Matrix Arithmetic: Matrix multiplication, solution of linear systems of equations, solution of Toeplitz linear systems, QR-decomposition, least-squares computation, singular value decomposition, eigenvalue computation, etc.

Technology Used

To model 2D systolic array matrix multiplication, ModelSim is used for compilation and simulation. Leonardo Spectrum is used to obtain the synthesis report and the RTL implementation, and to view the technology schematic for Xilinx Virtex-II Pro FPGA, PROM and CPLD devices.

METHODOLOGY

The Systolic Array [1-4] in this design is an integral part of the main processor. The systolic portion of the processor is treated as an array of ALUs, and it is controlled in very much the same way as a scalar ALU [6-7].

Systolic arrays are a family of parallel computer architectures capable of using a very large number of processors simultaneously for important computations in applications such as scientific computing and signal processing. This paper gives a general description of systolic arrays, illustrates the idea by two simple examples, lists some applicable computations, and describes fine-grain interprocessor communication in systolic arrays.

Systolic arrays are suited for processing repetitive computations. Although this kind of computation usually


requires a great deal of computing power, such computations are highly regular and parallelizable. The systolic array

architecture exploits this regularity and parallelism to deliver the required computational speed. Being able to perform

many operations simultaneously is just one of the many advantages of systolic arrays. Other advantages include modular

expandability of the cell array, simple and regular data and control flows, simple and uniform cells, efficient fault-tolerant

schemes, and nearest-neighbor data communications.

These properties are highly desirable for VLSI (Very Large-Scale Integration) implementations. Indeed, the

advances in VLSI technology have been a major motivation for much interest in systolic arrays.

A systolic array is an arrangement of processors in an array where data flows synchronously across the array between neighbours, usually with different data flowing in different directions [9]. At each step, each processor takes in data from one or more neighbours (e.g. North and West), processes it and, in the next step, outputs results in the opposite direction (South and East). Systolic arrays are a specialized form of parallel computing in which processors are connected by short wires. An example of a two-dimensional systolic array is given in Figure 2 below.

Figure 2: Architecture of Systolic Array [9]

The array given above takes in inputs in parallel, performs parallel processing and outputs the result. Unlike many other forms of parallelism, systolic arrays do not lose speed due to their interconnections. The cells, i.e. Processing Elements (PEs), compute data and store it independently of each other. Each cell (PE) is an independent processor and has some registers and Arithmetic and Logic Units (ALUs).

The cells (Processing Elements) share information with their neighbours after performing the needed operations on the data. For example, multiplying two N*N matrices with N = 3 requires N^3 = 27 multiply-accumulate operations according to the following formula:

For I = 1 to 3

For J = 1 to 3

For K = 1 to 3

P[I,J] = P[I,J] + A[I,K] * B[K,J];

End

End

End

But using systolic arrays [9] it can be done in only 9 clock pulses.


Figure 3: 3x3 Systolic Array Architecture [9]

An example of a systolic array is shown in Figure 3 above. Each cell takes in inputs from the top and the left, multiplies the two numbers and accumulates the product in the local register inside the Processing Element. After 9 clock pulses the result is stored in each processing element. Full search block matching needs N^2 subtractions, N^2 magnitude operations and N^2 magnitude accumulations. Hence systolic arrays can be used to perform these operations due to their advantageous properties such as regularity, modularity and local communication. The value stored in the local register of each cell is given as follows:

P1= a11b11+ a12b21 + a13b31

P2= a11b12+ a12b22 + a13b32

P3= a11b13+ a12b23 + a13b33

P4= a21b11+ a22b21 + a23b31

P5= a21b12+ a22b22 + a23b32

P6= a21b13+ a22b23 + a23b33

P7= a31b11+ a32b21 + a33b31

P8= a31b12+ a32b22 + a33b32

P9= a31b13+ a32b23 + a33b33
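The accumulation of P1 to P9 can be checked with a small behavioural model. The Python sketch below is not the authors' Verilog design; it assumes an output-stationary array in which rows of A enter from the left and columns of B enter from the top, each skewed by one clock per row or column, and the name `systolic_matmul` is hypothetical. The loop runs for 3n - 2 steps, which for n = 3 corresponds to the 7 clock steps of Table 1 in which any cell is active.

```python
def systolic_matmul(A, B):
    """Behavioural model of an n x n output-stationary systolic array."""
    n = len(A)
    acc = [[0] * n for _ in range(n)]    # local accumulator of each cell
    a_reg = [[0] * n for _ in range(n)]  # A values travelling left to right
    b_reg = [[0] * n for _ in range(n)]  # B values travelling top to bottom
    for t in range(3 * n - 2):
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    k = t - i            # row i of A is delayed by i clocks
                    a_in = A[i][k] if 0 <= k < n else 0
                else:
                    a_in = a_reg[i][j - 1]   # from the left neighbour
                if i == 0:
                    k = t - j            # column j of B is delayed by j clocks
                    b_in = B[k][j] if 0 <= k < n else 0
                else:
                    b_in = b_reg[i - 1][j]   # from the upper neighbour
                acc[i][j] += a_in * b_in     # multiply-accumulate in the cell
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b          # all registers update together
    return acc


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
    for row in systolic_matmul(A, B):
        print(row)
```

The sample matrices are arbitrary integers chosen for illustration. The model uses ordinary integer arithmetic in every cell, so it illustrates the dataflow only and says nothing about the bit widths of the synthesized design.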

SYSTOLIC ARRAY ARCHITECTURE IMPLEMENTATION

Table 1: Operation Executed w.r.t. Clock Performed on Systolic Array

Clock Steps P1 P2 P3 P4 P5 P6 P7 P8 P9

0 - - - - - - - - -

1 a11b11 - - - - - - - -

2 a12b21 a11b12 - a21b11 - - - - -

3 a13b31 a12b22 a11b13 a22b21 a21b12 - a31b11 - -

4 - a13b32 a12b23 a23b31 a22b22 a21b13 a32b21 a31b12 -

5 - - a13b33 - a23b32 a22b23 a33b31 a32b22 a31b13

6 - - - - - a23b33 - a33b32 a32b23

7 - - - - - - - - a33b33

8 - - - - - - - - -
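The entries of Table 1 follow a simple rule: with cells numbered row by row, the cell in row i and column j computes the product a(i,k)·b(k,j) at clock step i + j + k - 2. The Python sketch below regenerates the schedule from that rule; the function name `systolic_schedule` and the dictionary layout are assumptions made for illustration.

```python
def systolic_schedule(n=3):
    """Map clock step -> {cell label: product term} for an n x n array."""
    schedule = {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            cell = f"P{(i - 1) * n + j}"   # cells numbered row by row
            for k in range(1, n + 1):
                step = i + j + k - 2       # skew by row and column position
                schedule.setdefault(step, {})[cell] = f"a{i}{k}b{k}{j}"
    return schedule


if __name__ == "__main__":
    sched = systolic_schedule()
    cells = [f"P{c}" for c in range(1, 10)]
    print("Step " + " ".join(f"{c:>7}" for c in cells))
    for step in sorted(sched):
        row = " ".join(f"{sched[step].get(c, '-'):>7}" for c in cells)
        print(f"{step:>4} {row}")
```

For n = 3 the active steps run from 1 to 7, the same range over which Table 1 shows products.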

In this operation, A and B are two 9-bit inputs and P is the output. The sequences in which the elements of A and B appear are shown in the form of a and b respectively.

A = (a33 a32 a31 a23 a22 a21 a13 a12 a11), B = (b33 b32 b31 b23 b22 b21 b13 b12 b11)

Where A = (010101010), B = (101010101), P = (011010010)


Table 2: 3x3 Systolic Array Stepwise Simulation Results

Clock Steps P1 P2 P3 P4 P5 P6 P7 P8 P9

0 - - - - - - - - -

1 0.1 - - - - - - - -

2 1.0 0.0 - 1.1 - - - - -

3 0.1 1.1 0.1 0.0 1.0 - 0.1 - -

4 - 0.0 1.0 1.1 0.1 1.1 1.0 0.0 -

5 - - 0.1 - 1.0 0.0 0.1 1.1 0.1

6 - - - - - 1.1 - 0.0 1.0

7 - - - - - - - - 0.1

8 - - - - - - - - -

Sum 0 1 0 0 1 0 1 1 0

Cout 1 0 1 0

Figure 4: 3x3 Systolic Array System View by M.G. Leonardo Spectrum

Synthesis Resources Summary:

*******************************************************

Cell: systolic3x View: INTERFACE Library: work

*******************************************************

Total Accumulated Area

Number of BUFGP 1

Number of Dffs or Latches 9

Number of Function Generators 61

Number of IBUF 18

Number of OBUF 9

Number of accumulated instances 98

Number of ports 28

Number of nets 117

Number of instances 98

Number of references to this view 0

Cell Library References Total Area

BUFGP xcv2p 1 x 1 1 BUFGP

FD xcv2p 9 x 1 9 Dffs or Latches

IBUF xcv2p 18 x 1 18 IBUF

LUT2 xcv2p 11 x 1 11 Function Generators



OBUF xcv2p 9 x 1 9 OBUF

xcv2p - -

xcv2p - -

Number of global buffers used: 1

***************************************

Device Utilization for 2VP2fg256

***************************************


Resource Used Avail Utilization

IOs 27 140 19.29%

Global Buffers 1 16 6.25%

Function Generators 61 2816 2.17%

CLB Slices 31 1408 2.20%

Dffs or Latches 9 3236 0.28%

Block RAMs 0 12 0.00%

Block Multipliers 0 12 0.00%

Using wire table: xcv2p-2-7_wc

Clock Frequency Report

Clock : Frequency

clk : 71.8 MHz

Figure 5: 3x3 Systolic Array Schematic Interface_XRTL View

Figure 6: 3x3 Systolic Array Technology Schematic

Here, X = (010101010), Y = (101010101), Z = X x Y = (011010010)


Figure 7: 3x3 Systolic Array System Simulation Window

9-BIT x 9-BIT BINARY MULTIPLIER IMPLEMENTATION

Mentor Graphics Leonardo Spectrum Synthesis Report for 9x9 Binary Multiplier RTL Design.

Figure 8: 9x9 Bit Binary 1-d Array Efficient Multiplier Technology View

X = (x8 x7 x6 x5 x4 x3 x2 x1 x0) and Y = (y8 y7 y6 y5 y4 y3 y2 y1 y0)

Z= X*Y

Synthesis Summary Report:

*******************************************************

Cell: bin_mult_9bit View: INTERFACE Library: work

*******************************************************

Total Accumulated Area

Number of Function Generators 219

Number of IBUF 18

Number of MUXF5 5

Number of OBUF 18

Number of accumulated instances 260

Number of ports 37

Number of nets 278

Number of instances 260

Number of references to this view 0

Cell Library References Total Area

IBUF xcv2p 18 x 1 18 IBUF




MUXF5 xcv2p 5 x 1 5 MUXF5

OBUF xcv2p 18 x 1 18 OBUF


Number of global buffers used: 0

***********************************************

Device Utilization for 2VP2fg256

***********************************************

Resource Used Avail Utilization

IOs 37 140 26.43%

Global Buffers 0 16 0.00%

Function Generators 219 2816 7.78%

CLB Slices 110 1408 7.81%

Dffs or Latches 0 3236 0.00%

Block RAMs 0 12 0.00%

Block Multipliers 0 12 0.00%

Using wire table: xcv2p-2-7_wc

Figure 9: 9x9 Bit Binary 1-d Array Multiplier Interface_XRTL Schematic View

Figure 10: 9x9 Bit Binary 1-d Array Multiplier Technology Schematic View

A = (a8 a7 a6 a5 a4 a3 a2 a1 a0) and B = (b8 b7 b6 b5 b4 b3 b2 b1 b0)

A = (0 1 0 1 0 1 0 1 0) and B = (1 0 1 0 1 0 1 0 1)

P= A*B = (1110001001110010)


Table 3: Partial Product of Multiplier

Term Expression Values Pi* Pi

P0 a0b0 0.1 0 0

P1 a0b1+a1b0 0.0+1.1 1 1

P2 a0b2+a1b1+a2b0 0.1+1.0+0.1 0 0

P3 a0b3+a1b2+a2b1+a3b0 0.0+1.1+0.0+1.1 10 0

P4 a0b4+a1b3+a2b2+a3b1+a4b0 0.1+1.0+0.1+1.0+0.1 0 1

P5 a0b5+a1b4+a2b3+a3b2+a4b1+a5b0 0.0+1.1+0.0+1.1+0.0+1.1 11 1

P6 a0b6+a1b5+a2b4+a3b3+a4b2+a5b1+a6b0 0.1+1.0+0.1+1.0+0.1+1.0+0.1 0 1

P7 a0b7+a1b6+a2b5+a3b4+a4b3+a5b2+a6b1+a7b0 0.0+1.1+0.0+1.1+0.0+1.1+0.0+1.1 100 0

P8 a0b8+a1b7+a2b6+a3b5+a4b4+a5b3+a6b2+a7b1+a8b0 0.1+1.0+0.1+1.0+0.1+1.0+0.1+1.0+0.1 0 0

P9 a1b8+a2b7+a3b6+a4b5+a5b4+a6b3+a7b2+a8b1 1.1+0.0+1.1+0.0+1.1+0.0+1.1+0.0 100 1

P10 a2b8+a3b7+a4b6+a5b5+a6b4+a7b3+a8b2 0.1+1.0+0.1+1.0+0.1+1.0+0.1 0 0

P11 a3b8+a4b7+a5b6+a6b5+a7b4+a8b3 1.1+0.0+1.1+0.0+1.1+0.0 100 0

P12 a4b8+a5b7+a6b6+a7b5+a8b4 0.1+1.0+0.1+1.0+0.1 0 0

P13 a5b8+a6b7+a7b6+a8b5 1.1+0.0+1.1+0.0 11 1

P14 a6b8+a7b7+a8b6 0.1+1.0+0.1 0 1

P15 a7b8+a8b7 1.1+0.0 1 1

P16 a8b8 0.1 0 0

Where Pi* = Pi+2,i+1,i = Pi + Ci-1, and Pi denotes the partial products of the binary multiplier.
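The column expressions of Table 3 can be evaluated mechanically. The following Python sketch forms each column sum of a_j·b_k with j + k = i and then resolves the carries from the least significant column upwards. It is a software illustration rather than the synthesized RTL: the helper names `to_bits`, `column_sums` and `resolve_carries` are hypothetical, and the simple ripple of multi-bit carries used here is an assumption that may differ from the carry scheme of the actual design.

```python
def to_bits(value, width):
    """Return the bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def column_sums(a_bits, b_bits):
    """Raw column sums: cols[i] = sum of a_j AND b_k over all j + k = i."""
    cols = [0] * (len(a_bits) + len(b_bits) - 1)
    for j, a in enumerate(a_bits):
        for k, b in enumerate(b_bits):
            cols[j + k] += a & b
    return cols


def resolve_carries(cols):
    """Propagate carries upwards; returns product bits, LSB first."""
    bits, carry = [], 0
    for c in cols:
        total = c + carry
        bits.append(total & 1)   # bit kept in this column
        carry = total >> 1       # remainder moves to the next column
    while carry:
        bits.append(carry & 1)
        carry >>= 1
    return bits


if __name__ == "__main__":
    a = to_bits(0b010101010, 9)  # A from the example above
    b = to_bits(0b101010101, 9)  # B from the example above
    bits = resolve_carries(column_sums(a, b))
    print("".join(str(bit) for bit in reversed(bits)))
```

With A = (010101010) and B = (101010101), the resolved bits read from the most significant end as 01110001001110010, which is the 16-bit product P = A*B quoted above with one leading zero.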

Here, X = (010101010) Y = (101010101) Z = X*Y = 010000100111011010

Figure 11: Binary Multiplier System Simulation Window

CONCLUSIONS

An efficient two-dimensional systolic array matrix multiplier and a binary multiplier in MAC architecture, both using one-dimensional input and output arrays, were presented in the paper. The designs were realised in Verilog, and the simulation results were presented in the ModelSim simulation window; post-synthesis simulation was also performed. Synthesis reports were included in the paper. Parallel architecture simulation was performed in HDL using fork and join statements for parallel ALU operations in Verilog.


REFERENCES

1. Lang, T., and Moreno, J. H. "Matrix Computations on Systolic-type Meshes," Computer, 23, 4 (April), 32-51, 1990. Begins with an excellent tutorial on systolic parallel processing.

2. Quinton, P., Robert, Y., and Craig, I. Systolic Algorithms & Architectures. Upper Saddle River, NJ: Prentice Hall, 1991. Evans, D. J. (ed.) Systolic Algorithms. London: Gordon & Breach.

3. Gruska, J. Systolic Computation. New York: Springer-Verlag, 1992. Megson, G. M. An Introduction to Systolic Algorithm Design. Oxford: Oxford Science Publications.

4. Moreno, J. H., and Lang, T. Matrix Computations on Systolic-Type Arrays. New York: Kluwer-Academic Press, 1992.

5. Petkov, N. Systolic Parallel Processing. Amsterdam: North-Holland, 1993.

6. Jan M. Rabaey, "Digital Integrated Circuits". Prentice-Hall of India.

7. Douglas A. Pucknell and Kamran Eshraghian, "Basic VLSI Design". Third edition, PHI.

8. http://en.wikipedia.org/wiki/Parallel_algorithm

9. Jonathan Break, "Systolic Arrays & Their Applications", http://www.cs.ucf.edu/courses/cot4810/fall04/presentations/Systolic_Arrays.ppt.

10. Mohammad Mahdi Azadfar, Implementation of A Optimized Systolic Array Architecture for FSBMA using FPGA for Real-time Applications, IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.3, March 2008.

11. ITU-T H.264/Advanced video coding for generic audio visual Services, Infrastructure of audiovisual services – Coding of moving video, ITU-T Recommendation H.264, 2005.
