1
Fast Communication for Multi-Core SOPC
Technion – Israel Institute of Technology
Department of Electrical Engineering
High Speed Digital Systems Lab
End Project Presentation
Supervisor: Evgeny Fiksman
Performed by: Moshe Bino, Alex Tikh
Spring 2007
One year project
2
Table of Contents
• Introduction
• Hardware Design
• Software Design
• Debug Process
• Results
• Future Research
4
Problem Statement
A single CPU is reaching its technological limits, e.g. heat dissipation and cost/power ratio. Thus parallel computing evolved, utilizing the multi-core processor paradigm.
Three major inter-core communication techniques are:
• Shared memory
• Remote procedure calls
• Message passing (MPI)
Introduction
5
Project Overview
[Figure: mesh NoC on the FPGA – routers ("R") connected by bidirectional links; each router attaches an IP core module, labeled by its rank number (#1–#4) within its Comm (Comm #1–#4).]
• Mesh topology NoC
• Routing nodes
• Leaf processor cores
• MPI logically defines clusters: Comm, Rank
• Core count is limited only by chip resources
Introduction
6
Project Goals
The following components are to be implemented:
• Quad-core system
• NoC router (4 ports) and infrastructure for fast communication in a multi-core system
• Chosen MPI functions, written in C
• A software application demonstrating the advantages of a parallel system (written in C)
Introduction
7
Block Diagram
Introduction
[Figure: four MicroBlaze cores (#1–#4) around a central MPI router, each connected through bidirectional FSL buses; each core has local memory on its LMB bus and an interrupt handler; the main core also connects to I/O over the OPB bus; a clock multiplier (x1/x4) drives separate MB and router clock domains.]
* OPB – On-chip Peripheral Bus
* FSL – Fast Simplex Link
* LMB – Local Memory Bus
• Multi-core IPs
• Bidirectional FSL links
• Local memory
• Main core connected to I/O
• Multi-clock domain
• System on programmable chip implemented on FPGA
8
System Specifications
Constraints:
• FPGA (V2P) maximum clock frequency: 400 MHz
• MicroBlaze (MB) core maximum frequency: 100 MHz
• Router maximum frequency: 200 MHz
• Processor memory size: 64 KB (code + data)
• Processor-to-FSL access time: 3 clock cycles
• Maximum FSL buffer depth: 128 words = 0.5 KB
• Interrupt handling time: 20 clock cycles (no interrupt nesting)
Measurement system:
• Router runs at 100 MHz
• MB runs at 25 MHz
• FSL depth: 64 words = 0.25 KB
• The router is designed for relatively small messages – max. 1 KB – due to processor and FPGA chip memory size
Introduction
10
Router Implementation
[Figure: the router's crossbar connects four MicroBlaze cores; each core attaches through an input FSL and an output FSL, with a Head/Tail (H/T) control bit marking message boundaries.]
Hardware Design
11
Cross Bar
Hardware Design
Cross Bar – Low Level
[Figure: the crossbar's internal structure – a central Permission Unit with a Timer & Enable Unit, and one FSM per port (#1–#4). Each port FSM talks to its FSL (Fsl_S_Data/Read/Control/HasData on the slave side; Fsl_M_Data/Write/Control/Full on the master side) and to the shared 32-bit data bus through a Bus II & Data Bus Interface. Control Bus I/II carry Req, Dest, and Permit signals between the port FSMs and the Permission Unit; Bcast request and priority lines support broadcasting.]
• Two main units: Permission Unit, Port FSM
• Time-limited Round Robin arbiter
• Port-to-port & broadcasting
• Smart connectivity: R–R, R–Core
• Modular design
13
System Software Layers
• Application – MPI functions interface
• Network – hardware-independent implementation
• Data – relies on message structure
• Physical – designed for the FSL bus
Design modularity in hardware and software
[Figure: the four-layer stack (Application / Network / Data / Physical) instantiated identically on MicroBlaze #1 and MicroBlaze #2, communicating through MPI.]
Software design
14
MPI Function Set
Some of the implemented functions are trivial, and required by the MPI standard:
MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize
Non-trivial functions, used for inter-processor communication:
MPI_Send, MPI_Recv, MPI_Bcast
Three additional complementary functions supply extra information about the received message:
MPI_Get_source, MPI_Get_count, MPI_Get_tag
Software design
16
Debug – Simulation
[Figure: HW simulation environment – generators feed an input messages file; the test bench drives Fsl_in_data/Fsl_out_data and logs to output messages files.]
• The test bench writes messages to the FSL pipe (MB output side) and reads messages from the pipe (MB input side).
• Signals can also be viewed in ModelSim.
Debug Process
17
Debug Process – Real Time
• Software debug used a two-MicroBlaze system.
• Software debug was mainly done with printf, plotting results to a monitor through UART.
• Hardware debug used the ChipScope application and LEDs for indication.
[Figure: SW development environment – Core #1 and Core #2 connected through the router.]
Debug Process
19
Example Application: Matrix–Vector Multiplication
• A typical example of a highly parallel application.
• The root processor broadcasts the Vector.
• A selected Matrix Row is sent by the root to each processor.
• Each processor computes and returns its result.
• The computed results are combined into a vector by the root processor.
Results
20
Example Application: Matrix–Vector Multiplication

    [ a11 a12 a13 ] [ b1 ]
    [ a21 a22 a23 ] [ b2 ]  =  each row r computes  ar1*b1 + ar2*b2 + ar3*b3
    [ a31 a32 a33 ] [ b3 ]

[Figure: the four MPI nodes around the router, each computing one row product.]
Results
21
Matrix–Vector Multiplication – Results
[Chart: "Single vs. Multi processor (4)" – counter ticks (0–45000) vs. matrix size, with series Single, Multi, and diff(M–S).]
For simple operations a single processor is preferred.
* Time = ticks / clk frequency
Results
22
Matrix–Vector Multiplication – Results
[Chart: "Single vs. Multi processor (4) – mult n times" – counter ticks (0–70000) vs. matrix size, with series Single(n=2), Multi(n=2), Single(n=3), Multi(n=3).]
When the operation takes more time than the send and receive time, the router becomes efficient.
Results
23
Router Statistics
• The theoretical limit on transfer time is 8 clock cycles per word.
• The limit is calculated from: 3 clks for putfsl + 2 clks for router read & write + 3 clks for getfsl = 8 clks total.
[Chart: "Send/Bcast time per word" – counter ticks (0–25) vs. vector size (3–54), with series Send time per word, Bcast time per word, and the theoretical limit.]
Results
24
Router Statistics: Send vs. Bcast
[Chart: send time and bcast time in counter ticks (0–600) vs. vector size (3–54).]
• Bcast takes more time than send.
• Fitted lines: T_send = 8N + 77 and T_Bcast = 8.751N + 70.34 (N = vector size, times in counter ticks).
• The slope value (8) comes from the transfer time limit.
Results
25
Router Statistics
Results
Timing & throughput statistics (clk = 25 MHz):

  avg send time:                       4.45232E-07 sec   (80% of application use)
  avg Bcast time:                      4.62887E-07 sec   (20% of application use)
  avg transfer time for application:   4.48763E-07 sec/word (32 bit)
  avg transfer time for 1 byte:        1.12191E-07 sec/byte
  Throughput (application dependent):  8.913395174 Mbytes/sec
27
Future Directions
Future research
• Improve router performance to ~400 MHz
• Expand the network to more than 4 processors
28
QUESTIONS?
29
Message Payload
• The Header consists of the fields:

  Name      Size (bits)  Order   Description
  H         1            0       Represents the Header
  DST       4            1:4     The message destination in the COMM
  COMM      4            5:8     The group of cores of the message destination
  CMD       4            9:12    The command name for this message (Send, Bcast)
  TYPE      4            13:16   The data type in this message
  DATA CNT  10           17:26   The number of words in this message

• The Tail consists of the fields:

  Name      Size (bits)  Order   Description
  T         1            0       Represents the Tail
  SRC       4            1:4     The message source port in its SCOMM
  SCOMM     4            5:8     The group of cores of the message source port
  TAG       11           9:19    Message code; groups messages on the same topic/issue

* Empty fields were left to allow network and functionality extensions.
Introduction
30
Example 1
• At each time slot, part of the message is sent to its destination, as long as the destination port is not busy.
• When a port is busy, the next requesting port is serviced (no delay).
[Figure: messages delivery example – per-port message streams (H, data words, T) interleaved across time slots, with DST/SRC tracking per slot and "Next" markers showing the arbiter advancing between ports 1–4.]
Hardware Design
31
Example 2
• If one port has no data (port 2), the other ports are serviced in order.
[Figure: messages delivery example – with port 2 idle, the arbiter cycles among ports 1, 3 and 4.]
Hardware Design
32
Example 3
• Handling the BCAST command, and port arbitration while two ports have the same destination.
[Figure: messages delivery example – a BCAST message is delivered to all destination ports (1–4) while the arbiter resolves two ports contending for the same destination.]
Hardware Design
33
Sending the Message
MPI_Send composes the header and tail, and sends them with the message body:
Compose Header and Tail → Send Header → Send Message (body) → Send Tail → return
Software Design
34
Receiving the Message
The Interrupt Vector receives incoming messages and stores them in the appropriate linked list:
Receive Header → Allocate Node according to Header info → Receive Message → Receive Tail → Construct Node → Add Node to end of appropriate list → return
Software Design
35
Return Received Message
MPI_Recv receives the message details from the user, then looks for that message in the linked list of already-received messages:
Compose search Key for linked list → Look for message Node → Is Node found?
• yes: Remove from list → Return message
• no: Jump to Interrupt Vector (when it returns, some message has arrived)
Software Design