1
Fast Communication for Multi-Core SOPC
Technion – Israel Institute of Technology
Department of Electrical Engineering
High Speed Digital Systems Lab
End Project Presentation
Supervisor: Evgeny Fiksman
Performed by: Moshe Bino, Alex Tikh
Spring 2007
One year project
2
Table of Contents
• Introduction
• Hardware Design
• Software Design
• Debug Process
• Results
• Future Research
4
Problem Statement
A single CPU is reaching its technological limits, e.g. heat dissipation and cost/power ratio. Thus parallel computing evolved, utilizing the multi-core processor paradigm.
Three major inter-core communication techniques are:
• Shared memory
• Remote procedure calls
• Message passing (MPI)
Introduction
5
Project Overview
[Figure: mesh NoC on the FPGA – routers ("R") connected by bidirectional links; each router attaches an IP core module, labeled by its rank number (#1–#4) within its Comm (Comm #1–#4).]
• Mesh topology NoC
• Routing nodes
• Leaf processor cores
• MPI logically defines clusters: Comm, Rank
• Core count is limited only by chip resources
Introduction
6
Project Goals
The following components are to be implemented:
• Quad-core system
• NoC router (4 ports) and infrastructure for fast communication in a multi-core system
• Chosen MPI functions, written in C
• A software application demonstrating the advantages of a parallel system (written in C)
Introduction
7
Block Diagram
Introduction
[Figure: four MicroBlaze cores (#1–#4) around a central MPI router, each connected through bidirectional FSL buses; each core has local memory on its LMB bus and an interrupt handler; the main core also connects to I/O over the OPB bus; a clock multiplier (x1/x4) drives separate MB and router clock domains.]
* OPB – On-chip Peripheral Bus
* FSL – Fast Simplex Link
* LMB – Local Memory Bus
• Multi-core IPs
• Bidirectional FSL links
• Local memory
• Main core connected to I/O
• Multi-clock domain
• System on programmable chip implemented on FPGA
8
System Specifications
Constraints:
• FPGA (V2P) maximum clock frequency: 400 MHz
• MicroBlaze (MB) core maximum frequency: 100 MHz
• Router maximum frequency: 200 MHz
• Processor memory size: 64 KB (code + data)
• Processor-to-FSL access time: 3 clock cycles
• Maximum FSL buffer depth: 128 words = 0.5 KB
• Interrupt handling time: 20 clock cycles (no interrupt nesting)
Measurement system:
• Router runs at 100 MHz
• MB runs at 25 MHz
• FSL depth: 64 words = 0.25 KB
• The router is designed for relatively small messages – max. 1 KB – due to processor and FPGA chip memory size
Introduction
10
Router Implementation
[Figure: the router's crossbar connects four MicroBlaze cores; each core attaches through an input FSL and an output FSL, with a Head/Tail (H/T) control bit marking message boundaries.]
Hardware Design
11
Cross Bar
Hardware Design
Cross Bar – Low Level
[Figure: the crossbar's internal structure – a central Permission Unit with a Timer & Enable Unit, and one FSM per port (#1–#4). Each port FSM talks to its FSL (Fsl_S_Data/Read/Control/HasData on the slave side; Fsl_M_Data/Write/Control/Full on the master side) and to the shared 32-bit data bus through a Bus II & Data Bus Interface. Control Bus I/II carry Req, Dest, and Permit signals between the port FSMs and the Permission Unit; Bcast request and priority lines support broadcasting.]
• Two main units: Permission Unit, Port FSM
• Time-limited Round Robin arbiter
• Port-to-port & broadcasting
• Smart connectivity: R–R, R–Core
• Modular design
13
System Software Layers
• Application – MPI functions interface
• Network – hardware-independent implementation
• Data – relies on message structure
• Physical – designed for the FSL bus
Design modularity in hardware and software
[Figure: the four-layer stack (Application / Network / Data / Physical) instantiated identically on MicroBlaze #1 and MicroBlaze #2, communicating through MPI.]
Software design
14
MPI Function Set
Some of the implemented functions are trivial, and required by the MPI standard:
MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize
Non-trivial functions, used for inter-processor communication:
MPI_Send, MPI_Recv, MPI_Bcast
Three additional complementary functions supply extra information about the received message:
MPI_Get_source, MPI_Get_count, MPI_Get_tag
Software design
16
Debug – Simulation
[Figure: HW simulation environment – generators feed an input messages file; the test bench drives Fsl_in_data/Fsl_out_data and logs to output messages files.]
• The test bench writes messages to the FSL pipe (MB output side) and reads messages from the pipe (MB input side).
• Signals can also be viewed in ModelSim.
Debug Process
17
Debug Process – Real Time
• Software debug used a two-MicroBlaze system.
• Software debug was mainly done with printf, plotting results to a monitor through UART.
• Hardware debug used the ChipScope application and LEDs for indication.
[Figure: SW development environment – Core #1 and Core #2 connected through the router.]
Debug Process
19
Example Application: Matrix–Vector Multiplication
• A typical example of a highly parallel application.
• The root processor broadcasts the Vector.
• A selected Matrix Row is sent by the root to each processor.
• Each processor computes and returns its result.
• The computed results are combined into a vector by the root processor.
Results
20
Example Application: Matrix–Vector Multiplication

    [ a11 a12 a13 ] [ b1 ]
    [ a21 a22 a23 ] [ b2 ]  =  each row r computes  ar1*b1 + ar2*b2 + ar3*b3
    [ a31 a32 a33 ] [ b3 ]

[Figure: the four MPI nodes around the router, each computing one row product.]
Results
21
Matrix–Vector Multiplication – Results
[Chart: "Single vs. Multi processor (4)" – counter ticks (0–45000) vs. matrix size, with series Single, Multi, and diff(M–S).]
For simple operations a single processor is preferred.
* Time = ticks / clk frequency
Results
22
Matrix–Vector Multiplication – Results
[Chart: "Single vs. Multi processor (4) – mult n times" – counter ticks (0–70000) vs. matrix size, with series Single(n=2), Multi(n=2), Single(n=3), Multi(n=3).]
When the operation takes more time than the send and receive time, the router becomes efficient.
Results
23
Router Statistics
• The theoretical limit on transfer time is 8 clock cycles per word.
• The limit is calculated from: 3 clks for putfsl + 2 clks for router read & write + 3 clks for getfsl = 8 clks total.
[Chart: "Send/Bcast time per word" – counter ticks (0–25) vs. vector size (3–54), with series Send time per word, Bcast time per word, and the theoretical limit.]
Results
24
Router Statistics: Send vs. Bcast
[Chart: send time and bcast time in counter ticks (0–600) vs. vector size (3–54).]
• Bcast takes more time than send.
• Fitted lines: T_send = 8N + 77 and T_Bcast = 8.751N + 70.34 (N = vector size, times in counter ticks).
• The slope value (8) comes from the transfer time limit.
Results
25
Router Statistics
Results
Timing & throughput statistics (clk = 25 MHz):

  avg send time:                       4.45232E-07 sec   (80% of application use)
  avg Bcast time:                      4.62887E-07 sec   (20% of application use)
  avg transfer time for application:   4.48763E-07 sec/word (32 bit)
  avg transfer time for 1 byte:        1.12191E-07 sec/byte
  Throughput (application dependent):  8.913395174 Mbytes/sec
27
Future Directions
Future research
• Improve router performance to ~400 MHz
• Expand the network to more than 4 processors
28
QUESTIONS?
29
Message Payload
• The Header consists of the fields:

  Name      Size (bits)  Order   Description
  H         1            0       Represents the Header
  DST       4            1:4     The message destination in the COMM
  COMM      4            5:8     The group of cores of the message destination
  CMD       4            9:12    The command name for this message (Send, Bcast)
  TYPE      4            13:16   The data type in this message
  DATA CNT  10           17:26   The number of words in this message

• The Tail consists of the fields:

  Name      Size (bits)  Order   Description
  T         1            0       Represents the Tail
  SRC       4            1:4     The message source port in its SCOMM
  SCOMM     4            5:8     The group of cores of the message source port
  TAG       11           9:19    Message code; groups messages on the same topic/issue

* Empty fields were left to allow network and functionality extensions.
Introduction
30
Example 1
• At each time slot, part of the message is sent to its destination, as long as the destination port is not busy.
• When a port is busy, the next requesting port is serviced (no delay).
[Figure: messages delivery example – per-port message streams (H, data words, T) interleaved across time slots, with DST/SRC tracking per slot and "Next" markers showing the arbiter advancing between ports 1–4.]
Hardware Design
31
Example 2
• If one port has no data (port 2), the other ports are serviced in order.
[Figure: messages delivery example – with port 2 idle, the arbiter cycles among ports 1, 3 and 4.]
Hardware Design
32
Example 3
• Handling the BCAST command, and port arbitration while two ports have the same destination.
[Figure: messages delivery example – a BCAST message is delivered to all destination ports (1–4) while the arbiter resolves two ports contending for the same destination.]
Hardware Design
33
Sending the Message
MPI_Send composes the header and tail, and sends them with the message body:
Compose Header and Tail → Send Header → Send Message (body) → Send Tail → return
Software Design
34
Receiving the Message
The Interrupt Vector receives incoming messages and stores them in the appropriate linked list:
Receive Header → Allocate Node according to Header info → Receive Message → Receive Tail → Construct Node → Add Node to end of appropriate list → return
Software Design
35
Return Received Message
MPI_Recv receives the message details from the user, then looks for that message in the linked list of already-received messages:
Compose search Key for linked list → Look for message Node → Is Node found?
• yes: Remove from list → Return message
• no: Jump to Interrupt Vector (when it returns, some message has arrived)
Software Design