Send and Receive Based Message-Passing for SCMP. Charles W. Lewis, Jr. Thesis Defense, Virginia Tech, April 28th, 2004.
1
Send and Receive Based Message-Passing for SCMP
Charles W. Lewis, Jr.
Thesis Defense, Virginia Tech
April 28th, 2004
2
This presentation introduces the SCMP architecture, discusses problems with the current SCMP message-passing system, and focuses on the design and performance of a new SCMP message-passing system.
1. Overview of SCMP
2. Original Message-Passing System
3. New Message-Passing System
4. Performance Comparisons
[Figure: two exchanges between nodes A and B. Request/reply: Thread, Data, Sync. Rendezvous: RTS, CTS, Data.]
3
Problems with current design trends motivate the SCMP concept.
As transistor sizes shrink, so do communication wires. This leads to higher cross-chip communication latencies.
ILP faces diminishing returns.
Large and complex uni-processors require extensive amounts of design and verification.
4
SCMP provides PLP through replication.
Up to 64 identical nodes on-chip
Replicated nodes reduce complexity
2-D network eliminates cross-chip wires
SCMP Network with 64 Nodes
5
SCMP provides TLP through multi-thread hardware support.
Up to 16 threads
Round-robin thread scheduling by hardware
On every node:
– 4-stage RISC pipeline
– 8MB memory
– Networking hardware
SCMP Node
6
The original messaging system has two message types.
Thread Message
H  T  Payload
1  0  X | Y | THREAD | 1
0  0  Address
0  0  Register Data
.  .  ...
0  1  Register Data
Data Message
H  T  Payload
1  0  X | Y | DATA | Stride
0  0  Address
0  0  Data
.  .  ...
0  1  Data
Because they contain handling information, these message formats borrow from the Active Messages message-passing system.
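The data message layout above can be sketched in C as a sequence of flits, each carrying a head bit, a tail bit, and a payload. This is only an illustration: the 32-bit payload width, the `MSG_TYPE_DATA` code, and the way X, Y, type, and stride are packed into the header word are assumptions, not details given on the slide.

```c
#include <stddef.h>
#include <stdint.h>

#define MSG_TYPE_DATA 1u  /* hypothetical type code for a data message */

/* One flit: head bit, tail bit, payload word (width is an assumption). */
typedef struct {
    uint8_t  head;
    uint8_t  tail;
    uint32_t payload;
} Flit;

/* Builds an original-format data message into out[]:
   header flit (H=1), address flit, then n data flits, the last with T=1.
   Returns the number of flits written, or 0 if n is 0 or out[] is too small. */
size_t build_data_message(Flit *out, size_t cap,
                          uint8_t x, uint8_t y, uint8_t stride,
                          uint32_t address,
                          const uint32_t *data, size_t n)
{
    if (n == 0 || cap < n + 2)
        return 0;

    /* Illustrative packing of the header fields into one word. */
    uint32_t hdr = ((uint32_t)x << 24) | ((uint32_t)y << 16) |
                   (MSG_TYPE_DATA << 8) | stride;

    out[0] = (Flit){1, 0, hdr};
    out[1] = (Flit){0, 0, address};
    for (size_t i = 0; i < n; i++)
        out[2 + i] = (Flit){0, (uint8_t)(i + 1 == n), data[i]};
    return n + 2;
}
```

Only the final data flit sets the tail bit, which is what lets routers release resources once the whole message has passed.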
7
Network uses wormhole and dimension-order routing.
[Figure: section of the mesh showing nodes 0 to 7 in two rows of four]
Every router multiplexes virtual channel buffers over physical channels.
Head flits claim virtual channel resources as they travel.
If one message blocks, other messages may still continue as long as enough virtual channels are free.
Messages move along the X axis, then the Y axis.
Tail flits release virtual channel resources as they travel.
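The X-then-Y rule can be sketched as a per-router decision function. The port names and the choice that Y grows toward the south (matching a row-major node numbering) are assumptions made for this example.

```c
/* Output port chosen by a router; names are illustrative. */
typedef enum { EAST, WEST, NORTH, SOUTH, EJECT } Port;

/* Dimension-order routing: correct the X offset first, then the Y offset.
   Assumes X grows eastward and Y grows southward. */
Port next_hop(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return EAST;
    if (dst_x < cur_x) return WEST;
    if (dst_y > cur_y) return SOUTH;
    if (dst_y < cur_y) return NORTH;
    return EJECT;  /* arrived: hand the flit to the local node */
}

/* Walks the route and counts the hops taken. */
int hop_count(int x, int y, int dst_x, int dst_y)
{
    int hops = 0;
    for (;;) {
        Port p = next_hop(x, y, dst_x, dst_y);
        if (p == EJECT) return hops;
        if (p == EAST)  x++;
        if (p == WEST)  x--;
        if (p == SOUTH) y++;
        if (p == NORTH) y--;
        hops++;
    }
}
```

Because a message never turns from the Y dimension back into X, the channel dependency graph has no cycles, which is the property the deadlock-freedom argument relies on.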
8
Dimension-order routing is deadlock free as long as messages eventually drain.
Even with VCs, the network can still deadlock if messages don't drain.
If all contexts are consumed, thread messages block at the NIU
– Threads may not release until a data message is received
– Data messages must not be stopped by congested thread messages
Data messages must have a separate path through the network.
[Figure: Router with separate Thread VCs and Data VCs between the West and East ports]
9
The NIU bears most of the messaging load.
[Figure: NIU block diagram with Injection Channel (To Router), Ejection Channel (From Router), Thread Buffer, Data Buffer, Receive Buffer, Contexts, and Memory]
10
Messages are built through assembly instructions.
Instruction  Arguments                            Description
sendh        d_node, type, d_address, d_stride    send a header flit
send         data                                 send one data flit
send2        data, data                           send two data flits
sende        data                                 send one data flit and end message
send2e       data, data                           send two data flits and end message
sendm        l_address, {l_stride, count}         send data block from memory
sendme       l_address, {l_stride, count}         send data block from memory and end message
11
The thread library facilitates thread messages.
Operation     Arguments                 Description
createThread  int dst_node              create a thread on dst_node
              void(*addr)()
              void(*callback)()
              …
parExecute    int num_nodes             create threads on num_nodes nodes
              void(*addr)()
              …
getBlock      unsigned int node_id      request a block of values from node node_id
              char *dst_addr
              unsigned int dst_stride
              char **src_addr
              unsigned int src_offset
              unsigned int src_stride
              unsigned int num_words
12
The send library facilitates data messages.
Operation           Arguments         Description
sendDataIntValue    int dst_node      send an integer to dst_node
                    int *dst_addr
                    int value
sendDataFloatValue  int dst_node      send a double to dst_node
                    double *dst_addr
                    double value
sendDataBlock       int dst_node      send a block of values from memory to dst_node (blocking)
                    int *dst_addr
                    int dst_stride
                    int *src_addr
                    int src_stride
                    int num_words
sendDataBlockNB     int dst_node      send a block of values from memory to dst_node (non-blocking)
                    int *dst_addr
                    int dst_stride
                    int *src_addr
                    int src_stride
                    int num_words
13
The original message-passing system uses requests and replies.
Node A requires data held by Node B
Node A creates a thread on Node B
New thread on Node B sends data to Node A
New thread on Node B sends SYNC message when done
[Figure: A sends a Thread message to B; B sends Data and then Sync back to A]
14
Dynamic memory is a problem.
Data Message
H  T  Payload
1  0  X | Y | DATA | Stride
0  0  Address
0  0  Data
.  .  ...
0  1  Data
Request thread on node B must know:
– Source Address
– Source Stride
– Destination Address
– Destination Stride
– Number of Values to Send
How can Node A know the source address and stride if Node B allocates the buffer dynamically?
Program must contain global pointers
15
In-order delivery of messages is a problem.
SCMP network does not guarantee in-order delivery of messages
SYNC message may reach Node A before data message
Node A will read bad values from memory
[Figure: Sync and Data messages in flight from B to A]
16
Request threads and finite thread contexts are a problem.
If a node holds highly demanded data, request threads may consume all of its contexts
Additional thread messages will block in the network
[Figure: NIU with every context occupied and Thread messages waiting in the network]
17
Send-and-Receive message-passing eliminates all of these problems.
A thread must execute a receive before data will be accepted
– Don't need request messages
Messages are identified abstractly
– Don't need global pointers
Completion notification occurs locally
– Don't need SYNC messages
18
Rendezvous mode uses an RTS/CTS handshake.
Node B holds data required by Node A
Node B sends Node A an RTS message when send is executed
After receive is executed Node A sends Node B a CTS message
Node B sends data after receiving CTS
[Figure: B sends RTS to A; A sends CTS to B; B sends Data to A]
19
Ready mode forgoes the handshake to reduce message latency.
Node B holds data required by Node A
Node B sends data when send is executed
User must ensure that receive has executed on Node A
[Figure: B sends Data to A]
20
The implementation centers around two tables.
Send Table Entry
Bits   33-2  1-0
Field  id    state
Receive Table Entry
Bits   83-50  49-29    28-13   12-7    6-3      2-0
Field  id     address  stride  r_node  r_cntxt  state
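As a quick check on the bit positions shown for the two table entries, the field widths can be computed from the high and low bit of each field. The `Field` type and helper names below are made up for this sketch; only the bit positions come from the slide.

```c
/* A named bit field given by its highest and lowest bit position. */
typedef struct {
    const char *name;
    int hi;
    int lo;
} Field;

/* Bit positions as read from the slide. */
static const Field RECV_FIELDS[] = {
    {"id", 83, 50}, {"address", 49, 29}, {"stride", 28, 13},
    {"r_node", 12, 7}, {"r_cntxt", 6, 3}, {"state", 2, 0},
};

static const Field SEND_FIELDS[] = {
    {"id", 33, 2}, {"state", 1, 0},
};

/* Width of one field in bits. */
int field_width(const Field *f)
{
    return f->hi - f->lo + 1;
}

/* Total width of an entry made of n fields. */
int entry_width(const Field *fields, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += field_width(&fields[i]);
    return total;
}
```

Under this reading, `r_node` is 6 bits wide, enough to name 64 nodes, and `r_cntxt` is 4 bits wide, enough to name 16 thread contexts.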
21
Send Table Entries may be in 4 states, and Receive Table Entries may be in 5 states.
Send Table Entry States
Value  State
00     Empty
01     In Use
10     In Progress
11     Complete
Receive Table Entry States
Value  State
000    Empty
001    In Use
010    In Progress
011    RTS Received
10X    NOT USED
110    NOT USED
111    Complete
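The two state encodings map naturally onto C enumerations. The identifier names below are invented for illustration; the numeric values follow the tables above.

```c
/* Send table entry states (2-bit encoding). */
typedef enum {
    SEND_EMPTY       = 0, /* 00 */
    SEND_IN_USE      = 1, /* 01 */
    SEND_IN_PROGRESS = 2, /* 10 */
    SEND_COMPLETE    = 3  /* 11 */
} SendState;

/* Receive table entry states (3-bit encoding; 100, 101, 110 are unused). */
typedef enum {
    RECV_EMPTY        = 0, /* 000 */
    RECV_IN_USE       = 1, /* 001 */
    RECV_IN_PROGRESS  = 2, /* 010 */
    RECV_RTS_RECEIVED = 3, /* 011 */
    RECV_COMPLETE     = 7  /* 111 */
} RecvState;

/* Returns 1 if the low three bits name a defined receive state, else 0. */
int recv_state_is_valid(unsigned bits)
{
    switch (bits & 7u) {
    case RECV_EMPTY:
    case RECV_IN_USE:
    case RECV_IN_PROGRESS:
    case RECV_RTS_RECEIVED:
    case RECV_COMPLETE:
        return 1;
    default:
        return 0;
    }
}
```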
22
The new messaging system has four message types.
Thread Message
H  T  Payload
1  0  X | Y | THREAD
1  1  Handler Address
0  0  Register Data
.  .  ...
0  1  Register Data
Data Message
H  T  Payload
1  0  X | Y | DATA
1  1  Message ID
0  0  Data
.  .  ...
0  1  Data
RTS Message
H  T  Payload
1  0  X | Y | RTS | cntxt | node
0  1  Message ID
CTS Message
H  T  Payload
1  0  X | Y | CTS | cntxt
0  1  Message ID
23
The NIU now contains a data queue for every context.
[Figure: NIU block diagram with Injection Channel (To Router), Ejection Channel (From Router), Thread Buffer, Data Buffer, Receive Buffer, RTS Buffer, CTS Buffer, per-context data queues, and Memory]
24
Only five new instructions and one modified instruction are needed.
Instruction  Arguments                      Description
sendh        d_node, type, d_address        send a header flit
send         data                           send one data flit
send2        data, data                     send two data flits
sende        data                           send one data flit and end message
send2e       data, data                     send two data flits and end message
sendm        l_address, {l_stride, count}   send data block from memory
sendme       l_address, {l_stride, count}   send data block from memory and end message
ldss         r, message_id                  poll send operation status
ldsr         r, message_id                  poll receive operation status
str          message_id, address, stride    store a receive to table
rms          message_id                     clear a send operation
rmr          message_id                     clear a receive operation
25
The thread library remains nearly the same.
Operation     Arguments                 Description
createThread  int dst_node              create a thread on dst_node
              void(*addr)()
              void(*callback)()
              …
parExecute    int num_nodes             create threads on num_nodes nodes
              void(*addr)()
              …
getBlock      unsigned int node_id      request a block of values from node node_id
              char *dst_addr
              unsigned int dst_stride
              char **src_addr
              unsigned int src_offset
              unsigned int src_stride
              message_id
              unsigned int num_words
26
The new send library is more familiar.
Operation      Arguments         Description
SCMPSendInt    int dst_node      send an integer to dst_node
               int message_id
               int value
SCMPSendFloat  int dst_node      send a double to dst_node
               int message_id
               double value
SCMPSend       int dst_node      send a block of values from memory to dst_node (blocking)
               int message_id
               int *address
               int stride
               int num_words
SCMPSendNB     int dst_node      send a block of values from memory to dst_node (non-blocking)
               int message_id
               int *address
               int stride
               int num_words
SCMPPollSend   int message_id    poll status of send operation
SCMPWaitSend   int message_id    suspend until message sends
SCMPClearSend  int message_id    clear send operation
27
The receive library is all new.
Operation         Arguments         Description
SCMPReceive       int message_id    receive a message and store it at address (blocking)
                  int *address
                  int stride
SCMPReceiveNB     int message_id    receive a message and store it at address (non-blocking)
                  int *address
                  int stride
SCMPPollReceive   int message_id    poll status of receive operation
SCMPWaitReceive   int message_id    suspend until message arrives
SCMPClearReceive  int message_id    clear receive operation
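To show how the post, send, poll, and clear calls fit together, here is a single-process mock of that flow. It is not the SCMP library: the `mock_*` functions are hypothetical stand-ins with simplified signatures (no `dst_node`), strides are assumed to be counted in words, and the send follows the ready-mode rule that data arriving without a posted receive is discarded.

```c
#include <stddef.h>

#define MOCK_SLOTS 8

/* One posted receive, loosely modeled on a receive table entry. */
typedef struct {
    int  in_use;
    int  complete;
    int  message_id;
    int *address;
    int  stride;  /* in words (assumption) */
} MockEntry;

static MockEntry mock_table[MOCK_SLOTS];

static MockEntry *mock_find(int message_id)
{
    for (size_t i = 0; i < MOCK_SLOTS; i++)
        if (mock_table[i].in_use && mock_table[i].message_id == message_id)
            return &mock_table[i];
    return NULL;
}

/* Stand-in for a non-blocking receive: records where the data should go.
   Returns 0, or -1 if the id is already posted or the table is full. */
int mock_receive_nb(int message_id, int *address, int stride)
{
    if (mock_find(message_id))
        return -1;
    for (size_t i = 0; i < MOCK_SLOTS; i++) {
        if (!mock_table[i].in_use) {
            mock_table[i] = (MockEntry){1, 0, message_id, address, stride};
            return 0;
        }
    }
    return -1;
}

/* Stand-in for a ready-mode send: delivers only if a receive is posted.
   Returns 0 on delivery, -1 if the data would be discarded. */
int mock_send(int message_id, const int *address, int stride, int num_words)
{
    MockEntry *e = mock_find(message_id);
    if (!e || e->complete)
        return -1;
    for (int i = 0; i < num_words; i++)
        e->address[i * e->stride] = address[i * stride];
    e->complete = 1;
    return 0;
}

/* Stand-in for polling a receive: 1 once the whole message has arrived. */
int mock_poll_receive(int message_id)
{
    MockEntry *e = mock_find(message_id);
    return e != NULL && e->complete;
}

/* Stand-in for clearing a receive so the id can be reused. */
void mock_clear_receive(int message_id)
{
    MockEntry *e = mock_find(message_id);
    if (e)
        e->in_use = 0;
}
```

The sender never names a destination address; the two sides agree only on `message_id`, which is how the new model avoids global buffer pointers.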
28
Rendezvous Mode Operation at the Sender
[Flowchart. On sendh: No Entry?; Create: Entry->In Use; Queue Head And Tag; SUSPEND; Head Flit @ Queue Head; Send RTS. On CTS Message Arrives: No Entry? (ERROR); Queue Waiting (ERROR); Entry->In Progress; Send Flit; Tail Flit Not Sent; Entry->Complete. Branches are labeled T and F.]
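One simplified reading of the sender-side rendezvous flowchart is a small state machine over a send table entry. The exact branch structure is not recoverable from the flowchart labels alone, so the transitions below are an interpretation, and all type and function names are invented for this sketch.

```c
typedef enum { S_EMPTY, S_IN_USE, S_IN_PROGRESS, S_COMPLETE } SState;
typedef enum { SEV_SENDH, SEV_CTS, SEV_TAIL_SENT } SEvent;
typedef enum { SACT_NONE, SACT_SEND_RTS, SACT_START_DATA,
               SACT_SUSPEND, SACT_ERROR } SAction;

/* Advances one send table entry and reports the action to take. */
SAction send_step(SState *s, SEvent ev)
{
    switch (ev) {
    case SEV_SENDH:
        /* A new send creates the entry and announces itself with an RTS. */
        if (*s == S_EMPTY) { *s = S_IN_USE; return SACT_SEND_RTS; }
        return SACT_SUSPEND;  /* id already in flight: thread waits */
    case SEV_CTS:
        /* A CTS is only meaningful for an entry that is waiting on one. */
        if (*s == S_IN_USE) { *s = S_IN_PROGRESS; return SACT_START_DATA; }
        return SACT_ERROR;
    case SEV_TAIL_SENT:
        if (*s == S_IN_PROGRESS) { *s = S_COMPLETE; return SACT_NONE; }
        return SACT_ERROR;
    }
    return SACT_NONE;
}
```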
29
Rendezvous Mode Operation at the Receiver
[Flowchart. On RTS Message Arrives: No Entry; Record RTS; Entry->RTS Rcvd; In Use; Send CTS; Entry->In Progress; Block RTS. On str: No Entry; Record str; Entry->In Use; RTS Rcvd; Send CTS; Entry->In Progress; SUSPEND. On Data Message Arrives: No Entry (DISCARD); In Progress (Block Data); Store Flit; Tail Flit Not Stored; Entry->Complete. Branches are labeled T and F.]
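The receiver side of rendezvous mode can likewise be sketched as a state machine over one receive table entry. This is an interpretation of the flowchart labels rather than the thesis implementation; the names are invented, and a whole data message is collapsed into a single "tail stored" event.

```c
typedef enum { R_EMPTY, R_IN_USE, R_IN_PROGRESS,
               R_RTS_RCVD, R_COMPLETE } RState;
typedef enum { REV_STR, REV_RTS, REV_DATA_TAIL } REvent;
typedef enum { RACT_NONE, RACT_SEND_CTS, RACT_BLOCK, RACT_DISCARD } RAction;

/* Advances one receive table entry and reports the action to take. */
RAction recv_step(RState *s, REvent ev)
{
    switch (ev) {
    case REV_STR:  /* local thread posts a receive */
        if (*s == R_EMPTY)    { *s = R_IN_USE;      return RACT_NONE; }
        if (*s == R_RTS_RCVD) { *s = R_IN_PROGRESS; return RACT_SEND_CTS; }
        return RACT_BLOCK;    /* flowchart: SUSPEND */
    case REV_RTS:  /* remote sender announces a message */
        if (*s == R_EMPTY)  { *s = R_RTS_RCVD;    return RACT_NONE; }
        if (*s == R_IN_USE) { *s = R_IN_PROGRESS; return RACT_SEND_CTS; }
        return RACT_BLOCK;    /* flowchart: Block RTS */
    case REV_DATA_TAIL:  /* last flit of the data message stored */
        if (*s == R_EMPTY)       return RACT_DISCARD;
        if (*s == R_IN_PROGRESS) { *s = R_COMPLETE; return RACT_NONE; }
        return RACT_BLOCK;    /* flowchart: Block Data */
    }
    return RACT_NONE;
}
```

Either the RTS or the `str` may come first; the CTS goes out as soon as both have been seen, so the data is never sent before a destination buffer exists.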
30
RTS and CTS Messages also need separate VC paths.
RTS messages can block in the network.
For a given RTS message to leave the network, RTS messages ahead of it must be satisfied
– CTS message to source
– Data message back
RTS and CTS messages have their own VC paths.
[Figure: Router with separate Thread VCs, Data VCs, RTS VCs, and CTS VCs between the West and East ports]
31
Ready Mode Operation at the Sender
[Flowchart. On sendh: No Entry?; Create: Entry->In Use; Queue Head And Tag; SUSPEND. On Head Flit @ Queue Head: No Entry? (ERROR); Entry->In Progress; Send Flit; Tail Flit Not Sent; Entry->Complete. Branches are labeled T and F.]
32
Ready Mode Operation at the Receiver
[Flowchart. On str: No Entry; Record str; Entry->In Use; SUSPEND. On Data Message Arrives: No Entry (DISCARD); In Progress (Block Data); Store Flit; Tail Flit Not Stored; Entry->Complete. Branches are labeled T and F.]
33
Stressmark testing was used to verify that performance was not hurt.
DIS Stressmark Suite
– Neighborhood Stressmark
– Matrix Stressmark
– Transitive Closure Stressmark
LU Factorization Stressmark
34
The neighborhood stressmark measures image texture.
Every node owns a portion of the total rows
Every node owns complete sum and difference histograms
Each node determines, and requests, the pairs for pixels in its rows
Each node fills in sum and difference histograms
Histograms are shared
– Each node manages only a portion of each histogram
– Only the correct portion is sent to a node
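The sum and difference histograms mentioned above can be sketched for a single node as follows. The slide does not say how pixel pairs are chosen, so the fixed `(dx, dy)` offset, the 8-bit pixel depth, and the function name are assumptions made for this example.

```c
#include <string.h>

#define PIX_MAX   255                 /* assumed 8-bit pixels */
#define HIST_BINS (2 * PIX_MAX + 1)   /* sums 0..510, differences -255..255 */

/* For every pixel p and its partner q at offset (dx, dy), counts p + q in
   sum_hist and p - q (shifted by PIX_MAX) in diff_hist. Pairs whose partner
   falls outside the image are skipped. Both arrays hold HIST_BINS entries. */
void sum_diff_histograms(const unsigned char *img, int w, int h,
                         int dx, int dy,
                         unsigned *sum_hist, unsigned *diff_hist)
{
    memset(sum_hist, 0, HIST_BINS * sizeof *sum_hist);
    memset(diff_hist, 0, HIST_BINS * sizeof *diff_hist);

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int x2 = x + dx, y2 = y + dy;
            if (x2 < 0 || x2 >= w || y2 < 0 || y2 >= h)
                continue;
            int p = img[y * w + x];
            int q = img[y2 * w + x2];
            sum_hist[p + q]++;
            diff_hist[p - q + PIX_MAX]++;
        }
    }
}
```

In the parallel version described above, each node would run this loop over its own rows only and then exchange histogram portions with the other nodes.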
35
Queues with 16 flits perform best.
[Figure: speedup versus number of processors (10 to 70) for queue lengths of 2, 4, 8, 16, 32, 64, and 128 flits]
36
The new system outperforms the old under the neighborhood stressmark.
[Figure: three panels (Seven-Bit Pixels, Eleven-Bit Pixels, Fifteen-Bit Pixels), each plotting speedup versus number of processors for Ideal, Original SCMP System, and New SCMP System]
37
Matrix stressmark solves a linear system of equations using the Conjugate Gradient Method.
Ax = b
Additional vectors r and p used for intermediate steps
Every node has:
– Rows of A
– Elements of b and r
– Complete x and p
After each iteration p must be globally redistributed
– Share with columns
– Share with rows
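For reference, a serial Conjugate Gradient solver for Ax = b is sketched below, using the same vector names r and p. It is the textbook algorithm on a small dense matrix, not the distributed stressmark code; the size limit `CG_MAX_N` and the function names are assumptions.

```c
#include <stddef.h>

#define CG_MAX_N 16  /* illustrative size limit for the scratch vectors */

static double dot(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* out = A * v for a row-major n x n matrix. */
static void matvec(const double *A, const double *v, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = dot(&A[i * n], v, n);
}

/* Solves A x = b for symmetric positive-definite A, starting from x = 0.
   Stops when the residual norm drops to tol or after max_iter iterations.
   Returns the number of iterations run, or -1 if n exceeds CG_MAX_N. */
int conjugate_gradient(const double *A, const double *b, double *x,
                       size_t n, int max_iter, double tol)
{
    double r[CG_MAX_N], p[CG_MAX_N], Ap[CG_MAX_N];
    if (n > CG_MAX_N) return -1;

    for (size_t i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
    double rs_old = dot(r, r, n);

    int k = 0;
    while (k < max_iter && rs_old > tol * tol) {
        matvec(A, p, Ap, n);
        double alpha = rs_old / dot(p, Ap, n);
        for (size_t i = 0; i < n; i++) {
            x[i] += alpha * p[i];   /* step along the search direction */
            r[i] -= alpha * Ap[i];  /* update the residual */
        }
        double rs_new = dot(r, r, n);
        for (size_t i = 0; i < n; i++)
            p[i] = r[i] + (rs_new / rs_old) * p[i];  /* next direction */
        rs_old = rs_new;
        k++;
    }
    return k;
}
```

The `matvec` call is the step that needs every element of p on every node, which is why p has to be redistributed after each iteration in the parallel version.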
3838
The new system provides marginal The new system provides marginal improvement over the original under the improvement over the original under the
matrix stressmark.matrix stressmark.
0
10
20
30
40
50
60
70
0 20 40 60 80Number of Processors
Spee
dup Ideal
Original SCMP SystemNew SCMP System
39
The transitive closure stressmark solves the all-pairs shortest-path problem.
Floyd-Warshall Algorithm
– Adjacency Matrix D[i][j]
– Iterative Improvements D[i][j] = min(D[i][j], D[i][k]+D[k][j])
Each node owns sub-block of adjacency matrix
– Each node needs portion of row k
– Each node needs portion of column k
[Figure: adjacency matrix divided into sub-blocks owned by nodes 0 to 15]
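The iterative improvement rule D[i][j] = min(D[i][j], D[i][k]+D[k][j]) corresponds to the following serial Floyd-Warshall loop. The fixed matrix size `FW_N` and the `FW_INF` sentinel for a missing edge are choices made for this sketch.

```c
#define FW_N   4
#define FW_INF 1000000  /* "no path"; small enough that INF + INF cannot overflow */

/* In-place all-pairs shortest paths on an adjacency matrix D, where
   D[i][j] is the edge weight from i to j, or FW_INF if there is no edge. */
void floyd_warshall(int D[FW_N][FW_N])
{
    for (int k = 0; k < FW_N; k++)
        for (int i = 0; i < FW_N; i++)
            for (int j = 0; j < FW_N; j++)
                if (D[i][k] + D[k][j] < D[i][j])
                    D[i][j] = D[i][k] + D[k][j];
}
```

For a given k, updating entry (i, j) reads only D[i][k] and D[k][j], which is why each node needs just its portion of row k and column k in the parallel version.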
40
The new system provides marginal improvement over the original under the transitive closure stressmark.
[Figure: speedup versus number of processors for Ideal, Original SCMP System, and New SCMP System]
41
The LU factorization stressmark is used by linear system solvers.
Factors matrix into a lower triangular matrix and an upper triangular matrix.
Matrix is divided into blocks
Pivot block is factored
Pivot column and row blocks are divided by pivot.
Inner active matrix blocks are modified by the pivot row and column blocks.
[Figure: blocked matrix showing the Pivot, Pivot Row, Pivot Column, and Inner Active Matrix]
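The same pivot, column, and inner-matrix steps can be seen in an unblocked LU factorization, which is the block algorithm with a block size of one. This sketch uses the Doolittle form without pivoting on a fixed-size matrix; `LU_N` and the function name are assumptions, and the stressmark's blocked, distributed version is not reproduced here.

```c
#define LU_N 3

/* Factors A in place into L (unit lower triangle, stored below the diagonal)
   and U (stored on and above the diagonal). No row exchanges are done.
   Returns 0 on success, or -1 if a zero pivot is encountered. */
int lu_factor(double A[LU_N][LU_N])
{
    for (int k = 0; k < LU_N; k++) {
        if (A[k][k] == 0.0)
            return -1;
        for (int i = k + 1; i < LU_N; i++) {
            A[i][k] /= A[k][k];                 /* pivot column / pivot */
            for (int j = k + 1; j < LU_N; j++)
                A[i][j] -= A[i][k] * A[k][j];   /* update inner active matrix */
        }
    }
    return 0;
}
```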
42
The new system outperforms the original under the LU factorization stressmark.
[Figure: two panels (Rendezvous Mode, Ready Mode), each plotting speedup versus number of processors for Ideal, Original SCMP System, and New SCMP System]
43
Send-and-Receive Messaging for SCMP is worthwhile.
Fixes Problems With Original SCMP Messaging System
– Global Buffer Pointers
– Races between Data and SYNC messages
– Request Thread Storms
Programming Model is more familiar
Performance is better
Questions?