RDMA Read Based Rendezvous Protocol for MPI
over InfiniBand: Design Alternatives and Benefits
Sayantan Sur Hyun-Wook Jin Lei Chai
D. K. Panda
Network Based Computing Lab,
The Ohio State University
Presentation Outline
• Introduction and Motivation
• Problem Statement
• Detailed Design Description
• Design Evaluation Framework
• Micro-benchmark Level Evaluation
• Application Level Evaluation
• Conclusions and Future Work
Introduction
• MPI is a popular parallel programming model
• Offers several point-to-point communication semantics
  – Non-blocking (MPI_Isend, MPI_Irecv …)
  – Blocking (MPI_Send, MPI_Recv …)
  – Synchronous (MPI_Ssend, MPI_Issend …)
• Non-blocking point-to-point communication is especially popular among application writers
Why are Non-Blocking Semantics Popular?
• Sending and receiving processes can progress independently without blocking
• Enables “Computation/Communication” overlap
• Several other parallel programming models also feature non-blocking semantics
  – PGAS {UPC, HPF}
  – ARMCI
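As a concrete illustration of this overlap pattern, here is a minimal C sketch of non-blocking communication; the message size, peer rank, and do_compute() routine are illustrative placeholders, not taken from the paper.

```c
/* Minimal sketch of computation/communication overlap with
 * non-blocking MPI. Message size, peer, and do_compute() are
 * illustrative placeholders. */
#include <mpi.h>

#define N (1 << 20)

extern void do_compute(void);   /* hypothetical application work */

void exchange(double *sendbuf, double *recvbuf, int peer)
{
    MPI_Request reqs[2];

    /* Post both operations up front; neither call blocks. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    do_compute();   /* overlap window: CPU works while data (ideally) moves */

    /* Completion is only guaranteed after the wait. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```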
Message Passing Protocols

• MPI utilizes two major types of protocols (a selection sketch follows the figure below)
  – Eager
    • Used for small messages (buffered)
  – Rendezvous
    • Used for large messages (un-buffered)
    • Reduces the memory requirement of the MPI library
[Figure: message timelines between sender and receiver. Eager Protocol: the sender pushes the data with a single Send. Rendezvous Protocol: a handshake of RNDZ_START, RNDZ_REPLY, DATA, and FIN messages.]
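The split above implies a sender-side decision rule. Below is a hypothetical sketch of how an MPI library might choose between the two protocols; the threshold value and helper names are illustrative assumptions, not MVAPICH's internals.

```c
/* Hypothetical sender-side protocol selection. The threshold and the
 * eager_send()/rndz_start() helpers are illustrative assumptions. */
#include <stddef.h>

#define RNDZ_THRESHOLD (8 * 1024)   /* assumed eager/rendezvous switch point */

extern void eager_send(const void *buf, size_t len, int dest);  /* hypothetical */
extern void rndz_start(const void *buf, size_t len, int dest);  /* hypothetical */

void mpi_internal_send(const void *buf, size_t len, int dest)
{
    if (len <= RNDZ_THRESHOLD) {
        /* Eager: copy into a pre-registered buffer and send immediately;
         * the receiver buffers the data if no matching recv is posted. */
        eager_send(buf, len, dest);
    } else {
        /* Rendezvous: send only the RNDZ_START control message and defer
         * the payload until the receive buffer is known (zero copy). */
        rndz_start(buf, len, dest);
    }
}
```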
Is Overlap Always Possible?

[Figure: rendezvous timelines in which both processes compute after posting MPI_Isend/MPI_Irecv. The RNDZ_START, RNDZ_REPLY, DATA, and FIN messages only make progress once a process enters MPI_Wait, so the communication is serialized after the computation instead of being overlapped with it.]
How can InfiniBand help?

• InfiniBand is an industry-standard HPC interconnect
• Very good performance with many features
  – Minimum latency: ~2us; peak bandwidth: ~1500MB/s
  – One-sided RDMA (Remote DMA) and atomic operations
  – Hardware multicast, Quality of Service …
• RDMA is a powerful mechanism
  – Zero copy (the network can DMA directly from user buffers; see the registration sketch below)
  – No remote-side involvement
  – Both Write and Read semantics are supported
• We need to design a Rendezvous protocol that leverages these novel InfiniBand features to achieve computation/communication overlap
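Zero copy hinges on memory registration: the user buffer must be pinned and given access keys before the HCA can DMA it directly. A minimal verbs sketch, assuming a protection domain pd already exists:

```c
/* Sketch: zero-copy RDMA operates on registered user memory.
 * Registration pins the pages and returns the lkey/rkey used by
 * later RDMA operations, including the rendezvous data transfer. */
#include <stddef.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_user_buffer(struct ibv_pd *pd, void *buf, size_t len)
{
    /* IBV_ACCESS_REMOTE_READ lets the peer issue RDMA Read against
     * this buffer with no involvement from our CPU. */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
}
```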
Presentation Outline
• Introduction and Motivation
• Problem Statement
• Detailed Design Description
• Design Evaluation Framework
• Micro-benchmark Level Evaluation
• Application Level Evaluation
• Conclusions and Future Work
Problem Statement
• Can we design a Rendezvous protocol which can achieve full overlap of computation and communication?
• Can this new protocol reduce the communication time experienced by end applications?
Presentation Outline
• Introduction and Motivation
• Problem Statement
• Detailed Design Description
• Design Evaluation Framework
• Micro-benchmark Level Evaluation
• Application Level Evaluation
• Conclusions and Future Work
Design Overview
• Design a new RDMA Read based Rendezvous protocol
  – Minimize control messages
• Trigger "automatic" progress with interrupts
  – Interrupts are costly (~2x the round-trip latency)
  – Reduce interrupts using:
    • Selective Interrupts
    • Interrupt Suppression
    • Dynamic Interrupt Requests
• Hybrid Communication Progress
  – Maintain the polling nature of MPI progress (where possible) to keep latency low
Rendezvous Protocol: Design Alternatives
• The MPI specification allows the receiver to post a buffer larger than the actual message
• Only the sender knows the actual size of the message and can therefore make the optimal protocol decision:
  – Eager (buffered) if the message is small
  – Rendezvous (un-buffered) if the message is large
• Hence, the Rendezvous protocol must be initiated by the sender
RDMA Write vs. RDMA Read
• The RDMA Read based protocol needs fewer control messages
• The sender can embed its buffer information in the RNDZ_START message
[Figure: RDMA Write based protocol (left) exchanges RNDZ_START, RNDZ_REPLY, DATA, and FIN; RDMA Read based protocol (right) eliminates RNDZ_REPLY and needs only RNDZ_START, DATA, and FIN.]
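Below is a sketch, in InfiniBand verbs, of the receiver-side RDMA Read this protocol performs. The rndz_start_t layout is our assumption for how the sender's buffer address and rkey travel inside RNDZ_START; the actual MVAPICH structures differ.

```c
/* Receiver pulls the payload with an RDMA Read once RNDZ_START arrives.
 * rndz_start_t is a hypothetical control-message layout. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

typedef struct {
    uint64_t addr;   /* sender's registered buffer address */
    uint32_t rkey;   /* remote key for that registration   */
    uint32_t len;    /* message length                     */
} rndz_start_t;

int post_rdma_read(struct ibv_qp *qp, const rndz_start_t *rs,
                   void *local_buf, struct ibv_mr *local_mr)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = rs->len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;   /* receiver pulls the data */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* completion => send FIN  */
    wr.wr.rdma.remote_addr = rs->addr;
    wr.wr.rdma.rkey        = rs->rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```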
RDMA Read with Interrupt
• An interrupt triggers communication progress
• This enables overlap of computation and communication on the receiver side
• The overhead caused by interrupts needs to be reduced
[Figure: receiver-side timelines. RDMA Read based protocol: the receiver discovers RNDZ_START only upon entering MPI_Wait, so the RDMA Read follows the compute phase. RDMA Read with Interrupt based protocol: the interrupt raised by RNDZ_START starts the RDMA Read during the receiver's compute phase, overlapping communication with computation.]
Interrupt Reduction Techniques
• Selective Interrupts
  – Only RNDZ_START messages cause interrupts
• Interrupt Suppression
  – Once awake, the interrupt handler processes as many RNDZ_START messages as it can find
  – Back-to-back messages don't cause additional interrupts
• Dynamic Interrupt Requests
  – Interrupts are enabled only when large receives are posted
  – Unexpected RNDZ_START messages don't cause interrupts
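One plausible realization of the first two techniques uses the verbs completion-event machinery: arm the CQ for solicited completions only (so only sends flagged IBV_SEND_SOLICITED, such as RNDZ_START, raise events), and drain the CQ before re-arming. This is a sketch under those assumptions, not MVAPICH's exact code.

```c
/* Selective interrupts + interrupt suppression with completion events. */
#include <infiniband/verbs.h>

extern void handle_completion(struct ibv_wc *wc);   /* hypothetical */

void progress_event_loop(struct ibv_comp_channel *chan, struct ibv_cq *cq)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    /* Selective interrupts: request events for solicited completions only,
     * so only messages the sender flagged (e.g. RNDZ_START) wake us. */
    ibv_req_notify_cq(cq, 1 /* solicited_only */);

    for (;;) {
        if (ibv_get_cq_event(chan, &ev_cq, &ev_ctx))  /* blocks: the "interrupt" */
            break;
        ibv_ack_cq_events(ev_cq, 1);

        /* Interrupt suppression: drain every pending completion while awake,
         * so back-to-back RNDZ_STARTs cost a single interrupt. */
        while (ibv_poll_cq(cq, 1, &wc) > 0)
            handle_completion(&wc);

        ibv_req_notify_cq(cq, 1);   /* re-arm before sleeping again */
    }
}
```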
Hybrid Communication Progress
• Progress engine has an impact on MPI performance
  Progress Engine    Latency    Progress Rate
  ---------------    -------    -------------
  Interrupt Based    High       Good
  Polling Based      Low        Bad
  Hybrid             Low        Good
• The hybrid progress engine allows two progress threads to execute simultaneously
• When no "progress critical" events occur, no extra interrupts are generated
• The progress engine was re-designed to be thread safe
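A sketch of how such a hybrid engine could be structured: the application thread keeps the low-latency polling path inside MPI_Wait/MPI_Test, while a helper thread blocks on completion events; a lock makes the shared progress state thread safe. Names and structure are illustrative assumptions, not MVAPICH's implementation.

```c
/* Hybrid progress: polling path for latency, event thread for progress. */
#include <pthread.h>
#include <infiniband/verbs.h>

extern struct ibv_cq *cq;                           /* assumed shared state */
extern struct ibv_comp_channel *chan;
extern void handle_completion(struct ibv_wc *wc);   /* hypothetical */

static pthread_mutex_t progress_lock = PTHREAD_MUTEX_INITIALIZER;

static void drain_cq(void)
{
    struct ibv_wc wc;
    pthread_mutex_lock(&progress_lock);             /* engine is thread safe */
    while (ibv_poll_cq(cq, 1, &wc) > 0)
        handle_completion(&wc);
    pthread_mutex_unlock(&progress_lock);
}

/* Called from MPI_Test/MPI_Wait on the application thread: pure polling,
 * so short messages keep their low latency. */
void mpi_poll_progress(void) { drain_cq(); }

/* Helper thread: sleeps until the HCA raises a (selective) interrupt,
 * then drains the CQ, so progress happens even during long computation. */
void *async_progress_thread(void *arg)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;

    (void)arg;
    for (;;) {
        ibv_req_notify_cq(cq, 1);                   /* solicited only */
        if (ibv_get_cq_event(chan, &ev_cq, &ev_ctx))
            break;
        ibv_ack_cq_events(ev_cq, 1);
        drain_cq();
    }
    return NULL;
}
```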
Presentation Outline
• Introduction and Motivation
• Problem Statement
• Detailed Design Description
• Design Evaluation Framework
• Micro-benchmark Level Evaluation
• Application Level Evaluation
• Conclusions and Future Work
OSU MPI over InfiniBand

• High-performance implementations
  – MPI-1 (MVAPICH)
  – MPI-2 (MVAPICH2)
• Open source (BSD licensing)
• Has enabled a large number of production IB clusters all over the world to take advantage of IB
  – Largest being the Sandia Thunderbird Cluster (4,000 nodes with 8,000 processors)
• Directly downloaded and used by more than 335 organizations worldwide (in 33 countries)
  – Time-tested and stable code base with novel features
• Available in the software stack distributions of many vendors
• Available in the OpenIB/gen2 stack
• More details at http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
Evaluation Framework
• The proposed designs were incorporated into MVAPICH 0.9.5
  – RDMA Write (RDMA-W)
  – RDMA Read (RDMA-R)
  – RDMA Read with Interrupt (RDMA-RI)
• The RDMA-R protocol is available from version 0.9.6
• The RDMA-RI protocol will be available from version 0.9.8
Presentation Outline
• Introduction and Motivation
• Problem Statement
• Detailed Design Description
• Design Evaluation Framework
• Micro-benchmark Level Evaluation
• Application Level Evaluation
• Conclusions and Future Work
Experimental Evaluation
• Micro-benchmark tests
  – Computation/communication overlap performance
  – Communication progress performance
    • Measured with time stamps from the overlap test
• Evaluation platforms
  – Cluster A: 8 nodes, dual 3.0 GHz SMP, 2GB RAM, PCI-X
  – Cluster B: 32 nodes, dual 2.6 GHz SMP, 2GB RAM, PCI-X
  – Mellanox InfiniBand adapters (MT23108)
  – Mellanox 144-port InfiniBand switch (MTS14400)
Micro-benchmark Tests

• Sender overlap test: a computation of duration W is inserted between MPI_Isend and MPI_Wait; T is the total elapsed time at the sender
  [Figure: sender timeline annotated with W and T]
• Receiver overlap test: a computation of duration W is inserted between MPI_Irecv and MPI_Wait; T is the total elapsed time at the receiver
  [Figure: receiver timeline annotated with W and T]
• The computation/communication overlap ratio is W/T
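To make the definition concrete, here is a sketch of the receiver-side test: insert W microseconds of work between MPI_Irecv and MPI_Wait, measure the total time T, and report W/T; a ratio near 1 means the transfer was fully hidden behind the computation. The message size and busy-wait loop are our assumptions, not the paper's code.

```c
/* Receiver-side overlap ratio: W/T as defined above. */
#include <mpi.h>

#define MSG_DOUBLES (1 << 20)

static void compute_usec(double usec)   /* busy loop standing in for work */
{
    double t0 = MPI_Wtime();
    while ((MPI_Wtime() - t0) * 1e6 < usec)
        ;
}

double receiver_overlap_ratio(double *buf, int sender, double w_usec)
{
    MPI_Request req;
    double t0 = MPI_Wtime();

    MPI_Irecv(buf, MSG_DOUBLES, MPI_DOUBLE, sender, 0, MPI_COMM_WORLD, &req);
    compute_usec(w_usec);                       /* W: overlapped computation */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    double t_usec = (MPI_Wtime() - t0) * 1e6;   /* T: total elapsed time */
    return w_usec / t_usec;                     /* overlap ratio W/T     */
}
```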
Computation/Communication Overlap Performance
[Graphs: overlap ratio (0-1.2) vs. compute time (50-400 us) for RDMA-W, RDMA-R, and RDMA-RI; left panel: sender overlap, right panel: receiver overlap]
• Sender overlap:
  – RDMA-W has poor overlap due to its inability to discover the RNDZ_REPLY message until the computation is over
  – RDMA-R and RDMA-RI achieve nearly complete overlap
• Receiver overlap:
  – RDMA-W and RDMA-R have poor overlap due to their inability to discover the rendezvous control messages (RNDZ_REPLY and RNDZ_START, respectively) until the computation is over
  – RDMA-RI achieves nearly complete overlap
Communication Progress Performance
[Charts: time spent in the Compute, Comm, and Wait phases (time axis 1500-6000) for RDMA-W, RDMA-R, and RDMA-RI, in the sender overlap (left) and receiver overlap (right) tests]
• Time stamps are taken during the sender/receiver overlap tests when the application enters the compute/communication phase, and from within the MPI library when the application enters MPI_Wait
• RDMA-RI achieves 50% faster communication in both the sender and receiver overlap tests
Presentation Outline
• Introduction and Motivation
• Problem Statement
• Detailed Design Description
• Design Evaluation Framework
• Micro-benchmark Level Evaluation
• Application Level Evaluation
• Conclusions and Future Work
Application Level Evaluation

• Two well-known applications
  – High Performance Linpack (HPL)
  – NAS Scalar Pentadiagonal (SP)
• Both predominantly use MPI_Isend/MPI_Irecv
• Time spent in the MPI library is profiled using mpiP (a lightweight MPI profiling tool)
• This wait time could instead be used by the application to compute, rather than just waiting for network operations to complete
Application Level Results
[Graphs: time (seconds) spent waiting in MPI vs. number of processes for RDMA-W, RDMA-R, and RDMA-RI; left panel: HPL (16, 24, 32 processes), right panel: NAS SP (16, 25, 36 processes)]
• Wait time for HPL
  – Reduced by ~30% for 32 processes by RDMA-R and RDMA-RI
• Wait time for NAS SP
  – Reduced by ~28% for 36 processes by RDMA-R and RDMA-RI
Presentation Outline
• Introduction and Motivation
• Problem Statement
• Detailed Design Description
• Design Evaluation Framework
• Micro-benchmark Level Evaluation
• Application Level Evaluation
• Conclusions and Future Work
Conclusions and Future Work
• The new designs achieve nearly complete overlap of computation and communication
• Communication progress is sped up by 50%
• Application (HPL, NAS SP) wait times are reduced by 30% and 28%, respectively
• A unique study of the Rendezvous protocol and its effect on computation/communication overlap using RDMA
• Future work
  – A more exhaustive application-oriented study on larger-scale InfiniBand clusters
Acknowledgements
Web Pointers

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/