Designing Next Generation Clusters: Evaluation of InfiniBand DDR/QDR on Intel Computing Platforms
Hari Subramoni, Matthew Koop and Dhabaleswar K. Panda
Computer Science & Engineering Department, The Ohio State University
HotI '09
Outline
• Introduction and Motivation
• Background
• Experimental Setup
• Microbenchmark Level Evaluation
• Communication Balance
• Application Level Evaluation
• Conclusions and Future Work
Motivation
• Commodity clusters are becoming more popular
• Balance needs to be maintained between
  – Computational and I/O capabilities
  – Intra-Node and Inter-Node communication performance
• This balance keeps changing with the advent of new processor and interconnect technologies
• With the arrival of Intel's latest Nehalem series of processors, the balance has changed again
• A need exists to qualitatively and quantitatively explore how the new Nehalem processor, together with high-speed InfiniBand interconnects, affects communication balance
Background
• The Intel Nehalem Processor
  – Intel's first native quad-core processor, with a shared L3 cache
  – 45 nm manufacturing process
  – Integrated memory controller supporting multiple memory channels gives very high memory bandwidth
  – Uses QuickPath Interconnect technology
  – HyperThreading allows seamless execution of multiple threads per core
  – Turbo Boost technology allows automatic overclocking of the processor
[Diagram: Nehalem processor layout – four cores (Core 0 … Core 3), each with a 32 KB L1D cache and a 256 KB L2 cache, a shared 8 MB L3 cache, an integrated DDR3 memory controller, and the QuickPath Interconnect]
Background (Cont)
• InfiniBand Architecture
  – An industry standard for low-latency, high-bandwidth System Area Networks
  – Two communication types (illustrated at the MPI level in the sketch after this list)
    • Channel semantics
    • Memory semantics
  – High data rates
    • Quad Data Rate (QDR) – 40 Gbps
    • Double Data Rate (DDR) – 20 Gbps
  – Multiple virtual lanes
  – Capable of end-to-end QoS
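The two communication types map naturally onto the MPI operations used in the rest of this evaluation: channel semantics resemble two-sided send/receive, while memory semantics resemble one-sided RMA that writes directly into remote memory. The C/MPI sketch below contrasts the two; it is an illustrative analogy only (not InfiniBand verbs code). Compile with mpicc and run with two ranks.

```c
/* Illustrative sketch: channel semantics (two-sided send/recv) vs.
 * memory semantics (one-sided put), expressed with MPI for clarity.
 * Assumes exactly 2 ranks: mpirun -np 2 ./semantics */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Channel semantics: the receiver posts a matching receive (like an
     * InfiniBand receive work request) that consumes the sender's data. */
    if (rank == 0) {
        buf = 42;
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("channel semantics: received %d\n", buf);
    }

    /* Memory semantics: the origin writes directly into the target's
     * exposed window (like an RDMA write); no matching receive is posted. */
    int win_buf = 0, val = 99;
    MPI_Win win;
    MPI_Win_create(&win_buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);
    if (rank == 1)
        printf("memory semantics: window now holds %d\n", win_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```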
Experimental Testbed
• Experimental Platforms
  – Intel Clovertown
    • Dual quad-core Intel Xeon E5345 processors operating at 2.33 GHz
    • 6 GB RAM, 4 MB cache
    • PCIe 1.1 interface
  – Intel Harpertown
    • Dual quad-core processors operating at 2.83 GHz
    • 8 GB RAM
    • PCIe 2.0 interface
  – Intel Nehalem
    • Dual quad-core Intel Xeon E5530 processors operating at 2.40 GHz
    • 12 GB RAM, 8 MB cache
    • PCIe 2.0 interface
Experimental Testbed (Cont)
• InfiniBand Host Channel Adapters
  – Dual-port ConnectX DDR adapter
  – Dual-port ConnectX QDR adapter
• InfiniBand Switches
  – Flextronics 144-port DDR switch
  – Mellanox 24-port QDR switch
• OpenFabrics Enterprise Distribution (OFED) 1.4.1 drivers
• Red Hat Enterprise Linux 4U4
• MPI stack used – MVAPICH2-1.2p1
• Legend
  – NH-QDR – Intel Nehalem machines using ConnectX QDR HCAs
  – NH-DDR – Intel Nehalem machines using ConnectX DDR HCAs
  – HT-QDR – Intel Harpertown machines using ConnectX QDR HCAs
  – HT-DDR – Intel Harpertown machines using ConnectX DDR HCAs
  – CT-DDR – Intel Clovertown machines using ConnectX DDR HCAs
MVAPICH / MVAPICH2 Software
• High Performance MPI Library for IB and 10GE
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
  – Used by more than 960 organizations in 51 countries
  – More than 31,000 downloads directly from the OSU site
  – Empowering many TOP500 clusters
    • 8th-ranked 62,976-core cluster (Ranger) at TACC
  – Available with the software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
  – Also supports the uDAPL device to work with any network supporting uDAPL
  – http://mvapich.cse.ohio-state.edu/
List of Benchmarks
• OSU Microbenchmarks (OMB) – a simplified sketch of the ping-pong pattern used by its latency test follows this list
  – Version 3.1.1
  – http://mvapich.cse.ohio-state.edu/benchmarks/
• Intel MPI Benchmarks (IMB)
  – Version 3.2
  – http://software.intel.com/en-us/articles/intel-mpi-benchmarks/
• HPC Challenge Benchmark (HPCC)
  – Version 1.3.1
  – http://icl.cs.utk.edu/hpcc/
• NAS Parallel Benchmarks (NPB)
  – Version 3.3
  – http://www.nas.nasa.gov/
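The OSU latency and bandwidth tests reported on the next slides follow a simple ping-pong pattern between two processes. The sketch below is a simplified illustration of that measurement loop (not the actual OMB source; the message size and iteration counts are example values). Running it with both ranks on one node measures intra-node latency; running it with one rank per node measures inter-node latency.

```c
/* Simplified ping-pong latency sketch in the spirit of osu_latency
 * (illustrative only). Run with exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define MSG_SIZE   8        /* bytes per message (example value)  */
#define SKIP       100      /* warm-up iterations, not timed      */
#define ITERS      10000    /* timed iterations                   */

int main(int argc, char **argv)
{
    char buf[MSG_SIZE] = {0};
    int rank;
    double start = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < SKIP + ITERS; i++) {
        if (i == SKIP)
            start = MPI_Wtime();          /* start timing after warm-up */
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        /* one-way latency = half the average round-trip time, in us */
        double latency = (MPI_Wtime() - start) * 1e6 / (2.0 * ITERS);
        printf("%d bytes: %.2f us\n", MSG_SIZE, latency);
    }

    MPI_Finalize();
    return 0;
}
```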
Microbenchmark Level Evaluation – Inter-Node Latency
[Figure: Inter-node latency in us vs. message size (small messages, left; large messages, right) for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR]
• Nehalem systems show higher small-message latency due to the slower clock rate of the machines in our cluster
• Medium to large messages show the true capability of the Nehalem systems
Inter-Node Bandwidth
[Figure: Inter-node unidirectional bandwidth (left) and bidirectional bandwidth (right) in MBps vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR]
• NH-QDR gives up to an 18% improvement in unidirectional bandwidth over HT-QDR
Intra-Node Latency
[Figure: Intra-node latency in us vs. message size (small messages, left; large messages, right) for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR]
• Nehalem systems give up to a 45% improvement in Intra-Node latency
Intra-Node Bandwidth
[Figure: Intra-node unidirectional bandwidth (left) and bidirectional bandwidth (right) in MBps vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR]
• Intra-Node bandwidth and bidirectional bandwidth results show the high memory bandwidth of the Nehalem systems
• The drop in performance for the Nehalem systems at large message sizes is due to cache collisions
Collective Performance – Alltoall Latency
[Figure: MPI_Alltoall latency in us vs. message size (small messages, left; large messages, right) for NH-QDR, NH-DDR and CT-DDR]
• The 32-process Alltoall latency numbers show a 43% to 55% improvement from using a QDR HCA over a DDR HCA (a minimal sketch of the timed Alltoall pattern is shown below)
• Harpertown numbers are not shown due to an insufficient number of nodes
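For reference, the IMB Alltoall measurement essentially times repeated MPI_Alltoall calls across all processes. The C/MPI sketch below is an illustrative approximation of that pattern (the per-peer COUNT and iteration count are example values, not IMB's defaults).

```c
/* Minimal MPI_Alltoall timing sketch in the spirit of IMB Alltoall
 * (illustrative only). Each rank exchanges COUNT ints with every
 * other rank; the average time is reported by rank 0. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT  256   /* ints sent to each peer (example value) */
#define ITERS  100

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc((size_t)size * COUNT * sizeof(int));
    int *recvbuf = malloc((size_t)size * COUNT * sizeof(int));
    for (int i = 0; i < size * COUNT; i++)
        sendbuf[i] = rank;

    MPI_Barrier(MPI_COMM_WORLD);           /* line all ranks up first */
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Alltoall(sendbuf, COUNT, MPI_INT,
                     recvbuf, COUNT, MPI_INT, MPI_COMM_WORLD);
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("avg MPI_Alltoall time: %.2f us\n", elapsed * 1e6 / ITERS);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```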
Communication Balance Ratio
• Intra-Node communication gives better performance than Inter-Node communication for small to medium sized messages
• This makes the physical location of a process very important in parallel computing
• A good supercomputing cluster is one where Intra-Node performance is the same as Inter-Node performance (see the ratio definitions below)
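Concretely, the balance ratios plotted on the following slides are, as the axis labels indicate, defined per message size m:

  Latency ratio:                 R_lat(m)  = L_intra(m) / L_inter(m)
  Bandwidth ratio:               R_bw(m)   = BW_intra(m) / BW_inter(m)
  Bidirectional bandwidth ratio: R_bibw(m) = Bi_BW_intra(m) / Bi_BW_inter(m)
  Multipair bandwidth ratio:     R_mp(m)   = MP_BW_intra(m) / MP_BW_inter(m)

A perfectly balanced system keeps each ratio at 1 for all message sizes; a latency ratio below 1 (or a bandwidth ratio above 1) indicates that Intra-Node communication outperforms Inter-Node communication.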
Communication Balance Ratio – Latency
[Figure: Communication balance ratio for latency (L_intra / L_inter) vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR, with a line marking a balanced system]
• The black line indicates the ratio displayed by a balanced system
Communication Balance Ratio – Bandwidth
[Figure: Communication balance ratio for bandwidth (BW_intra / BW_inter) vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR, with a line marking a balanced system]
• The black line indicates the ratio displayed by a balanced system
Communication Balance Ratio – Bidirectional Bandwidth
[Figure: Communication balance ratio for bidirectional bandwidth (Bi_BW_intra / Bi_BW_inter) vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR, with a line marking a balanced system]
Communication Balance Ratio – Multipair Bandwidth
[Figure: Communication balance ratio for multipair bandwidth (MP_BW_intra / MP_BW_inter) vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR, with a line marking a balanced system]
Application Level Evaluation – HPCC
[Figure: HPCC Minimum Ping Pong Bandwidth in MBps for HT-QDR, HT-DDR, NH-QDR, NH-DDR and the Baseline]
• Baseline numbers are taken on CT-DDR
• NH-DDR shows a 13% improvement in performance over the Harpertown and Clovertown systems
• NH-QDR shows a 38% improvement in performance over the NH-DDR systems
Application Level Evaluation – HPCC (Cont)
[Figure: HPCC Naturally Ordered Ring Bandwidth (left) and Randomly Ordered Ring Bandwidth (right) in MBps for HT-QDR, HT-DDR, NH-QDR, NH-DDR and the Baseline]
• Up to 190% improvement in Naturally Ordered Ring bandwidth for NH-QDR
• Up to 130% improvement in Randomly Ordered Ring bandwidth for NH-QDR
Performance of NAS Benchmarks
[Figure: Normalized execution time of the NAS Parallel Benchmarks CG, FT, IS, LU and MG with 32 processes, Class B (left) and Class C (right), for NH-DDR and NH-QDR]
• NH-QDR shows clear benefits over NH-DDR for multiple applications
Conclusions & Future Work
• Evaluated multiple computational platforms with multiple InfiniBand HCAs
• Nehalem systems with the QDR interconnect give the best performance
• Proposed the communication balance metric to identify the best components for designing next-generation clusters
• Nehalem systems with QDR interconnects offer the best communication balance
• We plan to perform larger-scale evaluations and study the impact of these systems on the performance of end applications