Designing Next Generation Clusters: Evaluation of InfiniBand DDR/QDR on Intel Computing Platforms
Hari Subramoni, Matthew Koop and Dhabaleswar K. Panda
Computer Science & Engineering Department, The Ohio State University
HotI '09
Outline
• Introduction and Motivation
• Background
• Experimental Setup
• Microbenchmark Level Evaluation
• Communication Balance
• Application Level Evaluation
• Conclusions and Future Work
Motivation
• Commodity clusters are becoming more popular
• Balance needs to be maintained between
  – Computational and I/O capabilities
  – Intra-Node and Inter-Node communication performance
• This balance keeps changing with the advent of new processor and interconnect technologies
• With the arrival of Intel's latest Nehalem series of processors, the balance has changed again
• A need exists to qualitatively and quantitatively explore how the new Nehalem processor, together with high-speed InfiniBand interconnects, affects communication balance
Background
• The Intel Nehalem Processor
  – Intel's first native quad-core processor, with a shared L3 cache
  – 45 nm manufacturing process
  – Integrated memory controller supporting multiple memory channels gives very high memory bandwidth
  – Uses QuickPath Interconnect technology
  – HyperThreading allows seamless execution of multiple threads per core
  – Turbo Boost technology allows automatic overclocking of the processor
[Diagram: Nehalem processor layout – four cores (Core 0 … Core 3), each with a 32 KB L1D cache and a 256 KB L2 cache, a shared 8 MB L3 cache, an integrated DDR3 memory controller, and the QuickPath Interconnect]
Background (Cont)
• InfiniBand Architecture
  – An industry standard for low-latency, high-bandwidth System Area Networks
  – Two communication types (illustrated at the MPI level in the sketch after this list)
    • Channel semantics
    • Memory semantics
  – High data rates
    • Quad Data Rate (QDR) – 40 Gbps
    • Double Data Rate (DDR) – 20 Gbps
  – Multiple virtual lanes
  – Capable of end-to-end QoS
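The two communication types map naturally onto the MPI operations used in the rest of this evaluation: channel semantics resemble two-sided send/receive, while memory semantics resemble one-sided RMA that writes directly into remote memory. The C/MPI sketch below contrasts the two; it is an illustrative analogy only (not InfiniBand verbs code). Compile with mpicc and run with two ranks.

```c
/* Illustrative sketch: channel semantics (two-sided send/recv) vs.
 * memory semantics (one-sided put), expressed with MPI for clarity.
 * Assumes exactly 2 ranks: mpirun -np 2 ./semantics */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Channel semantics: the receiver posts a matching receive (like an
     * InfiniBand receive work request) that consumes the sender's data. */
    if (rank == 0) {
        buf = 42;
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("channel semantics: received %d\n", buf);
    }

    /* Memory semantics: the origin writes directly into the target's
     * exposed window (like an RDMA write); no matching receive is posted. */
    int win_buf = 0, val = 99;
    MPI_Win win;
    MPI_Win_create(&win_buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);
    if (rank == 1)
        printf("memory semantics: window now holds %d\n", win_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```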
Experimental Testbed
• Experimental Platforms
  – Intel Clovertown
    • Dual quad-core Intel Xeon E5345 processors operating at 2.33 GHz
    • 6 GB RAM, 4 MB cache
    • PCIe 1.1 interface
  – Intel Harpertown
    • Dual quad-core processors operating at 2.83 GHz
    • 8 GB RAM
    • PCIe 2.0 interface
  – Intel Nehalem
    • Dual quad-core Intel Xeon E5530 processors operating at 2.40 GHz
    • 12 GB RAM, 8 MB cache
    • PCIe 2.0 interface
Experimental Testbed (Cont)
• InfiniBand Host Channel Adapters
  – Dual-port ConnectX DDR adapter
  – Dual-port ConnectX QDR adapter
• InfiniBand Switches
  – Flextronics 144-port DDR switch
  – Mellanox 24-port QDR switch
• OpenFabrics Enterprise Distribution (OFED) 1.4.1 drivers
• Red Hat Enterprise Linux 4U4
• MPI stack used – MVAPICH2-1.2p1
• Legend
  – NH-QDR – Intel Nehalem machines using ConnectX QDR HCAs
  – NH-DDR – Intel Nehalem machines using ConnectX DDR HCAs
  – HT-QDR – Intel Harpertown machines using ConnectX QDR HCAs
  – HT-DDR – Intel Harpertown machines using ConnectX DDR HCAs
  – CT-DDR – Intel Clovertown machines using ConnectX DDR HCAs
MVAPICH / MVAPICH2 Software
• High Performance MPI Library for IB and 10GE
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
  – Used by more than 960 organizations in 51 countries
  – More than 31,000 downloads directly from the OSU site
  – Empowering many TOP500 clusters
    • 8th-ranked 62,976-core cluster (Ranger) at TACC
  – Available with the software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
  – Also supports the uDAPL device to work with any network supporting uDAPL
  – http://mvapich.cse.ohio-state.edu/
List of Benchmarks
• OSU Microbenchmarks (OMB) – a simplified sketch of the ping-pong pattern used by its latency test follows this list
  – Version 3.1.1
  – http://mvapich.cse.ohio-state.edu/benchmarks/
• Intel MPI Benchmarks (IMB)
  – Version 3.2
  – http://software.intel.com/en-us/articles/intel-mpi-benchmarks/
• HPC Challenge Benchmark (HPCC)
  – Version 1.3.1
  – http://icl.cs.utk.edu/hpcc/
• NAS Parallel Benchmarks (NPB)
  – Version 3.3
  – http://www.nas.nasa.gov/
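The OSU latency and bandwidth tests reported on the next slides follow a simple ping-pong pattern between two processes. The sketch below is a simplified illustration of that measurement loop (not the actual OMB source; the message size and iteration counts are example values). Running it with both ranks on one node measures intra-node latency; running it with one rank per node measures inter-node latency.

```c
/* Simplified ping-pong latency sketch in the spirit of osu_latency
 * (illustrative only). Run with exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define MSG_SIZE   8        /* bytes per message (example value)  */
#define SKIP       100      /* warm-up iterations, not timed      */
#define ITERS      10000    /* timed iterations                   */

int main(int argc, char **argv)
{
    char buf[MSG_SIZE] = {0};
    int rank;
    double start = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < SKIP + ITERS; i++) {
        if (i == SKIP)
            start = MPI_Wtime();          /* start timing after warm-up */
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        /* one-way latency = half the average round-trip time, in us */
        double latency = (MPI_Wtime() - start) * 1e6 / (2.0 * ITERS);
        printf("%d bytes: %.2f us\n", MSG_SIZE, latency);
    }

    MPI_Finalize();
    return 0;
}
```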
Microbenchmark Level Evaluation – Inter-Node Latency
[Figure: Inter-node latency in us vs. message size (small messages, left; large messages, right) for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR]
• Nehalem systems show higher small-message latency due to the slower clock rate of the machines in our cluster
• Medium to large messages show the true capability of the Nehalem systems
Inter-Node Bandwidth
[Figure: Inter-node unidirectional bandwidth (left) and bidirectional bandwidth (right) in MBps vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR]
• NH-QDR gives up to an 18% improvement in unidirectional bandwidth over HT-QDR
Intra-Node Latency
[Figure: Intra-node latency in us vs. message size (small messages, left; large messages, right) for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR]
• Nehalem systems give up to a 45% improvement in Intra-Node latency
Intra-Node Bandwidth
[Figure: Intra-node unidirectional bandwidth (left) and bidirectional bandwidth (right) in MBps vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR]
• Intra-Node bandwidth and bidirectional bandwidth results show the high memory bandwidth of the Nehalem systems
• The drop in performance for the Nehalem systems at large message sizes is due to cache collisions
Collective Performance – Alltoall Latency
[Figure: MPI_Alltoall latency in us vs. message size (small messages, left; large messages, right) for NH-QDR, NH-DDR and CT-DDR]
• The 32-process Alltoall latency numbers show a 43% to 55% improvement from using a QDR HCA over a DDR HCA (a minimal sketch of the timed Alltoall pattern is shown below)
• Harpertown numbers are not shown due to an insufficient number of nodes
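For reference, the IMB Alltoall measurement essentially times repeated MPI_Alltoall calls across all processes. The C/MPI sketch below is an illustrative approximation of that pattern (the per-peer COUNT and iteration count are example values, not IMB's defaults).

```c
/* Minimal MPI_Alltoall timing sketch in the spirit of IMB Alltoall
 * (illustrative only). Each rank exchanges COUNT ints with every
 * other rank; the average time is reported by rank 0. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT  256   /* ints sent to each peer (example value) */
#define ITERS  100

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sendbuf = malloc((size_t)size * COUNT * sizeof(int));
    int *recvbuf = malloc((size_t)size * COUNT * sizeof(int));
    for (int i = 0; i < size * COUNT; i++)
        sendbuf[i] = rank;

    MPI_Barrier(MPI_COMM_WORLD);           /* line all ranks up first */
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Alltoall(sendbuf, COUNT, MPI_INT,
                     recvbuf, COUNT, MPI_INT, MPI_COMM_WORLD);
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("avg MPI_Alltoall time: %.2f us\n", elapsed * 1e6 / ITERS);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```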
Communication Balance Ratio
• Intra-Node communication gives better performance than Inter-Node communication for small to medium sized messages
• This makes the physical location of a process very important in parallel computing
• A good supercomputing cluster is one where Intra-Node performance is the same as Inter-Node performance (see the ratio definitions below)
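Concretely, the balance ratios plotted on the following slides are, as the axis labels indicate, defined per message size m:

  Latency ratio:                 R_lat(m)  = L_intra(m) / L_inter(m)
  Bandwidth ratio:               R_bw(m)   = BW_intra(m) / BW_inter(m)
  Bidirectional bandwidth ratio: R_bibw(m) = Bi_BW_intra(m) / Bi_BW_inter(m)
  Multipair bandwidth ratio:     R_mp(m)   = MP_BW_intra(m) / MP_BW_inter(m)

A perfectly balanced system keeps each ratio at 1 for all message sizes; a latency ratio below 1 (or a bandwidth ratio above 1) indicates that Intra-Node communication outperforms Inter-Node communication.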
Communication Balance Ratio – Latency
[Figure: Communication balance ratio for latency (L_intra / L_inter) vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR, with a line marking a balanced system]
• The black line indicates the ratio displayed by a balanced system
Communication Balance Ratio – Bandwidth
[Figure: Communication balance ratio for bandwidth (BW_intra / BW_inter) vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR, with a line marking a balanced system]
• The black line indicates the ratio displayed by a balanced system
Communication Balance Ratio – Bidirectional Bandwidth
[Figure: Communication balance ratio for bidirectional bandwidth (Bi_BW_intra / Bi_BW_inter) vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR, with a line marking a balanced system]
Communication Balance Ratio – Multipair Bandwidth
[Figure: Communication balance ratio for multipair bandwidth (MP_BW_intra / MP_BW_inter) vs. message size for HT-QDR, HT-DDR, NH-QDR, NH-DDR and CT-DDR, with a line marking a balanced system]
Application Level Evaluation – HPCC
[Figure: HPCC Minimum Ping Pong Bandwidth in MBps for HT-QDR, HT-DDR, NH-QDR, NH-DDR and the Baseline]
• Baseline numbers are taken on CT-DDR
• NH-DDR shows a 13% improvement in performance over the Harpertown and Clovertown systems
• NH-QDR shows a 38% improvement in performance over the NH-DDR systems
Application Level Evaluation – HPCC (Cont)
[Figure: HPCC Naturally Ordered Ring Bandwidth (left) and Randomly Ordered Ring Bandwidth (right) in MBps for HT-QDR, HT-DDR, NH-QDR, NH-DDR and the Baseline]
• Up to 190% improvement in Naturally Ordered Ring bandwidth for NH-QDR
• Up to 130% improvement in Randomly Ordered Ring bandwidth for NH-QDR
Performance of NAS Benchmarks
[Figure: Normalized execution time of the NAS Parallel Benchmarks CG, FT, IS, LU and MG with 32 processes, Class B (left) and Class C (right), for NH-DDR and NH-QDR]
• NH-QDR shows clear benefits over NH-DDR for multiple applications
Conclusions & Future Work
• Evaluated multiple computational platforms with multiple InfiniBand HCAs
• Nehalem systems with the QDR interconnect give the best performance
• Proposed the communication balance metric to identify the best components for designing next-generation clusters
• Nehalem systems with QDR interconnects offer the best communication balance
• We plan to perform larger-scale evaluations and study the impact of these systems on the performance of end applications