Analyzing Performance and Power of Applications on GPUs with Dell 12G Platforms
Dr. Jeffrey Layton
Enterprise Technologist – HPC
Why GPUs?
• GPUs have very high peak compute capability – roughly 6-9x a CPU
• Challenges:
  › How do you feed them enough data?
  › Applications need to be ported
• Example: Tesla M2090 GPU
  – Cores: 512
  – Memory: 6 GB
  – Memory bandwidth: 177.6 GB/s
• Peak performance
  – Single precision: 1331 GFLOPS
  – Double precision: 665 GFLOPS
[Image: Tesla M2090 GPU]
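These peak numbers follow from cores x clock x FLOPs per cycle. A quick check, assuming the M2090's roughly 1.3 GHz CUDA-core clock (the clock is not stated on this slide): 512 cores x 1.3 GHz x 2 FLOPs/cycle ≈ 1331 GFLOPS single precision, and half of that, ≈ 665 GFLOPS, double precision.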
How can they be used?
• Inside the server
  – Limited space: few can fit
  – Limited power: few can run
  – Difficult to replace
• Outside the server
  – Pros:
    › Flexibility, multiple GPUs
    › GPUs can be shared
    › Multiple host servers
  – Cons:
    › Oversubscription may limit performance
[Figure labels: M610x, C410x, GPU, Host]
The Problem: Best Design Parameters are Unknown
• How many GPUs per server is ideal for my application?
• How much bandwidth do I need per GPU for a typical user?
• How does performance scale with an increasing number of nodes?
• How does performance scale with an increasing number of GPUs per node?
• What problem size is most suitable for GPU computing?
• What is the impact on power consumption and performance/watt?
• Etc., etc.
• Even if you know some of your design parameters, they may change with improved GPUs, CPUs, GPU drivers, or software/algorithm redesigns.
GPU Enabled Product Portfolio
Overview
• GPU-enabled products throughout the portfolio
• Learn on a laptop
• Develop/Test on workstations
• Production on servers
Laptops: Learn GPU programming
• Business laptop: E6520
– Nvidia NVS 4200M
› 48 CUDA cores
› 512MB memory
• XPS 15:
– Nvidia Geforce GT540M
› 96 CUDA cores
› 2GB memory
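Any of these mobile parts is enough to pick up the CUDA programming model. A minimal sketch to get started (the SAXPY kernel, sizes, and launch configuration are illustrative, not taken from the deck):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// y = a*x + y, one thread per element
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;                      // 4 MB per array; fits any GPU above
    size_t bytes = n * sizeof(float);

    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes); cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);   // 256 threads per block
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %.1f (expect 5.0)\n", hy[0]);
    cudaFree(dx); cudaFree(dy); free(hx); free(hy);
    return 0;
}

Build with nvcc and run; nothing beyond a CUDA-capable GPU is required.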
Workstations: Develop/Tune Applications
• Portable workstation – M6600
  – 17" display (1920 x 1080)
– Up to 16GB memory
– Intel quad-core i7 (Ivy Bridge coming soon)
– Up to 3 hard drives
– Quadro 3000M
› 240 CUDA cores
› 2GB GDDR5 memory
– Quadro 5010M
› 96 CUDA cores
› 4GB GDDR5 memory
• T7500:
  – Tower case
  – Dual-socket Intel Westmere (X56xx processors)
  – Up to 192GB memory (12 DIMM slots)
  – Two PCIe Gen2 x16 slots
  – Five internal SATA or SAS drives
    › RAID cards available
  – GigE on-board, optional 10GigE cards
Rackable Workstation: Develop/Tune
• R5500:
– 2U rackable workstation
– Dual-socket Westmere
– 12 DIMM slots (up to 192GB)
– Up to 5 SATA or 6 SAS drives (2.5”)
– Two PCIe Gen2 x16 slots
› Tesla C2070 is an example
Dell M610x blade
• Half-height blade (5U)
• 2S Westmere
• 12 DIMM slots (192GB memory)
• Mezz card for QDR IB, 10GigE
• One double-wide GPU per blade
Dell PowerEdge C410x – Power & Flexibility
• Basically, "room and board" for 16 GPUs
• Theoretical max of 16.5 TFLOPS
• Connects up to 8 hosts
• Connects up to 16 PCIe Gen-2 devices (GPGPUs) to hosts
• High density, 3U chassis.
• Flexibility in selecting the number of GPGPUs per host (see the device-enumeration sketch after this list)
• Individually serviceable modules
• N+1 1400W Power supplies (3+1)
• N+1 92mm Cooling fans (7+1)
• PCIe switches
  – 8x PEX 8647
  – 4x PEX 8696
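Because the chassis can present anywhere from one to eight of its GPUs to a given host, a quick sanity check is to enumerate what the driver on that host actually sees. A minimal CUDA runtime sketch (output formatting is illustrative):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);   // GPUs visible to this host, internal or via the C410x
    printf("Visible CUDA devices: %d\n", count);

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("  [%d] %s: %d SMs, %.1f GB, PCI bus %02x device %02x\n",
               d, p.name, p.multiProcessorCount,
               p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
               p.pciBusID, p.pciDeviceID);
    }
    return 0;
}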
Dell PowerEdge C410x
• Sixteen (16) x16 Gen-2 Modules
  - PCIe Gen-2 x16 compliant
  - Independently serviceable
[Module callouts: board-to-board connector for x16 Gen 2 PCIe signals and power; power connector for the GPGPU card; GPU card; LED and on/off]
Learn more about Dell PEC C410x
• How can you dynamically allocate GPUs to host nodes using the Dell PEC C410x?
• Learn more at session S0309 (Thursday 10:30 Room K)
– Dynamically Allocating GPGPU to Host Nodes (servers)
Dell PEC C6100: Four 2S Nodes in One Chassis
• Four 2-socket nodes in 2U, Intel Westmere-EP
• Each node:
  – 12 DIMM slots
  – 2x GigE (Intel)
  – 1 daughter card (PCIe x8): QDR IB or 10GigE
  – One PCIe x16 slot (half-length, half-height)
  – Optional SAS controller (in place of IB)
• Chassis design:
  – Hot-plug, individual nodes
  – Up to 12 x 3.5" drives (3 per node) or up to 24 x 2.5" drives (6 per node)
  – N+1 power supplies (1100W or 1400W)
• NVIDIA HIC certified
Dell PEC C6145: Two AMD 4S in 2U
• Two 4-socket nodes in 2U, 4S AMD Opteron 6200 series
• Each node:
  – 32x DDR3 RDIMMs
  – 2x GbE (Intel)
  – 1x PCIe x8 Gen 2 (custom mezzanine slot): QDR IB or 10GigE
  – 3x PCIe x16 Gen 2 (low-profile, half-height/half-length)
• Chassis design:
  – Hot-plug, individual nodes
  – 24 x 2.5" or 12 x 3.5" HDDs
  – Redundant power supplies (1100W or 1400W)
• Embedded x16 HIC and slots for additional HICs
Dell PEC C6220: Four 2S Sandy Bridge Nodes in 2U
• Four 2-socket nodes in 2U, Intel Sandy Bridge-EP
• Each node:
  – 16 DIMM slots
  – 2x GigE (Intel)
  – 1 daughter card (PCIe Gen 3 x8): FDR IB, QDR IB, or 10GigE
  – One PCIe Gen 3 x16 slot (half-length, half-height)
    › The two-node version has two PCIe Gen 3 x16 slots
  – Optional SAS controller (in place of IB)
• Chassis design:
  – Hot-plug, individual nodes
  – Up to 12 x 3.5" drives or up to 24 x 2.5" drives
  – N+1 power supplies (1100W or 1400W)
• NVIDIA HIC certified
Dell PowerEdge R720: First Standard Server with Internal GPUs
• Two-socket Intel Sandy Bridge-EP
• 24 DIMM slots (up to 768GB)
• Dell Select Network Adapters:
  – 4x GigE, or 2x 10GigE + 2x GigE
  – Intel or Broadcom
• 7 PCIe Gen 3 slots:
  – Up to two internal GPUs (passive)
  – PCIe Gen 3 x8 slot for a network adapter (e.g., FDR IB)
• Up to 4 front-access, hot-swap PCIe drives; up to 16 drives
• HIC certified to work with the Dell C410x
Performance and Power Measurements
Dell HPC Engineering
• Performing benchmarks/tests of various GPU applications:
  – Different host nodes
    › C6100 and C6145 are Dell 11G
    › C6220 and R720 are Dell 12G
  – Different numbers of GPUs
  – Internal and external GPUs
• Measured performance AND power during the tests (a board-power sampling sketch follows this list)
• Goal is to understand how applications scale with:
  – Number and type of GPUs
  – Host node configuration
• Develop best practices for GPU configurations
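The deck does not say how power was sampled during these runs; if board-level GPU power is sufficient for your own testing, NVML reports it directly. A minimal sampling sketch, assuming an NVML-capable driver and linking with -lnvidia-ml (the device index and 1-second interval are illustrative):

#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

int main(void) {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);          /* first GPU */

    for (int s = 0; s < 10; ++s) {                /* ten one-second samples */
        unsigned int mw = 0;
        if (nvmlDeviceGetPowerUsage(dev, &mw) == NVML_SUCCESS)
            printf("GPU board power: %.1f W\n", mw / 1000.0);
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}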
Applications
• HPL
• NAMD
• XFDTD
• 3D Oil Reservoir Simulation
• ANSYS Mechanical
Thanks!!!!!
• Dr. Saeed Iqbal
• Shawn Gao
• Onur Celebioglu
• Mark Fernandez, Glen Otero
• Nvidia – Mass Fatica, Stan Posey, Peter Lillian, Bob Cravella, Travis Wells, et al.
Dell 11G
Dell PowerEdge C6100 + C410x HPL Normalized Results
[Charts: normalized performance (GFLOPS), normalized power (W), and normalized GFLOPS/W for CPU only and CPU + 1, 2, 4, and 8 M2070 configurations; bar labels: 93.63%, 64.91%, 56.33%, 39.93%, 18.18%]
• One HIC from host node to C410x
• Recommended no more than 2 GPUs per HIC
1-node PE C6100, dual X5670 @ 2.93GHz, 48GB 1333MHz memory; C410x has Nvidia M2070s
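The 2-GPUs-per-HIC guidance is essentially about sharing a single x16 Gen-2 link. A simple way to see what each GPU actually gets is to time a large pinned-memory host-to-device copy; a minimal sketch (buffer size and single-transfer timing are illustrative; run one instance per GPU concurrently to see the effect of sharing a HIC):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time one host-to-device copy on the chosen GPU and report GB/s.
int main(int argc, char **argv) {
    int dev = (argc > 1) ? atoi(argv[1]) : 0;   // GPU index to probe
    cudaSetDevice(dev);

    const size_t bytes = 256UL << 20;           // 256 MB transfer
    void *h, *d;
    cudaMallocHost(&h, bytes);                  // pinned host memory for full PCIe rate
    cudaMalloc(&d, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("GPU %d: host-to-device %.2f GB/s\n", dev, (bytes / 1.0e9) / (ms / 1000.0));

    cudaFreeHost(h); cudaFree(d);
    return 0;
}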
Dell PowerEdge C6145 + C410x HPL Normalized Results
[Charts: normalized performance (GFLOPS) and normalized GFLOPS/W for CPU only and CPU + 1, 2, 4, and 8 M2070 configurations, with separate 1-HIC and 2-HIC series; bar labels include 84.06%, 46.36%, 45.11%, 35.85%, 32.02%, 20.11%, 19.74%, 12.03%]
1-node PE C6145, four 4S AMD Opteron 6200-series processors, 128GB 1333MHz memory; C410x has Nvidia M2070s
• One or Two HICs from host node to C410x
• Recommend no more than 2 GPUs per HIC (1 per HIC is better)
Dell PowerEdge C6100 + C410x NAMD Normalized Results
[Charts: normalized performance (day/ns) and normalized power (W) for CPU only and CPU + 1, 2, 4, and 8 M2070 configurations]
1-node PE C6100, dual X5670 @ 2.93GHz, 48GB 1333MHz memory; C410x has Nvidia M2070s
[Chart: normalized perf/W, (1/(day/ns))/W, for the same configurations]
• STMV data set (1M atoms)
• Two GPUs is best; 4 GPUs is also good (perf/W). One HIC works well
Dell PowerEdge C6100 + C410x XFDTD Normalized Performance
1-node PE C6100, dual X5670 @ 2.93GHz, 48GB 1333MHz memory; C410x has 16 M2070 GPUs
[Chart: normalized performance (1/time) for CPU only and CPU + 1, 2, 4, 6, and 8 M2070 configurations]
• 2-4 GPUs are best
• 2-4 GPUs per x16 HIC
Dell PowerEdge C6100 + C410x 3D Oil Reservoir Simulation (Elastic Model) Normalized Results
[Charts: normalized performance (time), normalized power (W), and normalized perf/W (time*W) for CPU-only vs. GPU runs across four problem sizes: 8.28M/2 GPUs, 16.57M/2 GPUs, 33.15M/4 GPUs, 66.34M/8 GPUs]
1-node PE C6100, dual X5670 @ 2.93GHz, 48GB 1333MHz memory; C410x has 16 M2070 GPUs
• Multiple data sets
• Smaller data sets: 2 GPUs. Larger data sets: 4-8 GPUs (per x16 HIC)
Dell 12G
Dell PowerEdge R720 HPL-Normalized Results
[Charts: normalized performance (GFLOPS), normalized power (W), and normalized GFLOPS/W for R720 CPU only, R720 + 1 M2090, and R720 + 2 M2090; bar labels: 94.1%, 67.8%, 60.2%]
R720, dual Intel E5-2680 @ 2.7GHz, 64GB 1333MHz memory (8x 8GB); two internal M2090 GPUs
• Internal GPUs
• 2 GPUs seems like a good configuration (individual x16 slot)
GFLOPS/Watt Comparison with a 225W Cap (C6100 & R720)
[Chart: GFLOPS/W vs. number of M2090 GPUs (0, 1, 2) with 225W power capping; series: C6100 (X5670 @ 2.93GHz), R720 (E5-2680 @ 2.7GHz), R720 (Intel @ 2.2GHz)]
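The deck does not say how the 225 W cap was applied. On drivers that expose power management, one option is NVML; a hedged sketch (requires root, and support on M2090-era drivers may vary; link with -lnvidia-ml):

#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    /* NVML expresses power limits in milliwatts: 225 W -> 225000 mW. */
    nvmlReturn_t rc = nvmlDeviceSetPowerManagementLimit(dev, 225000);
    if (rc != NVML_SUCCESS)
        printf("Could not set cap: %s\n", nvmlErrorString(rc));

    unsigned int mw = 0;
    nvmlDeviceGetPowerManagementLimit(dev, &mw);
    printf("Current limit: %u mW\n", mw);

    nvmlShutdown();
    return 0;
}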
Dell PowerEdge C6220 + C410x HPL - Normalized Results
[Charts: normalized performance (GFLOPS), normalized power (W), and normalized GFLOPS/W for CPU only and CPU + 1, 2, and 4 M2090 configurations; bar labels: 92.7%, 67.2%, 58.7%, 34.4%]
• External GPUs with single HIC
• Two GPUs seems to be sweet spot (two GPUs with 1 x16 HIC)
Comparison of Internal vs. External HPL - Normalized Performance
• Not quite apples-to-apples (PCIe G2 to G3)
[Charts: normalized HPL performance (GFLOPS) for the R720 with internal GPUs (CPU only, +1, +2 M2090; labels 94.1%, 67.8%, 60.2%) and for external GPUs (CPU only, +1, +2, +4 M2090; labels 92.7%, 67.2%, 58.7%, 34.4%)]
Comparison of Internal vs. External HPL - Normalized GFLOPS/W
[Charts: normalized GFLOPS/W for the R720 with internal GPUs (CPU only, +1, +2 M2090; labels 94.1%, 67.8%, 60.2%) and for external GPUs (CPU only, +1, +2, +4 M2090; labels 92.7%, 67.2%, 58.7%, 34.4%)]
• Not quite apples-to-apples (PCIe G2 to G3)
• Internal GPUs are more effective (perf/W) but external are still very good
Dell PowerEdge R720 ANSYS Mechanical - Normalized Results – 1 core
[Charts: normalized performance (time) and normalized perf/W, (1/time)/W, for benchmark cases V14cg-1 and V14sp-1 through V14sp-6, comparing 1 core vs. 1 core + 1 M2090]
• Large speedup with the GPU (up to 3.6x), but not for all cases
• Perf/W (efficiency) can be quite good (up to 2x better)
Dell PowerEdge R720 ANSYS Mechanical - Normalized Results – 2 cores
[Charts: normalized performance (time) and normalized perf/W, (1/time)/W, comparing 2 cores vs. 2 cores + 1 M2090]
• Less speedup than 1 core case
• Perf/W (efficiency) is still very good
Dell PowerEdge R720 ANSYS Mechanical - Normalized Results – 4 cores
[Charts: normalized performance (time) and normalized perf/W, (1/T)/W, for cases V14cg-1 and V14sp-1 through V14sp-6, comparing 4 cores vs. 4 cores + 1 M2090]
• Less speedup than 1 and 2 core cases
Dell PowerEdge R720 ANSYS Mechanical - Normalized Results – 8 cores
[Charts: normalized performance (time) for 8 cores, 8 cores + 1 M2090, and 8 cores + 2 M2090, and normalized perf/W, (1/T)/W, for 8 cores and 8 cores + 1 M2090]
• Added 2 GPU tests (not much impact on performance)
• Efficiency is not good except for 2 cases (normalized perf/W below 1)
Dell PowerEdge R720 ANSYS Mechanical - Normalized Results – 16 cores
[Charts: normalized performance (speedup) for 16 cores, 16 cores + 1 M2090, and 16 cores + 2 M2090, and normalized perf/W, (1/T)/W, for 16 cores and 16 cores + 1 M2090]
• Very little speed improvement
• Efficiency with GPUs is worse than CPUs only
Dell PowerEdge R720 ANSYS Mechanical – Trends
• Showing the last 3 cases (V14sp-4, V14sp-5, V14sp-6), which make the best use of GPUs
[Charts: normalized performance (1/T) and normalized perf/W, (1/T)/W, vs. core count (1, 2, 4, 8, 16) for cases V14sp-4, V14sp-5, and V14sp-6, with the CPU-only result as a baseline]
ANSYS Mechanical observations
• As the number of cores increased:
– The GPU speedup decreases
– Efficiency decreases (at 16 cores it’s not good)
– Cross-over point is 8 or 16 cores (8 is a good rule of thumb)
• Recommended configuration:
– Small CPU core count (no more than 4) with 1 GPU
› With 2 GPUs in a node, you can run 2 cases at the same time (using 8 cores total); a simple launcher sketch follows this list
– Performance varies with case (solver)
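Running two cases side by side is simplest when each job only sees its own GPU. A hypothetical launcher sketch using the CUDA_VISIBLE_DEVICES environment variable (the wrapper and solver command line are illustrative, not part of the deck or of ANSYS):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Usage: ./launch <gpu_index> <solver> [args...]
   Restricts the child process to one GPU via CUDA_VISIBLE_DEVICES. */
int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <gpu> <cmd> [args]\n", argv[0]);
        return 1;
    }
    setenv("CUDA_VISIBLE_DEVICES", argv[1], 1);   /* e.g. "0" for one job, "1" for the other */
    execvp(argv[2], &argv[2]);                    /* replace this process with the solver */
    perror("execvp");
    return 1;
}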
Summary
• Lots of options for GPU configurations – which one is "best"?
  – How do you define "best"? Performance? Power efficiency? Both?
• Answers vary (they depend on the application)
• You don't need a dedicated x16 slot for each GPU to get good performance and good efficiency
  – Many of the applications shown here illustrate this
• The Dell C410x allows GPU Direct for up to 8 GPUs
  – Other systems do not allow this
Thanks! Questions?
www.dellhpcsolutions.com
www.hpcatdell.com