Analyzing Performance and Power of Applications on GPUs with Dell 12G Platforms
Dr. Jeffrey Layton
Enterprise Technologist – HPC
Why GPUs?
• GPUs have very high peak compute capability – roughly 6-9x a CPU
• Challenges:
  › How do you feed them enough data?
  › Applications need to be ported
• Example: Tesla M2090 GPU
  – Cores: 512
  – Memory: 6 GB
  – Memory bandwidth: 177.6 GB/s
• Peak performance
  – Single precision: 1331 GFLOPS
  – Double precision: 665 GFLOPS
[Image: Tesla M2090 GPU]
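These peak numbers follow from cores x clock x FLOPs per cycle. A quick check, assuming the M2090's roughly 1.3 GHz CUDA-core clock (the clock is not stated on this slide): 512 cores x 1.3 GHz x 2 FLOPs/cycle ≈ 1331 GFLOPS single precision, and half of that, ≈ 665 GFLOPS, double precision.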
How can they be used?
• Inside the server
  – Limited space: few can fit
  – Limited power: few can run
  – Difficult to replace
• Outside the server
  – Pros:
    › Flexibility, multiple GPUs
    › GPUs can be shared
    › Multiple host servers
  – Cons:
    › Oversubscription may limit performance
[Figure labels: M610x, C410x, GPU, Host]
The Problem: Best Design Parameters are Unknown
• How many GPUs per server is ideal for my application?
• How much bandwidth do I need per GPU for a typical user?
• How does performance scale with an increasing number of nodes?
• How does performance scale with an increasing number of GPUs per node?
• What problem size is most suitable for GPU computing?
• What is the impact on power consumption and performance/watt?
• Etc., etc.
• Even if you know some of your design parameters, they may change with improved GPUs, CPUs, GPU drivers, or software/algorithm redesigns.
GPU Enabled Product Portfolio
Overview
• GPU-enabled products throughout the portfolio
• Learn on a laptop
• Develop/Test on workstations
• Production on servers
Laptops: Learn GPU programming
• Business laptop: E6520
– Nvidia NVS 4200M
› 48 CUDA cores
› 512MB memory
• XPS 15:
– Nvidia Geforce GT540M
› 96 CUDA cores
› 2GB memory
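Any of these mobile parts is enough to pick up the CUDA programming model. A minimal sketch to get started (the SAXPY kernel, sizes, and launch configuration are illustrative, not taken from the deck):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// y = a*x + y, one thread per element
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;                      // 4 MB per array; fits any GPU above
    size_t bytes = n * sizeof(float);

    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes); cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);   // 256 threads per block
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %.1f (expect 5.0)\n", hy[0]);
    cudaFree(dx); cudaFree(dy); free(hx); free(hy);
    return 0;
}

Build with nvcc and run; nothing beyond a CUDA-capable GPU is required.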
Workstations: Develop/Tune Applications
• Portable workstation – M6600
  – 17" display (1920 x 1080)
– Up to 16GB memory
– Intel quad-core i7 (Ivy Bridge coming soon)
– Up to 3 hard drives
– Quadro 3000M
› 240 CUDA cores
› 2GB GDDR5 memory
– Quadro 5010M
› 96 CUDA cores
› 4GB GDDR5 memory
• T7500:
  – Tower case
  – Dual-socket Intel Westmere (X56xx processors)
  – Up to 192GB memory (12 DIMM slots)
  – Two PCIe Gen2 x16 slots
  – Five internal SATA or SAS drives
    › RAID cards available
  – GigE on-board, optional 10GigE cards
Rackable Workstation: Develop/Tune
• R5500:
– 2U rackable workstation
– Dual-socket Westmere
– 12 DIMM slots (up to 192GB)
– Up to 5 SATA or 6 SAS drives (2.5”)
– Two PCIe Gen2 x16 slots
› Tesla C2070 is an example
Dell M610x blade
• Half-height blade (5U)
• 2S Westmere
• 12 DIMM slots (192GB memory)
• Mezz card for QDR IB, 10GigE
• One double-wide GPU per blade
Dell PowerEdge C410x – Power & Flexibility
• Basically, "room and board" for 16 GPUs
• Theoretical max of 16.5 TFLOPS
• Connects up to 8 hosts
• Connects up to 16 PCIe Gen-2 devices (GPGPUs) to hosts
• High density, 3U chassis.
• Flexibility in selecting the number of GPGPUs per host (see the device-enumeration sketch after this list)
• Individually serviceable modules
• N+1 1400W Power supplies (3+1)
• N+1 92mm Cooling fans (7+1)
• PCIe switches
  – 8x PEX 8647
  – 4x PEX 8696
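Because the chassis can present anywhere from one to eight of its GPUs to a given host, a quick sanity check is to enumerate what the driver on that host actually sees. A minimal CUDA runtime sketch (output formatting is illustrative):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);   // GPUs visible to this host, internal or via the C410x
    printf("Visible CUDA devices: %d\n", count);

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("  [%d] %s: %d SMs, %.1f GB, PCI bus %02x device %02x\n",
               d, p.name, p.multiProcessorCount,
               p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
               p.pciBusID, p.pciDeviceID);
    }
    return 0;
}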
Dell PowerEdge C410x
• Sixteen (16) x16 Gen-2 Modules
  - PCIe Gen-2 x16 compliant
  - Independently serviceable
[Module callouts: board-to-board connector for x16 Gen 2 PCIe signals and power; power connector for the GPGPU card; GPU card; LED and on/off]
Learn more about Dell PEC C410x
• How can you dynamically allocate GPUs to host nodes using the Dell PEC C410x?
• Learn more at session S0309 (Thursday 10:30 Room K)
– Dynamically Allocating GPGPU to Host Nodes (servers)
Dell PEC C6100: Four 2S Nodes in One Chassis
• Four 2-socket nodes in 2U, Intel Westmere-EP
• Each node:
  – 12 DIMM slots
  – 2x GigE (Intel)
  – 1 daughter card (PCIe x8): QDR IB or 10GigE
  – One PCIe x16 slot (half-length, half-height)
  – Optional SAS controller (in place of IB)
• Chassis design:
  – Hot-plug, individual nodes
  – Up to 12 x 3.5" drives (3 per node) or up to 24 x 2.5" drives (6 per node)
  – N+1 power supplies (1100W or 1400W)
• NVIDIA HIC certified
Dell PEC C6145: Two AMD 4S in 2U
• Two 4-socket nodes in 2U, 4S AMD Opteron 6200 series
• Each node:
  – 32x DDR3 RDIMMs
  – 2x GbE (Intel)
  – 1x PCIe x8 Gen 2 (custom mezzanine slot): QDR IB or 10GigE
  – 3x PCIe x16 Gen 2 (low-profile, half-height/half-length)
• Chassis design:
  – Hot-plug, individual nodes
  – 24 x 2.5" or 12 x 3.5" HDDs
  – Redundant power supplies (1100W or 1400W)
• Embedded x16 HIC and slots for additional HICs
Dell PEC C6220: Four 2S Sandy Bridge Nodes in 2U
• Four 2-socket nodes in 2U, Intel Sandy Bridge-EP
• Each node:
  – 16 DIMM slots
  – 2x GigE (Intel)
  – 1 daughter card (PCIe Gen 3 x8): FDR IB, QDR IB, or 10GigE
  – One PCIe Gen 3 x16 slot (half-length, half-height)
    › The two-node version has two PCIe Gen 3 x16 slots
  – Optional SAS controller (in place of IB)
• Chassis design:
  – Hot-plug, individual nodes
  – Up to 12 x 3.5" drives or up to 24 x 2.5" drives
  – N+1 power supplies (1100W or 1400W)
• NVIDIA HIC certified
Dell PowerEdge R720: First Standard Server with Internal GPUs
• Two-socket Intel Sandy Bridge-EP
• 24 DIMM slots (up to 768GB)
• Dell Select Network Adapters:
  – 4x GigE, or 2x 10GigE + 2x GigE
  – Intel or Broadcom
• 7 PCIe Gen 3 slots:
  – Up to two internal GPUs (passive)
  – PCIe Gen 3 x8 slot for a network adapter (e.g., FDR IB)
• Up to 4 front-access, hot-swap PCIe drives; up to 16 drives
• HIC certified to work with the Dell C410x
Performance and Power Measurements
Dell HPC Engineering
• Performing benchmarks/tests of various GPU applications:
  – Different host nodes
    › C6100 and C6145 are Dell 11G
    › C6220 and R720 are Dell 12G
  – Different numbers of GPUs
  – Internal and external GPUs
• Measured performance AND power during the tests (a board-power sampling sketch follows this list)
• Goal is to understand how applications scale with:
  – Number and type of GPUs
  – Host node configuration
• Develop best practices for GPU configurations
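The deck does not say how power was sampled during these runs; if board-level GPU power is sufficient for your own testing, NVML reports it directly. A minimal sampling sketch, assuming an NVML-capable driver and linking with -lnvidia-ml (the device index and 1-second interval are illustrative):

#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

int main(void) {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);          /* first GPU */

    for (int s = 0; s < 10; ++s) {                /* ten one-second samples */
        unsigned int mw = 0;
        if (nvmlDeviceGetPowerUsage(dev, &mw) == NVML_SUCCESS)
            printf("GPU board power: %.1f W\n", mw / 1000.0);
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}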
Applications
• HPL
• NAMD
• XFDTD
• 3D Oil Reservoir Simulation
• ANSYS Mechanical
Thanks!!!!!
• Dr. Saeed Iqbal
• Shawn Gao
• Onur Celebioglu
• Mark Fernandez, Glen Otero
• Nvidia – Mass Fatica, Stan Posey, Peter Lillian, Bob Cravella, Travis Wells, et al.
Dell 11G
Dell PowerEdge C6100 + C410x HPL Normalized Results
[Charts: normalized performance (GFLOPS), normalized power (W), and normalized GFLOPS/W for CPU only and CPU + 1, 2, 4, and 8 M2070 configurations; bar labels: 93.63%, 64.91%, 56.33%, 39.93%, 18.18%]
• One HIC from host node to C410x
• Recommended no more than 2 GPUs per HIC
1-node PE C6100, dual X5670 @ 2.93GHz, 48GB 1333MHz memory; C410x has Nvidia M2070s
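The 2-GPUs-per-HIC guidance is essentially about sharing a single x16 Gen-2 link. A simple way to see what each GPU actually gets is to time a large pinned-memory host-to-device copy; a minimal sketch (buffer size and single-transfer timing are illustrative; run one instance per GPU concurrently to see the effect of sharing a HIC):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time one host-to-device copy on the chosen GPU and report GB/s.
int main(int argc, char **argv) {
    int dev = (argc > 1) ? atoi(argv[1]) : 0;   // GPU index to probe
    cudaSetDevice(dev);

    const size_t bytes = 256UL << 20;           // 256 MB transfer
    void *h, *d;
    cudaMallocHost(&h, bytes);                  // pinned host memory for full PCIe rate
    cudaMalloc(&d, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("GPU %d: host-to-device %.2f GB/s\n", dev, (bytes / 1.0e9) / (ms / 1000.0));

    cudaFreeHost(h); cudaFree(d);
    return 0;
}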
Dell PowerEdge C6145 + C410x HPL Normalized Results
[Charts: normalized performance (GFLOPS) and normalized GFLOPS/W for CPU only and CPU + 1, 2, 4, and 8 M2070 configurations, with separate 1-HIC and 2-HIC series; bar labels include 84.06%, 46.36%, 45.11%, 35.85%, 32.02%, 20.11%, 19.74%, 12.03%]
1-node PE C6145, four 4S AMD Opteron 6200-series processors, 128GB 1333MHz memory; C410x has Nvidia M2070s
• One or Two HICs from host node to C410x
• Recommend no more than 2 GPUs per HIC (1 per HIC is better)
Dell PowerEdge C6100 + C410x NAMD Normalized Results
[Charts: normalized performance (day/ns) and normalized power (W) for CPU only and CPU + 1, 2, 4, and 8 M2070 configurations]
1-node PE C6100, dual X5670 @ 2.93GHz, 48GB 1333MHz memory; C410x has Nvidia M2070s
[Chart: normalized perf/W, (1/(day/ns))/W, for the same configurations]
• STMV data set (1M atoms)
• Two GPUs is best; 4 GPUs is also good (perf/W). One HIC works well
Dell PowerEdge C6100 + C410x XFDTD Normalized Performance
1-node PE C6100, dual X5670 @ 2.93GHz, 48GB 1333MHz memory; C410x has 16 M2070 GPUs
[Chart: normalized performance (1/time) for CPU only and CPU + 1, 2, 4, 6, and 8 M2070 configurations]
• 2-4 GPUs are best
• 2-4 GPUs per x16 HIC
Dell PowerEdge C6100 + C410x 3D Oil Reservoir Simulation (Elastic Model) Normalized Results
[Charts: normalized performance (time), normalized power (W), and normalized perf/W (time*W) for CPU-only vs. GPU runs across four problem sizes: 8.28M/2 GPUs, 16.57M/2 GPUs, 33.15M/4 GPUs, 66.34M/8 GPUs]
1-node PE C6100, dual X5670 @ 2.93GHz, 48GB 1333MHz memory; C410x has 16 M2070 GPUs
• Multiple data sets
• Smaller data sets: 2 GPUs. Larger data sets: 4-8 GPUs (per x16 HIC)
Dell 12G
Dell PowerEdge R720 HPL-Normalized Results
[Charts: normalized performance (GFLOPS), normalized power (W), and normalized GFLOPS/W for R720 CPU only, R720 + 1 M2090, and R720 + 2 M2090; bar labels: 94.1%, 67.8%, 60.2%]
R720, dual Intel E5-2680 @ 2.7GHz, 64GB 1333MHz memory (8x 8GB); two internal M2090 GPUs
• Internal GPUs
• 2 GPUs seems like a good configuration (individual x16 slot)
GFLOPS/Watt Comparison with a 225W Cap (C6100 & R720)
[Chart: GFLOPS/W vs. number of M2090 GPUs (0, 1, 2) with 225W power capping; series: C6100 (X5670 @ 2.93GHz), R720 (E5-2680 @ 2.7GHz), R720 (Intel @ 2.2GHz)]
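The deck does not say how the 225 W cap was applied. On drivers that expose power management, one option is NVML; a hedged sketch (requires root, and support on M2090-era drivers may vary; link with -lnvidia-ml):

#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    /* NVML expresses power limits in milliwatts: 225 W -> 225000 mW. */
    nvmlReturn_t rc = nvmlDeviceSetPowerManagementLimit(dev, 225000);
    if (rc != NVML_SUCCESS)
        printf("Could not set cap: %s\n", nvmlErrorString(rc));

    unsigned int mw = 0;
    nvmlDeviceGetPowerManagementLimit(dev, &mw);
    printf("Current limit: %u mW\n", mw);

    nvmlShutdown();
    return 0;
}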
Dell PowerEdge C6220 + C410x HPL - Normalized Results
[Charts: normalized performance (GFLOPS), normalized power (W), and normalized GFLOPS/W for CPU only and CPU + 1, 2, and 4 M2090 configurations; bar labels: 92.7%, 67.2%, 58.7%, 34.4%]
• External GPUs with single HIC
• Two GPUs seems to be sweet spot (two GPUs with 1 x16 HIC)
Comparison of Internal vs. External HPL - Normalized Performance
• Not quite apples-to-apples (PCIe G2 to G3)
[Charts: normalized HPL performance (GFLOPS) for the R720 with internal GPUs (CPU only, +1, +2 M2090; labels 94.1%, 67.8%, 60.2%) and for external GPUs (CPU only, +1, +2, +4 M2090; labels 92.7%, 67.2%, 58.7%, 34.4%)]
Comparison of Internal vs. External HPL - Normalized GFLOPS/W
[Charts: normalized GFLOPS/W for the R720 with internal GPUs (CPU only, +1, +2 M2090; labels 94.1%, 67.8%, 60.2%) and for external GPUs (CPU only, +1, +2, +4 M2090; labels 92.7%, 67.2%, 58.7%, 34.4%)]
• Not quite apples-to-apples (PCIe G2 to G3)
• Internal GPUs are more effective (perf/W) but external are still very good
Dell PowerEdge R720 ANSYS Mechanical - Normalized Results – 1 core
[Charts: normalized performance (time) and normalized perf/W, (1/time)/W, for benchmark cases V14cg-1 and V14sp-1 through V14sp-6, comparing 1 core vs. 1 core + 1 M2090]
• Large speedup with the GPU (up to 3.6x), but not for all cases
• Perf/W (efficiency) can be quite good (up to 2x better)
Dell PowerEdge R720 ANSYS Mechanical - Normalized Results – 2 cores
[Charts: normalized performance (time) and normalized perf/W, (1/time)/W, comparing 2 cores vs. 2 cores + 1 M2090]
• Less speedup than 1 core case
• Perf/W (efficiency) is still very good
Dell PowerEdge R720 ANSYS Mechanical - Normalized Results – 4 cores
[Charts: normalized performance (time) and normalized perf/W, (1/T)/W, for cases V14cg-1 and V14sp-1 through V14sp-6, comparing 4 cores vs. 4 cores + 1 M2090]
• Less speedup than 1 and 2 core cases
Dell PowerEdge R720 ANSYS Mechanical - Normalized Results – 8 cores
[Charts: normalized performance (time) for 8 cores, 8 cores + 1 M2090, and 8 cores + 2 M2090, and normalized perf/W, (1/T)/W, for 8 cores and 8 cores + 1 M2090]
• Added 2 GPU tests (not much impact on performance)
• Efficiency is not good except for 2 cases (normalized perf/W below 1)
Dell PowerEdge R720 ANSYS Mechanical - Normalized Results – 16 cores
[Charts: normalized performance (speedup) for 16 cores, 16 cores + 1 M2090, and 16 cores + 2 M2090, and normalized perf/W, (1/T)/W, for 16 cores and 16 cores + 1 M2090]
• Very little speed improvement
• Efficiency with GPUs is worse than CPUs only
Dell PowerEdge R720 ANSYS Mechanical – Trends
• Showing the last 3 cases (V14sp-4, V14sp-5, V14sp-6), which make the best use of GPUs
[Charts: normalized performance (1/T) and normalized perf/W, (1/T)/W, vs. core count (1, 2, 4, 8, 16) for cases V14sp-4, V14sp-5, and V14sp-6, with the CPU-only result as a baseline]
ANSYS Mechanical observations
• As the number of cores increased:
– The GPU speedup decreases
– Efficiency decreases (at 16 cores it’s not good)
– Cross-over point is 8 or 16 cores (8 is a good rule of thumb)
• Recommended configuration:
– Small CPU core count (no more than 4) with 1 GPU
› With 2 GPUs in a node, you can run 2 cases at the same time (using 8 cores total); a simple launcher sketch follows this list
– Performance varies with case (solver)
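Running two cases side by side is simplest when each job only sees its own GPU. A hypothetical launcher sketch using the CUDA_VISIBLE_DEVICES environment variable (the wrapper and solver command line are illustrative, not part of the deck or of ANSYS):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Usage: ./launch <gpu_index> <solver> [args...]
   Restricts the child process to one GPU via CUDA_VISIBLE_DEVICES. */
int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <gpu> <cmd> [args]\n", argv[0]);
        return 1;
    }
    setenv("CUDA_VISIBLE_DEVICES", argv[1], 1);   /* e.g. "0" for one job, "1" for the other */
    execvp(argv[2], &argv[2]);                    /* replace this process with the solver */
    perror("execvp");
    return 1;
}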
Summary
• Lots of options for GPU configurations – which one is "best"?
  – How do you define "best"? Performance? Power efficiency? Both?
• Answers vary (they depend on the application)
• You don't need a dedicated x16 slot for each GPU to get good performance and good efficiency
  – Many of the applications shown here illustrate this
• The Dell C410x allows GPU Direct for up to 8 GPUs
  – Other systems do not allow this
Thanks! Questions?
www.dellhpcsolutions.com
www.hpcatdell.com