Power and Performance Characterization of Computational Kernels on the GPU
Yang Jiao, Heshan Lin, Pavan Balaji (ANL), Wu-chun Feng
synergy.cs.vt.edu


Graphics Processing Units (GPUs) Are Powerful

* Data and image source, http://people.sc.fsu.edu/~jburkardt/latex/ajou_2009_parallel/ajou_2009_parallel.html

GPUs Are Increasingly Popular in HPC

Three of the top five supercomputers are GPU-based

GPUs Are Power Hungry

[Figure: bar chart of thermal design power (watts, axis 0 to 350) for Xeon, GTX280, and Fermi]

It is imperative to investigate Green GPU computing


Green Computing with DVFS on CPUs

Mechanism
• $\text{Power} \propto \text{Voltage}^2 \times \text{Frequency}$

Minimizing performance impact
• Lower voltage and frequency when the CPU is not in the critical path

What about GPUs?
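The proportionality on this slide can be made concrete with a small calculation. The following is a minimal sketch, assuming the relation Power ∝ Voltage² × Frequency holds exactly and ignoring static power; the function name and the scaling factors are hypothetical and chosen only for illustration.

```python
def relative_dynamic_power(voltage_scale: float, frequency_scale: float) -> float:
    """Return power relative to the baseline, using Power ∝ Voltage^2 × Frequency.

    Both arguments are ratios to the baseline operating point (1.0 = unchanged).
    """
    return voltage_scale ** 2 * frequency_scale


# Hypothetical operating point: voltage lowered by 10%, frequency by 20%.
scaled = relative_dynamic_power(voltage_scale=0.9, frequency_scale=0.8)
print(f"relative power: {scaled:.3f}")
```

Because voltage enters quadratically, a modest voltage reduction saves more power than the same relative reduction in frequency, which is why DVFS lowers both together.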


What is this Paper about?

Characterize performance and power for various kernels on GPUs
• Kernels with different compute and memory intensiveness
• Various core and memory frequencies

Contributions
• Reveal unique frequency scaling behaviors on GPUs
• Provide useful hints for green GPU computing

Outline

• Introduction
• GPU Overview
• Characterization Methodology
• Experimental Results
• Conclusion & Future Work

NVIDIA GTX280 Architecture

On-chip memory
• Small sizes
• Fast access

Off-chip memory
• Large size
• High access latency

Device (Global) Memory

OpenCL

• Write once, run on any GPU
• Allow programmers to fully exploit the power of GPUs
• Compute kernel: function executed on a GPU

[Figure: OpenCL Device Abstraction]

GPU Frequency Scaling

Two dimensional
• Compute core frequency and memory frequency

Semi-automatic
• Dynamic configuration not supported
• User can only control peak frequencies
• Automatically switches to idle mode when there is no computation

Details not available to public
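Since only peak frequencies can be set and they cannot change during a run, a characterization has to fix one <core, memory> pair per run and sweep the two-dimensional space. A minimal sketch of enumerating such a sweep follows. The ranges are assumptions (they mirror the axis values in the result plots of this deck), and the step of actually applying a frequency pair is tool specific and deliberately left out.

```python
from itertools import product

# Assumed sweep ranges in MHz; illustrative values, not a vendor API.
CORE_MHZ = range(400, 701, 50)      # 400, 450, ..., 700
MEMORY_MHZ = range(600, 1201, 100)  # 600, 700, ..., 1200


def frequency_pairs(core=CORE_MHZ, memory=MEMORY_MHZ):
    """Return every (core, memory) peak-frequency pair to be measured."""
    return list(product(core, memory))


pairs = frequency_pairs()
print(f"{len(pairs)} configurations, from {pairs[0]} to {pairs[-1]}")
```

Each pair would then be applied once, the kernel run to completion, and performance and power recorded before moving to the next pair.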


Kernel Selection

High performance of GPUs
• Massive parallelism (e.g., 240 cores)
• High memory bandwidth (e.g., 140 GB/s)

Three kernels of computational diversity

[Figure: spectrum from compute intensive to memory intensive, with the kernels Matrix Multiplication, Matrix Transpose, and Fast Fourier Transform (FFT)]

Kernel Characteristics

Memory to compute ratio

$R_{mem} = \dfrac{\#\text{Global\_Memory\_Transactions}}{\#\text{Computation\_Instructions}}$

Instruction throughput

$R_{ins} = \dfrac{\#\text{Computation\_Instructions}}{\text{GPU\_Time}}$
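The two ratios above can be computed directly from profiler counters. The following is a minimal sketch; the counter values are hypothetical, the function names are not from the original work, and the unit of GPU_Time is not stated on the slide, so seconds are assumed here.

```python
def memory_to_compute_ratio(global_memory_transactions: int,
                            computation_instructions: int) -> float:
    """R_mem: global memory transactions per computation instruction."""
    return global_memory_transactions / computation_instructions


def instruction_throughput(computation_instructions: int, gpu_time: float) -> float:
    """R_ins: computation instructions per unit of GPU time."""
    return computation_instructions / gpu_time


# Hypothetical profiler counters for a single kernel launch.
transactions = 56_000
instructions = 1_000_000
gpu_time = 0.5  # assumed to be in seconds

r_mem = memory_to_compute_ratio(transactions, instructions)
r_ins = instruction_throughput(instructions, gpu_time)
print(f"R_mem = {r_mem:.1%}, R_ins = {r_ins:.0f}")
```

A high R_mem indicates a memory-intensive kernel, while a high R_ins indicates that the compute units are kept busy.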


Kernel Profile

Kernel                   Rmem     Rins
Matrix Multiplication    5.6%     203215711
Matrix Transpose         53.7%    12095895
FFT                      8.3%     145165788

Measurement

Performance
• Matrix multiplication, FFT: GFLOPS
• Matrix transpose: MB/s

Energy
• Whole system when executing the kernel on the GPU

Power
• Reported using the average power

Energy Efficiency
• Performance / power
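The metrics on this slide relate to each other in a simple way, sketched below. The power samples, the performance figure, and the run time are hypothetical, and the helper names are illustrative; the sketch assumes the meter yields evenly spaced whole-system readings in watts.

```python
from statistics import mean


def average_power(samples_watts):
    """Average whole-system power over the kernel run, in watts."""
    return mean(samples_watts)


def energy_joules(avg_power_watts: float, elapsed_seconds: float) -> float:
    """Energy consumed over the run: average power times elapsed time."""
    return avg_power_watts * elapsed_seconds


def energy_efficiency(performance: float, avg_power_watts: float) -> float:
    """Performance per watt, e.g. GFLOPS/W or (MB/s)/W depending on the kernel."""
    return performance / avg_power_watts


# Hypothetical meter readings taken while a kernel runs for 10 seconds.
samples = [250.0, 260.0, 270.0]
avg = average_power(samples)
print(f"average power: {avg:.1f} W")
print(f"energy: {energy_joules(avg, 10.0):.0f} J")
print(f"efficiency: {energy_efficiency(130.0, avg):.2f} GFLOPS/W")
```

Because performance is reported in different units per kernel, efficiency values are only comparable across frequency settings of the same kernel, not across kernels.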


Experimental Setup

System
• Intel Core 2 Quad Q6600
• NVIDIA GTX280, 1 GB memory

Power Meter
• Watts Up? Pro ES

Matrix Multiplication - Performance

Mostly affected by core frequency; almost unaffected by memory frequency

[Figure: performance (GFLOPS, axis 85 to 155) vs. GPU core frequency (MHz, 400 to 700), one series for each of 600, 700, 800, 900, 1000, 1100, and 1200 (presumably memory frequency in MHz)]

Matrix Multiplication - Power

Mostly affected by core frequency, slightly affected by memory frequency

[Figure: power (watts, axis 245 to 315) vs. GPU core frequency (MHz, 400 to 700), one series for each of 600, 700, 800, 900, 1000, 1100, and 1200 (presumably memory frequency in MHz)]

Matrix Multiplication - Efficiency

Best efficiency achieved at highest core frequency and relatively high memory frequency

[Figure: power efficiency (MFLOPS/watt, axis 340 to 500) vs. GPU core frequency (MHz, 400 to 700), one series for each of 600, 700, 800, 900, 1000, 1100, and 1200 (presumably memory frequency in MHz)]

Matrix Transpose - Performance

Performance dominated by memory frequency

[Figure: performance (MB/s, axis 150 to 270) vs. GPU core frequency (MHz, 400 to 700), one series for each of 600, 700, 800, 900, 1000, 1100, and 1200 (presumably memory frequency in MHz)]

Matrix Transpose - Power

Higher core frequency increases power consumption (but not performance)

[Figure: power (watts, axis 195 to 240) vs. GPU core frequency (MHz, 400 to 700), one series for each of 600, 700, 800, 900, 1000, 1100, and 1200 (presumably memory frequency in MHz)]

Matrix Transpose - Efficiency

Best efficiency achieved at highest memory frequency and lowest core frequency

[Figure: power efficiency (KBPS/watt, axis 650 to 1250) vs. GPU core frequency (MHz, 400 to 700), one series for each of 600, 700, 800, 900, 1000, 1100, and 1200 (presumably memory frequency in MHz)]

FFT - Performance

Affected by both core and memory frequencies

[Figure: performance (GFLOPS, axis 40 to 85) vs. GPU core frequency (MHz, 400 to 700), one series for each of 600, 700, 800, 900, 1000, 1100, and 1200 (presumably memory frequency in MHz)]

FFT - Power

Affected by both core and memory frequencies

[Figure: power (watts, axis 225 to 285) vs. GPU core frequency (MHz, 400 to 700), one series for each of 600, 700, 800, 900, 1000, 1100, and 1200 (presumably memory frequency in MHz)]

FFT - Efficiency

Best efficiency at highest core and memory frequencies

[Figure: power efficiency (axis 185 to 305, labeled GFLOPS/W in the source) vs. GPU core frequency (MHz, 400 to 700), one series for each of 600, 700, 800, 900, 1000, 1100, and 1200 (presumably memory frequency in MHz)]

FFT – Two Dimensional Effect

[Figure: power (watts) and efficiency (MFLOPS/watt) for the frequency pairs <550, 1200>, <600, 1000>, and <700, 800>; axis 225 to 270; a 7% difference is annotated]

Power and Efficiency Range

[Figure: range of power and efficiency (axis 0% to 45%) for Matrix Multiplication, Matrix Transpose, and FFT]

Conclusion & Future Work

To take away
• Green computing on GPUs is important
• GPU frequency scaling is considerably different from CPU frequency scaling

Next
• Finer-grained level of characterization (e.g., different types of operations)
• Experiments on Fermi and AMD GPUs

Acknowledgment

• NSF Center for High Performance Reconfigurable Computing (CHREC), for their support through NSF I/UCRC Grant IIP-0804155
• National Science Foundation, for their partial support through CNS-0915861 and CNS-0916719

Questions?